Graphs and Social Networks

Shortest Paths? Breadth First Search!

Shortest Paths

Recall that a path in a network $G = (X, E)$ is a sequence $p = (x_0, x_1, \dots, x_k)$ of nodes $x_i \in X$, $i = 0, \dots, k$, such that any pair of consecutive nodes forms an edge in $G$, i.e., $\{x_{i-1}, x_i\} \in E$ for all $i = 1, \dots, k$. The length $l(p)$ of the path $p$ is the number of edges, $l(p) = k$.

In many practical applications it is of interest to find for a pair $x, y$ of nodes, one or all the paths form $x$ to $y$ connecting the two nodes with the fewest number of edges possible. This is a more complex measure on a network than, say, the degree of a node, and we will need a more complex procedure, that is: an algorithm, in order to answer such questions systematically.

Let’s start with a proper definition.

Definition. Let $G = (X, E)$ be a simple graph and let $x, y \in X$. Let $P(x, y)$ be the set of all paths from $x$ to $y$. Then the distance $d(x, y)$ from $x$ to $y$ is

$d(x, y) = \min \{ l(p) : p \in P(x, y) \}$,

the shortest possible length of a path from $x$ to $y$, and a shortest path from $x$ to $y$ is a path $p \in P(x, y)$ of length $l(p) = d(x, y)$.

The diameter $\mathrm{diam}(G)$ of the network $G$ is the length of the longest shortest path between any two nodes,

.

Now we consider the following problem: Given a node $x \in X$, what are the distances $d(x, y)$ for all nodes $y \in X$? A systematic procedure for finding these distances, and the shortest paths through which they are realized, is given by the algorithm which is know in Computer Science as Breadth First Search (BFS).

In order to describe the algorithm step by step, let’s call a node $y$ a neighbor (or friend) of node $x$, if $\{x, y\}$ is an edge, and let’s denote by

the set of all neighbors of node $x$. The algorithm works through the network layer by layer, starting with the given vertex $x$ at layer $0$ and all its friends at layer $1$. It then finds the friends of the friends at layer $2$, and so on, until every node that can be reached from $x$ by a path has been recorded, taking care that no node gets recorded twice. The layer of a node then corresponds to its distance from the given node $x$.

We will see two versions of this algorithm, a simple one first, and then a more efficient one. In both cases, the distance results will be collected in an array $(d_1, \ldots, d_n)$. In the first version, we use a variable $d$ to keep track of the current distance from $x$.

Breadth First Search, Version 0. Given a simple graph $G = (X, E)$ and a vertex $x \in X$, determine $d(x, y)$ for all nodes $y \in X$.

  1. [Initialize.] Suppose that $X = \{x_1, x_2, \ldots, x_n\}$ and that $x = x_j$. Set $d_i \gets \perp$ (undefined) for $i = 1, \dots, n$. Set $d_j \gets 0$ and $d \gets 0$.

  2. [Loop.] For each vertex $x_k \in X$ with $d_k = d$:
    • for each neighbor $x_l \in N(x_k)$ with $d_l = \perp$:
      • set $d_l \gets d + 1$.
  3. [Stop?] Stop if no neighbor $x_l$ with $d_l = \perp$ was found, and return the array $(d_1, \dots, d_n)$. Otherwise set $d \gets d + 1$ and repeat from step 2.

Example. Consider the following network.

bfs1

In order to determine the distance from node $A$ to each node in the network, we set up an array of distance values which initially looks like this:

Now, with distance variable $d = 0$, we find all nodes with distance set to $0$ in the table (that is $A$ only), then find their neighbors (that is nodes $B$, $C$, $D$ and $E$) and select those whose distance value is currently undefined (which is the case for all of $B$, $C$, $D$ and $E$). Setting their distance to $d+1$ gives a new array like this:

Now, we increment the value of $d$ to $1$ and start all over: find all nodes with distance set to $1$ in the table (that is now nodes $B$, $C$, $D$ and $E$), then find their neighbors (that is nodes $A$, $B$, $C$, $F$, $G$, $H$) and select those whose distance value is currently undefined (which is the case for $F$, $G$, and $H$). Setting their distance to $d+1$ gives a new array like this:

Increment $d$ to $2$ and start again: find all nodes with distance set to $2$ in the table (that is now nodes $F$, $G$, $H$), then find their neighbors (that is nodes $B$, $C$, $D$, $E$, $I$ and $J$) and select those whose distance value is currently undefined (which is the case for $I$ and $J$). Setting their distance to $d+1$ gives a new array like this:

Once more, we increment $d$ to $3$, find all nodes with distance set to $3$ in the table (that is now nodes $I$ and $J$), then find their neighbors (that is nodes $F$, $G$, $H$ and $K$) and select those whose distance value is currently undefined (which is the case only for node $K$). Setting their $K$’s distance to $d+1$ gives a new array like this:

Now all the entries in the array are filled and we can stop (although, technically the algorithm instructs us to set up one more round with $d = 4$, only to find that no more neighbors with undefined distance values can be found). The array now contains, for each node $y$ in the connected component of node $A$ the distance $d(A, x)$.

The workings of the algorithm can be used to make a layer-by-layer diagram of the network, as follows.

bfs2

Complexity. It can be shown that this version of the algorithm has complexity $O(n^2)$, meaning that the time it needs to perform its task grows with the square of the size of the network.

In the second version of the algorithm, we use a queue (a first-in first-out buffer) to keep track of the node whose neighbors are currently under consideration. A queue is an array of values that comes with two basis operations: one can push a value to the end of the queue, and one can pop a value of the top of the queue (provided the queue is not empty). It can be shown that this version of the algorithm in the common case a sparse network has complexity $O(n)$, which is as good as one could hope for.

Breadth First Search, Version 1. Given a simple graph $G = (X, E)$ and a vertex $x \in X$, determine $d(x, y)$ for all nodes $y \in X$.

  1. [Initialize.] Suppose that $X = \{x_1, x_2, \ldots, x_n\}$ and that $x = x_j$. Set $d_i \gets \perp$ (undefined) for $i = 1, \dots, n$. Set $d_j \gets 0$ and initialize a queue $Q \gets (x_j)$.

  2. [Loop.] While $Q \neq \emptyset$:
    • pop node $x_k$ off $Q$
    • for each neighbor $x_l$ of $x_k$ with $d_l = \perp$:
      • push $x_l$ onto $Q$ and set $d_l \gets d_k + 1$.
  3. [Stop.] Return the array $(d_1, \dots, d_n)$.

Example. Using the same example as before, we start off with a queue $Q = (A)$ and an array of distance values initialized as

Now we repeat step 2 until we run out of nodes in the queue.

First, $A$ gets popped off the queue (leaving $Q = ()$ empty for now) and its neighbors $B$, $C$, $D$ and $E$ in turn are pushed onto the queue and get distance $1$ assigned. At this point $Q = (B, C, D, E)$ and the distance array looks like

Next, $B$ gets popped off the queue (resulting in $Q = (C, D, E)$) and of its neighbors $A$, $C$ and $F$ only node $F$ is pushed onto the queue, with distance $2$ assigned (that is $B$’s distance $1$, plus $1$). Now $Q = (C, D, E, F)$ and the distance array has

And so on, until finally, with $Q = (K)$, node $K$ is popped off, no new neighbors are added, and the algorithm terminates with an empty queue (and the same completed distance array as before) on the next iteration.

Variants. BFS is an extremely versatile algorithm, which applies in many different situations and can be adapted to produce additional information on a network.

For example, BFS run on a node $x$ in a network $G = (X, E)$ determines the connected component of $X$ in $G$ (as the set of all nodes that get a distance value assigned).

With little more work (and an additional array) BFS can produce a spanning tree (or shortest path tree). Here, whenever node $x_l$ is pushed onto $Q$, it is assigned the current node $x_k$ (in the additional array) as its predecessor on a shortest path from $x_j$ to $x_l$. The subgraph of the network consisting of these edges is a tree. As a tree, it has exactly one path between the given node $x$ and any of its vertices $y$ and, by construction, this path is a shortest path between $x$ and $y$.

Example. Continuing the above example, the egdes of the spanning tree here are highlighted in blue.

bfs3

The corresponding array lists for every node (except for $A$) its parent in the rooted tree with root $A$.

A further small modification of BFS makes it possible to record all shortest paths between the given node $x$ and the vertices $y$ in its connected component.

Of course, in order to find distances, or shortest paths between all pairs of nodes $x$ and $y$ in a network, one can perform BFS for each of the vertices $x \in X$ in turn.

The algorithm and its variants also works on directed networks, but the results then will have to be interpreted in the context of directed networks.

More about BFS can be found in [Newman, Section 10.3].