The Web

Link Analysis and Ranking of Search Results

Ranking documents by importance is a hard problem by any measure. Part of the difficulty of attaching importance to the content of a document comes from general properties of natural language, such as synonymy (there are several ways to say the same thing) and polysemy (one term may have several distinct meanings). In the case of a network of documents, the link structure can be used to provide additional clues for the purpose of ranking or, in the extreme case, to serve as the only source of ranking criteria.

Hubs and Authorities

In a network of nodes connected by directed edges, each node plays two different roles: one as a receiver of links, and one as a sender of links. A first measure of importance, or recognition, of a node in this network might be the number of links it receives, i.e., its in-degree in the underlying graph. If, in a collection of web pages relating to a search query on the subject of “networks”, say, a particular page receives a high number of links, then this page might count as an authority on that subject, with its authority score measured by its in-degree.

In turn, the web pages linking to an authority in some sense know where to find valuable information and thus serve as good “lists” for the subject. A high-value list is called a hub for this query. It makes sense to measure the value of a page as a list in terms of the values of the pages it points at, by assigning as its hub score the sum of the authority scores of those pages.

Example.

[Figure hubs: a network of $9$ hub pages and $7$ authority pages]

Now the authority score of a page could take the hub scores of the list pages into account, by using the sum of the hub scores of the pages that point at it as an updated authority score.

Again applying the Principle of Repeated Improvement, the hub scores can then be updated on the basis of the new authority scores, and so on.

This suggests a ranking procedure that tries to estimate, for each page $p$, its value as an authority and its value as a hub in the form of numerical scores, $a(p)$ and $h(p)$.

Starting off with values all equal to $1$, the estimates are updated by applying the following two rules in an alternating fashion.

Authority Update Rule: For each page $p$, update $a(p)$ to be the sum of the hub scores of all the pages pointing to it.

Hub Update Rule: For each page $p$, update $h(p)$ to be the sum of the authority scores of all the pages that it points to.

In order to keep the numbers from growing too large, the score vectors should be normalized after each step, in a way that replaces $h$ by a scalar multiple $\hat{h} = sh$ such that the entries in $\hat{h}$ add up to $100$, say, representing relative percentage values, and similarly for $a$.

After a number of iterations, the values $a(p)$ and $h(p)$ stabilize, in the sense that further applications of the update rules do not essentially change the relative scores.
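The alternating procedure is easy to express in code. The following is a minimal Python sketch, assuming the network is given as a list of directed edges $(i, j)$ (page $i$ links to page $j$) on pages numbered $0, \dots, n-1$; the function name and data representation are illustrative choices, not taken from the text.

```python
def hits_scores(n, edges, iterations=50):
    """Return (hub, authority) score lists for pages 0, ..., n-1."""
    hub = [1.0] * n                 # start with all scores equal to 1
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority Update Rule: a(p) = sum of hub scores of pages -> p.
        auth = [0.0] * n
        for i, j in edges:
            auth[j] += hub[i]
        # Hub Update Rule: h(p) = sum of authority scores of pages p -> q.
        hub = [0.0] * n
        for i, j in edges:
            hub[i] += auth[j]
        # Normalize so that each score vector adds up to 100 (percentages).
        auth = [100 * x / sum(auth) for x in auth]
        hub = [100 * x / sum(hub) for x in hub]
    return hub, auth

# Tiny made-up example: pages 0 and 1 both link to page 2.
print(hits_scores(3, [(0, 2), (1, 2)]))
# -> ([50.0, 50.0, 0.0], [0.0, 0.0, 100.0])
```

The sketch assumes the network contains at least one link, so that the normalization never divides by zero.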

Example. Continuing the example above, the following table lists the normalized hub scores of the $9$ hub nodes over the first seven iterations and, in the bottom row, the limit values.

[Table hubs-h: normalized hub scores over the first seven iterations, with limit values]

The following table lists the normalized authority scores of the $7$ authority nodes over the first seven iterations and, in the bottom row, the limit values.

[Table hubs-a: normalized authority scores over the first seven iterations, with limit values]

Eventually, the scores in this network converge to the values attached to the nodes in this diagram.

[Figure hubs0: the network with the limit hub and authority scores attached to the nodes]

In terms of matrix algebra this effect can be described as follows.

Spectral Analysis of Hubs and Authorities

Let $M = (m_{ij})$ be the adjacency matrix of the directed graph $G = (X, E)$, where $X = \{x_1, \dots, x_n\}$; that is, $m_{ij} = 1$ if $x_i \to x_j$ and $m_{ij} = 0$ otherwise.
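For illustration, consider a made-up three-node graph (not the example network above) with links $x_1 \to x_2$, $x_1 \to x_3$, $x_2 \to x_3$ and $x_3 \to x_1$; its adjacency matrix is
$$M = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$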

We write $h = (h_1, \dots, h_n)$ for a list of hub scores, with $h_i = h(x_i)$, the hub score of node $x_i$. Similarly, we write $a = (a_1, \dots, a_n)$ for a list of authority scores, with $a_i = a(x_i)$.

The hub update rule can now be expressed as a matrix multiplication:
$$h \leftarrow M a,$$
and similarly, the authority update rule, using the transpose of the matrix $M$:
$$a \leftarrow M^T h.$$

Applying two steps of the procedure at once yields the update rules
$$h \leftarrow M M^T h$$
and
$$a \leftarrow M^T M a$$
for $h$ and $a$, respectively. In the limit, one expects to get vectors $h^{\ast}$ and $a^{\ast}$ whose directions do not change under the latter rules, i.e.,
$$M M^T h^{\ast} = c\, h^{\ast}$$
and
$$M^T M a^{\ast} = d\, a^{\ast}$$
for constants $c$ and $d$, meaning that $h^{\ast}$ and $a^{\ast}$ are eigenvectors of the matrices $M M^T$ and $M^T M$, respectively.

Using the fact that $M M^T$ and $M^T M$ are symmetric matrices ($(M M^T)^T = (M^T)^T M^T = M M^T$), it can indeed be shown that any sequence of hub score vectors $h$ under repeated application of the above update rule converges to a real-valued eigenvector $h^{\ast}$ of $M M^T$ with real eigenvalue $c$. A similar result holds for any sequence of authority score vectors $a$.
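The convergence claim can be checked numerically. The following sketch uses NumPy and the made-up three-node graph from the illustration above to compare the result of repeated two-step hub updates with the dominant eigenvector of $M M^T$.

```python
import numpy as np

# Adjacency matrix of the illustrative graph x1 -> x2, x1 -> x3,
# x2 -> x3, x3 -> x1.
M = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

h = np.ones(3)                   # initial hub scores, all equal
for _ in range(100):
    h = M @ M.T @ h              # two-step hub update: h <- M M^T h
    h /= h.sum()                 # normalize to keep the entries bounded

# Dominant eigenvector of the symmetric matrix M M^T.
vals, vecs = np.linalg.eigh(M @ M.T)
v = np.abs(vecs[:, -1])          # eigenvector for the largest eigenvalue
v /= v.sum()

print(np.allclose(h, v))         # expected: True
```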

PageRank

A simpler model of endorsement for web pages involves only one numerical value $r(p)$ per page $p$, built on the principle that a page is as important as the pages linking to it. As before, these importance values can be obtained by repeatedly applying a suitable update rule to a set of current values.

Specifically, PageRank is computed as follows.

  • If the network has $n$ nodes, each page $p$ receives an initial PageRank of $r(p) = 1/n$.

  • Choose a number of steps, $k$.

  • Perform the following update rule $k$ times.

Basic PageRank Update Rule: Each page divides its current PageRank by the number of pages it links to, and passes this value on to those pages. In this way, each page updates its PageRank to be the sum of all the shares it receives from the pages linking to it.

As the total PageRank of all nodes is maintained in each step (each node splits its PageRank into equal parts and passes these on, so nothing is lost or gained overall), there is no need for normalization.

After a number of steps, the PageRank values of the individual nodes stabilize. These values form an equilibrium in the sense that another application of the update rule will produce exactly the same values.
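Here is a minimal Python sketch of this procedure, assuming the network is given as a dictionary mapping each page to the list of pages it links to, and that every page has at least one outgoing link (so that the total PageRank really is conserved); the names are again illustrative choices.

```python
def pagerank(graph, k=50):
    """Return the PageRank values after k applications of the update rule."""
    n = len(graph)
    rank = {p: 1 / n for p in graph}       # initial PageRank of 1/n each
    for _ in range(k):
        new_rank = {p: 0.0 for p in graph}
        for p, links in graph.items():
            share = rank[p] / len(links)   # split current PageRank equally
            for q in links:
                new_rank[q] += share       # pass the shares along the links
        rank = new_rank
    return rank

# Tiny made-up example: x1 -> x2, x1 -> x3, x2 -> x3, x3 -> x1.
print(pagerank({'x1': ['x2', 'x3'], 'x2': ['x3'], 'x3': ['x1']}))
# -> values close to {'x1': 0.4, 'x2': 0.2, 'x3': 0.4}
```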

Example. The following graph represents a network of $8$ web pages with hyperlinks.

[Figure pagerank: a network of $8$ web pages with hyperlinks]

The following table shows how the initial PageRank of $\frac18$ for each page evolves over six iterations of the basic PageRank update rule and, in the bottom row, lists the limit values.

[Table pagerank-p: PageRank values over the first six iterations, with limit values]

In terms of matrix algebra this effect can be described as follows.

Spectral Analysis of PageRank

Here, we use a variant of the adjacency matrix $M$ of the directed graph $G = (X, E)$. Let $N$ be the $n \times n$ matrix with entries $N_{ij} = 0$ if there is no link from node $x_i$ to node $x_j$ (as for the adjacency matrix $M$), and $N_{ij} = 1/l_i$ if $x_i \to x_j$, where $l_i$ is the number of links out of $x_i$.
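For illustration, with the made-up three-node graph used earlier ($x_1 \to x_2$, $x_1 \to x_3$, $x_2 \to x_3$, $x_3 \to x_1$) we have $l_1 = 2$ and $l_2 = l_3 = 1$, so
$$N = \begin{pmatrix} 0 & \tfrac12 & \tfrac12 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$
Since every node in this graph has at least one outgoing link, each row of $N$ adds up to $1$, reflecting the fact that each page passes on its entire PageRank.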

Write $r = (r_1, \dots, r_n)$ for the list of PageRank values of the nodes $X = \{x_1, \dots, x_n\}$. Then the basic PageRank update rule can be expressed as a matrix multiplication:
$$r \leftarrow N^T r,$$
where $N^T$ is the transpose of the matrix $N$.

It can be shown that repeated application of the basic PageRank update rule lets the PageRank values converge to a vector $r^{\ast}$ with the property
$$N^T r^{\ast} = r^{\ast},$$
that is, $r^{\ast}$ is an eigenvector of $N^T$ for the eigenvalue $1$. The argument uses a theorem from linear algebra, the Perron-Frobenius Theorem, which, for a matrix with non-negative entries (such as the matrix $N^T$), guarantees the existence of a real eigenvalue with a corresponding eigenvector whose entries are non-negative (a property that does not hold for matrices in general).
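This fixed-point property, too, can be checked numerically. The following sketch iterates the update $r \leftarrow N^T r$ with NumPy on the matrix $N$ of the three-node illustration above and confirms that the limit vector is (numerically) unchanged by $N^T$.

```python
import numpy as np

# The matrix N of the illustrative graph x1 -> x2, x1 -> x3,
# x2 -> x3, x3 -> x1.
N = np.array([[0, 0.5, 0.5],
              [0, 0,   1  ],
              [1, 0,   0  ]])

r = np.ones(3) / 3                  # initial PageRank of 1/n per page
for _ in range(200):
    r = N.T @ r                     # basic PageRank update: r <- N^T r

print(r)                            # approx. [0.4, 0.2, 0.4]
print(np.allclose(N.T @ r, r))      # expected: True
```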