Saturday, December 20, 2014

Classic Algorithms: Page Rank and the importance of web pages

Think of a search query, and go to your favorite search engine and conduct the search. Of the millions of matching documents returned, it is most likely that the first page or two contain what you are looking for. Have you ever wondered how the search engines know how to present the most relevant results for us in the first couple of pages, when all the results returned match the query we were looking for?

There are a lot of signals search engines use to determine the importance of the matched results: some are intrinsic to the page such as where in the page the match occurs, whether it is highlighted or not, and the importance of the web page; and some are extrinsic such as how many users clicked on the result for a similar query. The signals, their weights, and the formulas to rank the results are usually guarded secrets by search engines as they are the secret sauce for giving users the best results to their queries.

One such signal is the importance of a web page. How would you compute that? In 1996, Larry Page and Sergey Brin came up with a way called PageRank. The idea they proposed is that in a connected graph structure such as the world wide web, the importance of a web page is related to the importance of other web pages that point to it. If we denote the importance of a web page, or its PageRank by $r_i$, Page and Brin proposed that \[ r_i = \Sigma_{j \in N} \frac{r_j}{d_j}\] where $N$ is the set of pages that link to page $i$, $r_j$ is the PageRank for each of the pages that point to $r_i$, and $d_j$ is the number of outgoing links for each $r_j$ page.

If we write these equations for every node on the web, we come up with a system of simultaneous linear equations that we can solve, and get the PageRank for each web page. We can also write the simultaneous equations in vector and matrix notations as:
 \[ \mathbf{r} = \mathbf{M r} \]
where $\mathbf{r}$ is the vector of PageRanks for all the web pages, and $\mathbf{M}$ is the matrix whose column entries are either $0$ or $\frac{1}{d_j}$ for pages that link to page $i$.

It turns out that by rewriting the PageRank problem in this form, we discover that it resembles the eigenvalue/eigenvector problem in linear algebra, with the PageRank $\mathbf{r}$ as the eigenvector corresponding to an eigenvalue of $1$. There are many ways to solve the eigenvalue problem, and the power method is one of them: where we start with an initial value of the PageRank $\mathbf{r}^{(0)}$, and compute subsequent values through:
\[ \mathbf{r}^{(t+1)} = \mathbf{M} \mathbf{r}^{(t)}\]
until convergence occurs, i.e. the difference between the PageRanks in two subsequent iterations becomes very small:
 \[ \| \mathbf{r}^{(t+1)} - \mathbf{r}^{(t)} \|_n < \epsilon \]
where $\|\cdots\|_n$ is the $L_n$ norm, and $\epsilon > 0$. Typically the Euclidean or $L_2$ norm is used.

There is a wrinkle in the simultaneous equations or eigenvalue/eigenvector formulations of the PageRank though: not all simultaneous equations or eigenvalue/eigenvector problems are well posed to have a solution. This happens for example when you have linear dependence between the equations or rows/columns of the matrix. An illustrative example would be a sink page, which does not have any outbound links to any other pages. Its effect on the PageRanks of other pages would be zero, and all the respective columns in the matrix $\mathbf{M}$ associated with the page would be $0$. Another example would be a page that links to itself, where during the power iteration, all the other page PageRanks are leaked to $0$. How would you tackle such problems? Through a bit of creativity.

It turns out that if we model browsing the Internet as a random walk over its connected graph structure,  where a random surfer moves from page $i$ to another page, by following one of the outbound links from page $i$, we arrive at a similar formulation

\[ \mathbf{r}=\mathbf{Mr} \]
as before.

Of course the random surfer would not stop when there are no outbound links in the page they are at, and would whimsically go to any other page on the Internet. The model is then augmented to accommodate such behavior:

\[ \mathbf{r} = \beta \mathbf{Mr} + \frac{(1-\beta)}{N}\mathbf{r} \]

where $0<\beta<1$, and $N$ is the total number of pages in the web graph. The first term in the equation models the random surfer picking one of the outbound links at random to visit a subsequent page, and the second term models them jumping to any other page on the Internet at random.

We can rewrite the prior formulation as:

\[\mathbf{r} = \mathbf{M'r} \]

which is well formed, and has a solution since $\mathbf{M'}$ is a stochastic matrix. Using the power method on the last equation will yield the PageRanks $\mathbf{r}$ of all the web pages in the Internet.

Of course that computation requires a lot of engineering magic, since there are billions of web pages on the Internet, and each have many outbound connections.

No comments :

Post a Comment