Classic Algorithms: Page Rank and the importance of web pages

Think of a search query, go to your favorite search engine, and conduct the search. Of the millions of matching documents returned, it is most likely that the first page or two contain what you are looking for. Have you ever wondered how search engines know to present the most relevant results in the first couple of pages, when every result returned matches the query we were looking for?

There are a lot of signals search engines use to determine the importance of the matched results: some are intrinsic to the page, such as where in the page the match occurs, whether it is highlighted or not, and the importance of the web page; and some are extrinsic, such as how many users clicked on the result for a similar query. The signals, their weights, and the formulas used to rank the results are usually closely guarded secrets, as they are the search engines' secret sauce for giving users the best results for their queries.

One such signal is the importance of a web page. How would you compute that? In 1996, Larry Page and Sergey Brin came up with a way called PageRank. The idea they proposed is that in a connected graph structure such as the world wide web, the importance of a web page is related to the importance of the other web pages that point to it. If we denote the importance of a web page $i$, or its PageRank, by $r_i$, Page and Brin proposed that \[ r_i = \sum_{j \in N_i} \frac{r_j}{d_j}\] where $N_i$ is the set of pages that link to page $i$, $r_j$ is the PageRank of each page $j$ that points to page $i$, and $d_j$ is the number of outgoing links of page $j$.
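As a concrete illustration with made-up pages: if page $i$ is linked to only by page $j$, which has two outgoing links, and page $k$, which has three, then
\[ r_i = \frac{r_j}{2} + \frac{r_k}{3} \]
so a link from a page that links to few others carries more weight than a link from a page that links to many.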

If we write these equations for every node on the web, we come up with a system of simultaneous linear equations that we can solve, and get the PageRank for each web page. We can also write the simultaneous equations in vector and matrix notations as:
 \[ \mathbf{r} = \mathbf{M r} \]
where $\mathbf{r}$ is the vector of PageRanks for all the web pages, and $\mathbf{M}$ is the matrix whose entry $M_{ij}$ is $\frac{1}{d_j}$ if page $j$ links to page $i$, and $0$ otherwise.
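As a minimal sketch of what $\mathbf{M}$ looks like, the following Python/NumPy snippet builds the matrix for a hypothetical three-page graph (the graph and the variable names are illustrative assumptions, not part of the original formulation):

import numpy as np

# A hypothetical three-page web graph, made up for illustration:
# page 0 links to pages 1 and 2, page 1 links to page 2, page 2 links to page 0.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

M = np.zeros((n, n))
for j, outlinks in links.items():
    d_j = len(outlinks)                # number of outgoing links of page j
    for i in outlinks:
        M[i, j] = 1.0 / d_j            # column j holds 1/d_j for every page j links to

print(M)
# [[0.  0.  1. ]
#  [0.5 0.  0. ]
#  [0.5 1.  0. ]]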

It turns out that by rewriting the PageRank problem in this form, we discover that it resembles the eigenvalue/eigenvector problem in linear algebra, with the PageRank $\mathbf{r}$ as the eigenvector of $\mathbf{M}$ corresponding to an eigenvalue of $1$. There are many ways to solve the eigenvalue problem, and the power method is one of them: we start with an initial value of the PageRank $\mathbf{r}^{(0)}$, and compute subsequent values through:
\[ \mathbf{r}^{(t+1)} = \mathbf{M} \mathbf{r}^{(t)}\]
until convergence occurs, i.e. the difference between the PageRanks in two subsequent iterations becomes very small:
 \[ \| \mathbf{r}^{(t+1)} - \mathbf{r}^{(t)} \|_n < \epsilon \]
where $\|\cdot\|_n$ is the $L_n$ norm, and $\epsilon > 0$. Typically the Euclidean or $L_2$ norm is used.
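A minimal sketch of this power iteration in Python/NumPy, assuming a column-stochastic $\mathbf{M}$ such as the one from the previous sketch and the $L_2$ norm as the convergence test (the function name and tolerance are illustrative choices):

import numpy as np

def pagerank_power(M, eps=1e-8, max_iter=100):
    """Plain power iteration r <- M r, stopping when the L2 change drops below eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)                    # start from a uniform PageRank vector
    for _ in range(max_iter):
        r_next = M @ r
        if np.linalg.norm(r_next - r) < eps:   # Euclidean (L2) norm by default
            return r_next
        r = r_next
    return r

# With the three-page matrix M from the previous sketch this converges to
# approximately [0.4, 0.2, 0.4]: pages 0 and 2 come out as the most important.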

There is a wrinkle in the simultaneous equations and eigenvalue/eigenvector formulations of PageRank though: not every such system is well posed, with a meaningful solution guaranteed to exist. This happens, for example, when there is linear dependence between the equations, or between the rows/columns of the matrix. An illustrative example is a sink page, which has no outbound links to any other pages. Its effect on the PageRanks of other pages is zero, and the column of the matrix $\mathbf{M}$ associated with the page is all $0$s, so PageRank leaks out of the system during the power iteration. Another example is a page that links only to itself: during the power iteration it soaks up all the importance, and the PageRanks of all the other pages leak away to $0$. How would you tackle such problems? Through a bit of creativity.

It turns out that if we model browsing the web as a random walk over its connected graph structure, where a random surfer moves from page $i$ to another page by following one of the outbound links of page $i$, we arrive at the same formulation

\[ \mathbf{r}=\mathbf{Mr} \]

as before, with $\mathbf{r}$ now interpreted as the stationary distribution of the walk.

Of course the random surfer would not stop when there are no outbound links on the page they are on; they would whimsically jump to any other page on the Internet, and every so often they might do so even when outbound links exist. The model is then augmented to accommodate such behavior:

\[ \mathbf{r} = \beta \mathbf{Mr} + \frac{(1-\beta)}{N}\mathbf{1} \]

where $0<\beta<1$, $N$ is the total number of pages in the web graph, $\mathbf{1}$ is the vector of all ones, and the PageRanks are normalized to sum to $1$. The first term in the equation models the random surfer picking one of the outbound links at random to visit a subsequent page, and the second term models them jumping to any other page on the Internet at random. A typical choice for the damping factor is $\beta \approx 0.85$.

We can rewrite the prior formulation as:

\[\mathbf{r} = \mathbf{M'r} \]

where $\mathbf{M'} = \beta \mathbf{M} + \frac{1-\beta}{N}\mathbf{E}$ and $\mathbf{E}$ is the matrix of all ones (with sink pages treated as linking to every page). This formulation is well posed and has a unique solution, since $\mathbf{M'}$ is a column-stochastic matrix with strictly positive entries. Using the power method on the last equation will yield the PageRanks $\mathbf{r}$ of all the web pages on the Internet.
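A minimal sketch of the damped power iteration, assuming $\beta = 0.85$ and the same NumPy setup as before; re-normalizing $\mathbf{r}$ after each step is one common way to restore rank that leaks out of sink pages, an implementation choice rather than part of the derivation above:

import numpy as np

def pagerank(M, beta=0.85, eps=1e-8, max_iter=100):
    """Power iteration on M' = beta*M + (1-beta)/N * E, without forming M' explicitly."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Follow a link with probability beta; teleport to a random page otherwise.
        r_next = beta * (M @ r) + (1.0 - beta) / n
        r_next /= r_next.sum()          # restore any rank that leaked out of sink pages
        if np.linalg.norm(r_next - r) < eps:
            return r_next
        r = r_next
    return r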

Of course, that computation requires a lot of engineering magic, since there are billions of web pages on the Internet, and each has many outbound links.

