Saturday, December 27, 2014

Randomized algorithms: Verifying matrix multiplication

Another great example of randomized algorithms in the introductory chapters of the book "Probability and Computing" is verifying matrix multiplication. Suppose we have 3 matrices of compatible dimensions $A, B$, and $C$, and we want to verify that
\[A\cdot B = C\]

For simplicity, let's assume that all the matrices are square, and of dimension $n \times n$. The straightforward way to do the verification is to explicitly multiply the matrices $A$ and $B$ together and compare the result with $C$, an operation that takes on the order of $O(n^3)$ time.

As an aside, we can do better than $O(n^3)$ for matrix multiplication for larger matrices. Wikipedia has an excellent writeup on Strassen's algorithm, which accomplishes matrix multiplication in $O(n^{2.8074})$ through divide and conquer. First the matrices $A$, $B$, and $C$ are padded with zero rows and columns so that their dimensions are of the form $2^m \times 2^m$. The matrices are then divided into equally sized block matrices of the form:
\[ A = \left[ \begin{array}{cc} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{array} \right] \]
\[ B = \left[ \begin{array}{cc} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{array} \right] \]
\[ C = \left[ \begin{array}{cc} C_{1,1} & C_{1,2} \\  C_{2,1} & C_{2,2} \end{array} \right] \]
Then
\[ C_{i,j} = \Sigma_{k=1}^2 A_{i,k} B_{k,j} \]
The trick of Strassen's algorithm is to reduce the number of required block multiplications from eight to seven by defining the auxiliary matrices $M_i$ such that:
\[ \begin{eqnarray}
M_1 & = & (A_{1,1}+A_{2,2})(B_{1,1}+B_{2,2}) \\
M_2 & = & (A_{2,1}+A_{2,2}) B_{1,1} \\
M_3 & = & A_{1,1}(B_{1,2}-B_{2,2}) \\
M_4 & = & A_{2,2}(B_{2,1}-B_{1,1}) \\
M_5 & = & (A_{1,1}+A_{1,2})B_{2,2}\\
M_6 & = & (A_{2,1} - A_{1,1})(B_{1,1}+B_{1,2}) \\
M_7 & = & (A_{1,2} - A_{2,2})(B_{2,1}+B_{2,2})
\end{eqnarray}
\]

which require only seven multiplications between blocks, and then expressing the result in terms of the $M$ matrices as:
\[
\begin{eqnarray}
C_{1,1} & = & M_1 + M_4 - M_5 + M_7\\
C_{1,2} & = & M_3 + M_5 \\
C_{2,1} & = & M_2 + M_4 \\
C_{2,2} & = & M_1 - M_2 + M_3 + M_6
\end{eqnarray}
\]

The algorithm continues to divide the blocks in the same manner, applying the same scheme recursively to the smaller products, until the blocks contain a single element and the multiplication becomes straightforward.
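To make the recursion concrete, here is a minimal sketch in Python using NumPy. It assumes the inputs have already been padded to a power-of-two dimension; the function name strassen and the cutoff for switching to direct multiplication are illustrative choices, not part of the Wikipedia writeup.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices A and B (dimension a power of two)
    using Strassen's seven-multiplication recursion."""
    n = A.shape[0]
    if n <= cutoff:                      # small blocks: multiply directly
        return A @ B

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # The seven auxiliary block products M1..M7
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    # Reassemble the four blocks of C
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])
```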

There are faster algorithms for matrix multiplication, such as the Coppersmith–Winograd algorithm with complexity of $O(n^{2.375477})$ and yet faster ones with $O(n^{2.3728639})$, but what if these are not fast enough?

Randomized algorithms come to the rescue, in the form of Freivalds' algorithm. The idea is to choose a vector $r$ of dimension $n$, whose elements $r_i$ are picked independently and uniformly from $\{0,1\}$, and compute the quantities $ABr$ and $Cr$. Computing $ABr$ as $A(Br)$, two matrix-vector products, takes only $O(n^2)$ operations, as does computing $Cr$. If $ABr \ne Cr$ then $AB \ne C$ and the verification of matrix multiplication is done.

On the other hand, if $ABr = Cr$ then either $AB=C$ or we have a false positive. A false positive occurs when $(AB-C)r=0$ even though $AB-C \ne 0$. The book shows that the probability of this happening is at most $\frac{1}{2}$. The argument goes as follows.

Define the matrix $D=AB-C$. Since $D \ne 0$ in the false positive case, it must contain a non-zero element somewhere; assume this element is $d_{11}$. We can then write $Dr=0$ as the set of $n$ simultaneous equations:

\[ \Sigma_{j=1}^{n} d_{ij}r_j=0 \]

Solving the first of these equations for $r_1$ in the false positive case gives

\[ r_1 = -\frac{\Sigma_{j=2}^{n} d_{1j}r_j}{d_{11}} \]

Using the principle of deferred decisions we can bound the probability of the false positive. We start by selecting $r_i$ for $i \in \{2, \ldots, n\}$ randomly and independently from the set $\{0,1\}$, and we pause before selecting $r_1$ from the same set. Since $r_1$ can take one of two values, $0$ or $1$, and at most one of those values can equal the right-hand side of the prior equation, the probability of selecting an $r_1$ that gives us the false positive case is at most $\frac{1}{2}$.

Since a false positive probability of $\frac{1}{2}$ is not good enough, we can repeat the algorithm with a new, independently chosen random vector $r$. After $k$ repetitions of the algorithm the false positive probability drops to $\frac{1}{2^k}$, which we can make as small as we like with enough repetitions. The cost of the verification algorithm is on the order of $O(kn^2)$ multiplications, which for modest $k$ is better than the fastest known algorithm for explicit matrix multiplication.
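A minimal sketch of the verification in Python (with NumPy) follows; the function name verify_product, the default of $k=20$ trials, and the exact-equality comparison (appropriate for integer matrices) are illustrative assumptions rather than the book's presentation.

```python
import numpy as np

def verify_product(A, B, C, k=20, rng=None):
    """Return False if A @ B != C is detected; otherwise return True,
    which is wrong with probability at most 1/2**k."""
    rng = rng or np.random.default_rng()
    n = C.shape[0]
    for _ in range(k):
        r = rng.integers(0, 2, size=n)    # each r_i drawn uniformly from {0, 1}
        # A(Br) and Cr are three matrix-vector products, O(n^2) work in total
        if not np.array_equal(A @ (B @ r), C @ r):
            return False                  # witness found: AB != C for sure
    return True                           # probably AB == C
```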



Saturday, December 20, 2014

Classic Algorithms: Page Rank and the importance of web pages

Think of a search query, and go to your favorite search engine and conduct the search. Of the millions of matching documents returned, it is most likely that the first page or two contain what you are looking for. Have you ever wondered how the search engines know how to present the most relevant results for us in the first couple of pages, when all the results returned match the query we were looking for?

There are a lot of signals search engines use to determine the importance of the matched results: some are intrinsic to the page, such as where in the page the match occurs, whether it is highlighted or not, and the importance of the web page; and some are extrinsic, such as how many users clicked on the result for a similar query. The signals, their weights, and the formulas to rank the results are usually closely guarded secrets, as they are the search engines' secret sauce for giving users the best results for their queries.

One such signal is the importance of a web page. How would you compute that? In 1996, Larry Page and Sergey Brin came up with a way called PageRank. The idea they proposed is that in a connected graph structure such as the world wide web, the importance of a web page is related to the importance of other web pages that point to it. If we denote the importance of a web page, or its PageRank, by $r_i$, Page and Brin proposed that \[ r_i = \Sigma_{j \in N} \frac{r_j}{d_j}\] where $N$ is the set of pages that link to page $i$, $r_j$ is the PageRank of each of the pages that point to page $i$, and $d_j$ is the number of outgoing links on each such page $j$.

If we write these equations for every node on the web, we come up with a system of simultaneous linear equations that we can solve, and get the PageRank for each web page. We can also write the simultaneous equations in vector and matrix notations as:
 \[ \mathbf{r} = \mathbf{M r} \]
where $\mathbf{r}$ is the vector of PageRanks for all the web pages, and $\mathbf{M}$ is the matrix whose entry $M_{ij}$ is $\frac{1}{d_j}$ if page $j$ links to page $i$, and $0$ otherwise.
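As a concrete illustration, here is a small sketch in Python (NumPy) of building $\mathbf{M}$ from an adjacency list; the links dictionary mapping each page to the pages it links to is a hypothetical input format chosen for the example:

```python
import numpy as np

def build_M(pages, links):
    """M[i, j] = 1/d_j if page j links to page i, and 0 otherwise."""
    index = {page: i for i, page in enumerate(pages)}
    M = np.zeros((len(pages), len(pages)))
    for j, outgoing in links.items():
        d_j = len(outgoing)               # number of outbound links of page j
        for target in outgoing:
            M[index[target], index[j]] = 1.0 / d_j
    return M
```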

It turns out that by rewriting the PageRank problem in this form, we discover that it is an eigenvalue/eigenvector problem from linear algebra, with the PageRank $\mathbf{r}$ as the eigenvector corresponding to an eigenvalue of $1$. There are many ways to solve the eigenvalue problem, and the power method is one of them: we start with an initial value of the PageRank $\mathbf{r}^{(0)}$, and compute subsequent values through:
\[ \mathbf{r}^{(t+1)} = \mathbf{M} \mathbf{r}^{(t)}\]
until convergence occurs, i.e. the difference between the PageRanks in two subsequent iterations becomes very small:
 \[ \| \mathbf{r}^{(t+1)} - \mathbf{r}^{(t)} \|_n < \epsilon \]
where $\|\cdots\|_n$ is the $L_n$ norm, and $\epsilon > 0$. Typically the Euclidean or $L_2$ norm is used.
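A minimal power-iteration sketch in Python (NumPy), using the $L_2$ norm for the convergence test; the tolerance eps, the iteration cap, and the uniform starting vector are illustrative assumptions:

```python
import numpy as np

def power_iteration(M, eps=1e-9, max_iter=1000):
    """Iterate r <- M r until the L2 norm of the change drops below eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)              # start from a uniform PageRank vector
    for _ in range(max_iter):
        r_next = M @ r
        if np.linalg.norm(r_next - r, 2) < eps:
            return r_next
        r = r_next
    return r
```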

There is a wrinkle in the simultaneous equations or eigenvalue/eigenvector formulations of PageRank though: not every such problem is well posed enough to have a unique, meaningful solution. This happens, for example, when there is linear dependence between the equations or between the rows/columns of the matrix. An illustrative example would be a sink page, which has no outbound links to any other pages: its effect on the PageRanks of other pages would be zero, and the column of $\mathbf{M}$ associated with it would be all $0$s. Another example would be a page that links only to itself, where during the power iteration all the other pages' PageRanks leak away to $0$. How would you tackle such problems? Through a bit of creativity.

It turns out that if we model browsing the Internet as a random walk over its connected graph structure, where a random surfer moves from page $i$ to another page by following one of the outbound links of page $i$ chosen at random, we arrive at the same formulation

\[ \mathbf{r}=\mathbf{Mr} \]
as before.

Of course the random surfer would not stop when there are no outbound links on the page they are on; they would whimsically jump to any other page on the Internet. The model is then augmented to accommodate such behavior:

\[ \mathbf{r} = \beta \mathbf{Mr} + \frac{(1-\beta)}{N}\mathbf{1} \]

where $0<\beta<1$, $N$ is the total number of pages in the web graph, $\mathbf{1}$ is the all-ones vector, and the PageRanks in $\mathbf{r}$ are normalized to sum to one. The first term in the equation models the random surfer picking, with probability $\beta$, one of the outbound links at random to visit a subsequent page, and the second term models them jumping, with probability $1-\beta$, to any page on the Internet chosen uniformly at random.

We can rewrite the prior formulation as:

\[\mathbf{r} = \mathbf{M'r} \]

where $\mathbf{M'} = \beta \mathbf{M} + \frac{(1-\beta)}{N}\mathbf{J}$ and $\mathbf{J}$ is the all-ones matrix. This formulation is well posed and has a solution, since $\mathbf{M'}$ is a stochastic matrix. Using the power method on the last equation will yield the PageRanks $\mathbf{r}$ of all the web pages on the Internet.
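A sketch of the damped iteration under the same assumptions as the earlier snippet, with an illustrative $\beta$ of $0.85$; note that $\mathbf{M'}$ is applied implicitly so the dense all-ones matrix is never formed.

```python
import numpy as np

def pagerank(M, beta=0.85, eps=1e-9, max_iter=1000):
    """Power iteration on M' = beta*M + (1-beta)/N * J without forming J."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    teleport = (1.0 - beta) / n          # random-jump probability mass per page
    for _ in range(max_iter):
        r_next = beta * (M @ r) + teleport
        r_next /= r_next.sum()           # renormalize so the ranks sum to one
        if np.linalg.norm(r_next - r, 2) < eps:
            return r_next
        r = r_next
    return r
```

For example, pagerank(build_M(pages, links)) chains this sketch with the earlier one to compute ranks for a small hypothetical link graph.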

Of course that computation requires a lot of engineering magic, since there are billions of web pages on the Internet, and each have many outbound connections.


Thursday, December 11, 2014

Randomized Algorithms: Polynomial equivalence

We are all accustomed to deterministic algorithms; we work with them every day, and feel comfortable knowing that the results of running them are predictable, barring a coding error of course. The idea of randomized algorithms feels remote and uncomfortable, despite their usefulness and elegance. There are a couple of great examples in the introductory chapters of the book "Probability and Computing" that are real eye openers.

One is verifying polynomial identities: how can you tell that two different representations of polynomials are the same? For example, if we have two polynomials $P(x)$ and $Q(x)$, both of degree $d$ described by the following formulas:
\[
P(x) = \Sigma_{i=0}^{i=d} a_i x^i \\
Q(x) = \Pi_{i=1}^{i=d} (x-b_i)
\]

how can we determine that they are the same polynomial?

Intuitively we would first check that the degrees are the same, and then try to transform one form into the other: either by multiplying out the terms of $Q(x)$, collecting like terms, and reducing it to $P(x)$, or by finding the $d$ roots $r_i$ of $P(x)$ and expressing it as a product of $d$ terms of the form $(x-r_i)$ for comparison with $Q(x)$. The first approach is easier, requiring $O(d^2)$ multiplications, but can we do better?

The book presents a randomized algorithm to do the same. What if we pick a random integer $r$ from the range $[0, Nd]$, where $N$ is a natural number greater than zero? For example, if $N$ is $100$, we pick a random integer $r$ from the range $[0,100d]$, and evaluate $P(r)$ and $Q(r)$, which can be done in $O(d)$ time. What would the result tell us?

If $P(r) \ne Q(r)$ then the polynomials are not the same. If $P(r) = Q(r)$, then either the polynomials are equivalent, or they are not and $r$ happens to be a root of the equation $P(x)-Q(x)=0$. Since $P(x)-Q(x)$ is in that case a non-zero polynomial of degree at most $d$, it has at most $d$ roots among the $Nd+1$ candidate values, so the chance that we picked an $r$ satisfying the last equation is no more than $1/N$ ($1/100$ in the concrete example).

How do we minimize that chance? By repeating the evaluation, drawing another random value $r$ from the interval $[0,Nd]$ each time. The book works out the probability of producing a false result as the evaluations are repeated, with and without replacement, and in both cases it is at most $(1/N)^k$, where $k$ is the number of evaluations.
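Here is a sketch of the repeated-evaluation test in Python, drawing with replacement; the representation of $P$ as a coefficient list a and of $Q$ as a root list b, and the function name, are illustrative assumptions (integer inputs keep the arithmetic exact, so equality checks are safe).

```python
import math
import random

def probably_equal(a, b, N=100, k=10):
    """Test whether P(x) = sum(a[i] * x**i) and Q(x) = prod(x - b[i]) agree
    by evaluating both at k random integer points in [0, N*d]."""
    d = len(b)                                            # degree of the polynomials
    for _ in range(k):
        r = random.randint(0, N * d)
        p = sum(coef * r**i for i, coef in enumerate(a))  # evaluate P(r)
        q = math.prod(r - root for root in b)             # evaluate Q(r)
        if p != q:
            return False        # the polynomials are definitely different
    return True                 # equal, with failure probability at most (1/N)**k
```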

Isn't it quite elegant to find out that two polynomials are the same or not simply through repeated evaluations, and not through algebraic manipulations to transform either or both to a canonical form?

Monday, December 8, 2014

Down memory lane: Approximate counting

At work we do a lot of counting. For some counts we need an accurate result, while for others we can get by with an approximate count. For these approximate counts, as long as the result is within the same order of magnitude of the true count, we're ok.

There is a lot of literature on approximate counting techniques, and the blog post "Probabilistic Data Structures for Web Analytics and Data Mining" does a great job at explaining some of them in detail, with references to the original papers.

One of the original approximate counting papers was the 1978 paper "Counting large numbers of events in small registers" by Robert Morris. In the paper Morris explains the context for the problem, which might seem foreign today. Back in 1978 he was faced with the problem of counting a large number of different events, and because of memory restrictions he could only use 8-bit registers for the counters. Since he was not interested in exact counts, but in counts that were accurate enough, he came up with the technique of probabilistic counting.

The first solution he explored in the paper was to count every other event, based on the flip of a fair coin: each event increments the counter with probability $\frac{1}{2}$, and the count is estimated as twice the counter value. For a true value $n$, the expected value of the counter is $n/2$, with a standard deviation of $\sqrt{n/4}$. The relative error of such a solution is high for small values of $n$.

Morris then presents his second solution: a counter $v$ that stores (approximately) the logarithm of the number of events that have occurred, not the raw count of the events. He increments the counter value $v$ probabilistically, based on a flip of a coin. The stored value in the counter approximates $\log{(1+n)}$, and the probability of increasing the counter is $1/\Delta$, where $\Delta = \exp{(v+1)}-\exp{(v)}$ is the change in the implied count estimate if $v$ were incremented.
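A small sketch of such a counter in Python, following the formulation above (natural logarithms); the class name and the estimate $\hat{n} = e^v - 1$ are my reading of that formulation rather than the paper's exact presentation.

```python
import math
import random

class MorrisCounter:
    """Approximate counter that stores v ~ log(1 + n) instead of n itself."""

    def __init__(self):
        self.v = 0

    def increment(self):
        # Bump v with probability 1/delta, where delta = exp(v+1) - exp(v)
        # is the change in the implied estimate if v were incremented.
        delta = math.exp(self.v + 1) - math.exp(self.v)
        if random.random() < 1.0 / delta:
            self.v += 1

    def estimate(self):
        # Invert v ~ log(1 + n) to recover an approximate count.
        return math.exp(self.v) - 1
```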

The paper is an enjoyable short read, and it is amazing that the techniques introduced in the 70's are still applicable to technology problems today.