Skip to main content

A paper a day keeps the doctor away: BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

The latest advances in Big Data systems have made storing and computing over large amounts of data more tractable than in the past. Users' expectations for how long a query should take to complete have not on the other hand  changed, and remain independent of the amount of data that needs to be processed. The expectation mismatch of query run time causes user frustration when iteratively exploring large data sets in search of an insight. How can we alleviate that frustration? BlinkDB offers users a way to balance result accuracy with query execution time: the users can either get quantifiably approximate answers very quickly, or they can elect to wait for a longer period of time to get more accurate results. BlinkDB accomplishes this tradeoff through the magic of dynamic sample selection, and an adaptive optimization framework.

The authors start with an illustrative example of computing the average session time for all users in New York. If the table that stores users' sessions is very large, and cannot fit in memory, the query will take a long time to complete since disk reads are expensive. If the same query ran on a smaller sample of data that could fit in memory, it would run orders of magnitude more quickly. The query would produce approximate results that would be accurate enough for many practical purposes. Sampling theory can help with quantifying the accuracy of the results.

BlinkDB is different from other sampling based systems in that it does not make any assumptions on the values of the filtered attributes in the WHERE, GROUP BY, and HAVING clauses in the SQL queries it processes. The only assumptions it makes is that the set of columns that appear in these filters remain stable over time. This allows BlinkDB to process a flexible variety of workloads. To test the validity of the stable column set assumption, the authors analyzed 18K queries from Conviva--a company managing video distribution across the Internet, and 69K queries from Facebook. In both cases, over 90% of the queries were covered by 10% of the columns.

To process the queries, BlinkDB consists of two modules: one for sample creation, and the other for sample selection. The sample creation module creates stratified samples which ensure that rare values of any column are overly represented relative to a uniform random sample. The stratified sampling strategy allows BlinkDB to answer queries about any value regardless of how frequently it is represented in the underlying data. The sample selection module selects which samples to run queries on, to satisfy the query's error and response time constraints.

BlinkDB supports a constrained set of SQL aggregation queries--COUNT, AVG, SUM, and QUANTILE, where operations can be annotated with error bounds. BlinkDB does not support arbitrary joins, or nested queries, however it supports joining a large sampled table with smaller tables, if the smaller tables can fit in memory on a single node in the cluster.

The authors explain in detail how BlinkDB creates stratified samples. Intuitively, the algorithm starts with a sample size $n$, and counts the distinct dimension value combinations $x$ for the query column sets for that sample size $N_n(x)$. After the counts are complete, for each distinct dimension value combination $x$, the algorithm takes uniform samples from $N_n(x)$ rows randomly without replacement forming a sample $S_x$. The full sample space is the union of all the $S_x$ samples. The sample size $n$ directly determines the error of the query results. Because of the relationship between the sample size $n$ and the error rate, the authors claim that the samples are hierarchical, where $S_n \subset S_{n^{max}}$, and so we don't need to compute multiple sample spaces decreasing the storage overhead for samples. The authors claim that the overhead is roughly $2.5\%$ of the original table size.

The authors explain how BlinkDB selects the query column sets to use to create the stratified samples. They pose the problem as an optimization problem, that factors in the sparsity of the data, the workload characteristics, and the sample storage cost. Once the query column sets are selected, BlinkDB creates the stratified samples, and maintains them over time. Creating uniform samples roughly takes a hundred seconds to create, while stratified samples take between 5 and 30 minutes to create, depending on the number of unique values in the query column sets.

The authors implemented BlinkDB on top of Hive/Hadoop and Shark with minimum changes to the underlying query processing system. BlinkDB adds two major components to the Hive framework: offline sampling modules that creates and maintains samples over time, and a runtime sample selection component. The offline sampling module uses techniques such as reservoir sampling and binomial sampling to create the stratified samples across dimensions.

The authors tested the system on a 100-node cluster using TPC-H benchmarks, and real world workload from Facebook, and Conviva.  The clusters used Amazon's EC2 extra-large instances, each with 8 cores, 68GB of RAM, and 800GB disk space. The cluster used 75TB of disk space, and 6TB of distributed RAM cache.

The authors ran performance experiments using the Conviva data--a single  fact table, with about 104 columns, and 5.5 billion rows. The experiments show that queries on 17 TB of data take about 2 seconds to finish with a 90-98% accuracy vs thousands of seconds using systems that don't use stratified sampling.

The authors end by reviewing other systems in the literature such as Spark, Shark, and Dremel, which work well if the data processed fits into the aggregate memory in the cluster, but break down if the data size increases beyond that. The authors also compare BlinkDB to other systems that employ sampling to speed up the computations such as STRAT, SciBORQ, and AQUA.

Comments

Popular posts from this blog

Why good customer service matters?

I am not an Apple fan, but I do like their computers, and recommend them to colleagues and friends for a variety of reasons. They are well designed, and in addition to an excellent user interface, they run a flavor of Unix--which makes the life of computer programmers a lot easier. But most importantly, Apple's customer support is impeccable, that despite all the hardware issues I experienced in the past, I still recommend Apple computers. Let me explain why. A year and a half ago, I bought a Mac Book Pro for work. At the time it was the first generation unibody laptop, that had an i7 processor, lots of memory, and lots of disk space. Alas, like first generation models everywhere, it also had a lot of hardware problems. The most annoying of which was the screen randomly turning dark, with the hard drive spinning out of control. The only way to get out of this state was by forcing a reboot by holding down the power button, and losing everything I have been working on. At first

Kindle Paperwhite

I have always been allergic to buying specialized electronic devices that do only one thing, such as the Kindle, the iPod, and fitness trackers. Why buy these when technology evolves so fast that a multi-purpose device such as the phone or a smart watch can eventually do the same thing, but with the convenience of updates that fix bugs and add functionality? So, I was shocked when this weekend I made an impulse buy and got the newest Kindle Paperwhite—a special purpose device for reading eBooks. I was walking past the Amazon store in the mall and saw that the newest Kindle Paperwhites were marked down by $40 for the holidays. The device looked good in the display, so I went in to look at it closely. The Paperwhite is small and light, with a 6” screen that is backlit and waterproof.   The text was crisp and readable, and in the ambient light, it felt like I am reading a printed book. I was sold and bought it on the spot. At home I have struggled to put it down. The books

New ASUS RT-AX88U router

  I have been using Asus routers for many years, and have been pretty happy with them. The web interface is superb, and the firmware upgrades are timely and easy to apply, and over the last couple of years have introduced newer features that kept my old router relevant and functional.   After many years of service, my older router finally gave way, and started dropping Wifi connections randomly, especially when under heavy load. The connection drop happens whenever the kids have a Zoom meeting, or my wife and I are on work calls. Turning the laptop/iPad Wifi off and on again did not help, and we usually had to reboot the router to be able to connect again. Out of curiosity I looked at the CPU/memory stats of the router under heavy load, and could not see any issues. Even when all of us were in video calls, the CPU/memory did not rise about 50%. I could not see anything abnormal in the logs either. Online I saw that a lot of people had similar problems after upgrading to the latest rout