A paper a day keeps the doctor away: BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
The latest advances in Big Data systems have made storing and computing over large amounts of data more tractable than in the past. Users' expectations for how long a query should take to complete, however, have not changed, and remain independent of the amount of data that needs to be processed. This mismatch between expected and actual query run times causes frustration when users iteratively explore large data sets in search of an insight. How can we alleviate that frustration? BlinkDB offers users a way to balance result accuracy against query execution time: users can either get quantifiably approximate answers very quickly, or they can elect to wait longer for more accurate results. BlinkDB accomplishes this tradeoff through the magic of dynamic sample selection and an adaptive optimization framework.
The authors start with an illustrative example of computing the average session time for all users in New York. If the table that stores users' sessions is very large and cannot fit in memory, the query will take a long time to complete since disk reads are expensive. If the same query ran on a smaller sample of the data that fits in memory, it would complete orders of magnitude faster, producing approximate results that are accurate enough for many practical purposes. Sampling theory can help quantify the accuracy of those results.
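Concretely, classical sampling theory gives closed-form error bars for such aggregates. For an AVG over a uniform random sample of $n$ rows (standard notation here, not the paper's), a confidence interval is $\bar{x} \pm z_{\alpha/2}\,\hat{\sigma}/\sqrt{n}$, where $\bar{x}$ is the sample mean, $\hat{\sigma}$ the sample standard deviation, and $z_{\alpha/2} \approx 1.96$ at 95% confidence. The error shrinks as $1/\sqrt{n}$, which is why a modest sample often suffices.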
BlinkDB is different from other sampling-based systems in that it makes no assumptions about the values of the filtered attributes in the WHERE, GROUP BY, and HAVING clauses of the SQL queries it processes. The only assumption it makes is that the set of columns that appear in these filters remains stable over time. This allows BlinkDB to process a flexible variety of workloads. To test the validity of the stable column set assumption, the authors analyzed 18K queries from Conviva, a company that manages video distribution across the Internet, and 69K queries from Facebook. In both cases, over 90% of the queries were covered by 10% of the columns.
To process queries, BlinkDB relies on two modules: one for sample creation and one for sample selection. The sample creation module builds stratified samples, which ensure that rare values of any column are over-represented relative to a uniform random sample. This stratified sampling strategy allows BlinkDB to answer queries about any value, regardless of how frequently it appears in the underlying data. The sample selection module picks which samples to run a query on so as to satisfy the query's error and response time constraints.
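To meet an error constraint, the selection module effectively inverts the sampling-error formula above. As a simplified sketch in my own notation (the paper builds richer per-query error and latency profiles): for an AVG with error target $e$ at confidence level $\alpha$, the required sample size is roughly $n \approx (z_{\alpha/2}\,\hat{\sigma}/e)^2$, and the module can then pick the smallest stored sample at least that large.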
BlinkDB supports a constrained set of SQL aggregation queries--COUNT, AVG, SUM, and QUANTILE--whose operations can be annotated with error or response time bounds. BlinkDB does not support arbitrary joins or nested queries; it does, however, support joining a large sampled table with smaller tables, as long as the smaller tables fit in memory on a single node in the cluster.
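For example (the surface syntax here is approximated from the paper, not quoted verbatim), a query can request a quality target, as in SELECT AVG(sessionTime) FROM Sessions WHERE city = 'New York' ERROR WITHIN 10% AT CONFIDENCE 95%, or trade accuracy for speed by attaching a deadline such as WITHIN 2 SECONDS instead.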
The authors explain in detail how BlinkDB creates stratified samples. Intuitively, for a chosen query column set, the algorithm first counts $N(x)$, the number of rows for each distinct value combination $x$ of the columns in that set. Then, given a per-group cap $n$, it draws for each $x$ a uniform random sample of $\min(n, N(x))$ rows without replacement, forming a per-group sample $S_x$. The full stratified sample is the union of all the $S_x$ samples. The cap $n$ directly determines the error of the query results. Because of the relationship between $n$ and the error rate, the samples can be organized hierarchically, with $S_n \subseteq S_{n_{\max}}$, so BlinkDB only needs to store the largest sample and can read any smaller one as a subset of it, decreasing the storage overhead for samples. The authors claim that this overhead is roughly $2.5\%$ of the original table size.
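A minimal Python sketch of this cap-based stratification, assuming an in-memory table represented as a list of dicts; the names stratified_sample, qcs, and cap are mine, for illustration, not the paper's:

    import random
    from collections import defaultdict

    def stratified_sample(rows, qcs, cap, seed=0):
        """Cap-based stratification: for each distinct combination x of
        the query column set `qcs`, keep min(cap, N(x)) rows drawn
        uniformly at random without replacement."""
        rng = random.Random(seed)
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[c] for c in qcs)].append(row)
        sample = []
        for group in groups.values():
            # Rare values keep all their rows; frequent values are capped,
            # so every group is represented in the final sample.
            sample.extend(rng.sample(group, min(cap, len(group))))
        return sample

Keeping each group's retained rows in random order is what makes the hierarchy work: the first $k$ rows of every group form a valid smaller stratified sample, so only the largest cap ever needs to be materialized.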
The authors also explain how BlinkDB selects the query column sets used to create the stratified samples. They pose this as an optimization problem that factors in the sparsity of the data, the workload characteristics, and the sample storage cost. Once the query column sets are selected, BlinkDB creates the stratified samples and maintains them over time. Uniform samples take roughly a hundred seconds to create, while stratified samples take between 5 and 30 minutes, depending on the number of unique values in the query column sets.
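To give a flavor of that column set selection problem, here is a simplified sketch in my notation rather than the paper's exact formulation: pick indicator variables $z_j \in \{0,1\}$ over candidate query column sets $\phi_j$ to maximize the expected benefit on the historical workload, $\max \sum_j p_j \, G_j \, z_j$, subject to a storage budget $\sum_j |S(\phi_j, n)| \, z_j \le \mathcal{C}$, where $p_j$ is the fraction of past queries covered by $\phi_j$ and $G_j$ captures the expected error reduction, which depends on the sparsity of the data.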
The authors implemented BlinkDB on top of Hive/Hadoop and Shark with minimal changes to the underlying query processing system. BlinkDB adds two major components to the Hive framework: an offline sampling module that creates and maintains samples over time, and a runtime sample selection component. The offline sampling module uses techniques such as reservoir sampling and binomial sampling to create the stratified samples across dimensions.
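Reservoir sampling is the classic one-pass technique for maintaining a fixed-size uniform sample over a stream of unknown length, which is what makes keeping per-group samples up to date cheap as new rows arrive. A minimal sketch of the textbook algorithm (Algorithm R), not BlinkDB's actual code:

    import random

    def reservoir_sample(stream, k, seed=0):
        """Maintain a uniform random sample of k items from a stream,
        using O(k) memory and a single pass."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Item i replaces a random slot with probability k/(i+1).
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir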
The authors tested the system on a 100-node cluster using the TPC-H benchmarks and real-world workloads from Facebook and Conviva. The cluster used Amazon EC2 extra-large instances, each with 8 cores, 68GB of RAM, and 800GB of disk space. In total, the cluster had 75TB of disk space and a 6TB distributed RAM cache.
The authors ran performance experiments using the Conviva data, a single fact table with about 104 columns and 5.5 billion rows. The experiments show that queries over 17 TB of data finish in about 2 seconds with 90-98% accuracy, versus thousands of seconds on systems that do not use stratified sampling.
The authors end by reviewing other systems in the literature, such as Spark, Shark, and Dremel, which work well when the data being processed fits in the cluster's aggregate memory but break down once the data grows beyond that. They also compare BlinkDB to other systems that employ sampling to speed up computation, such as STRAT, SciBORQ, and AQUA.