In most database
systems, the user defines the shape of the data that is stored and queried
using concepts such as entities and relations. The database system takes care
of translating that shape into physical storage, and managing its lifecycle.
Most of the systems store data in the form of tuples, either in row format, or
broken down into columns and stored in columnar format. The system also stores
metadata associated with the data, that helps with speedy retrieval and
processing. Defining the shape of the data a priori, and transforming it from
the raw or ingestion format to the storage format is a cost that database
systems incur to make queries faster. What if we can have fast queries without
incurring that initial cost? In the paper "NoDB:
Efficient Query Execution on Raw Data Files", the authors examine that
question, and advocate a system (NoDB) that answers it.
The authors start
with the motivation for such a system. With the recent explosion of data, and
our insatiable appetite for quick insights, loading and transforming the data a
priori is not a welcome overhead. This is especially the case if the user wants
to perform a series of ad-hoc explorations on vast amounts of data, and is
interested in decreasing the data-to-query time. The authors propose that one
way to accomplish that is to eliminate the loading phase, and advocate that
querying over the raw data in-situ is a viable option.
Instead of starting
from scratch, the authors modify a traditional relational database (PostgreSQL)
into a NoDB system, and discuss how they improved query performance using
techniques such as selective parsing, adaptive indexing, and cashing
techniques.
The authors start
with two straightforward approaches to query raw data files. In the first
approach, they propose to run the loading procedure whenever a table is
referenced, and then evaluate the query on the loaded data. The loaded data may
be discarded after the query executes, or may be persisted on disk.
In the second
approach the authors discuss tightly integrated raw file access within the
query execution logic, where the leaf operators of a query plan are enriched
with the ability to access raw data files including parsing and tokenization.
The data parsing and processing occurs in a pipelined fashion, where the raw
file is read in chunks and parsed and passed immediately to the rest of the
query plan operators.
Both approaches
require the schema to be known and declared a priori, and the tables to be
defined as in-situ tables; features that are offered by modern database systems
such as MySQL. The downside of both approaches is that the data is not kept in
persistent tables, and so for repeated queries against the tables, every
invocation needs to perform the loading and parsing from scratch. Moreover both
approaches can't make use of indexes and cost optimizations for the raw data
file.
The authors then
examine the NoDB solution, which aims to provide in-situ query processing
performance competitive with traditional database systems. The authors discuss
how they created data structures that minimize the cost of raw data access for
in-situ querying, and selectively eliminate the need for raw data access
through careful caching and scheduling. The authors use processing over a CSV
file as a vehicle to explain their ideas.
When a query is
submitted to the NoDB solution, and references relational tables that are not
yet loaded, the system overrides the scan operator with the ability to access
raw data files directly, with the remaining query plan generated by the
optimizer remaining unchanged. The system speeds up the raw access through a
variety of techniques. First, it performs selective tokenization of the raw
files, where it aborts tokenizing tuples as soon as it finds the required
attributes for the query. Second it forms tuples selectively, where only the
ones that contain the attributes for a query are fully composed. Third it keeps
an adaptive positional map that describes the raw file shape discovered so far.
The system uses the positional map to reduce parsing and tokenization costs by
using it to navigate and retrieve raw data faster.
Initially the
positional map is empty. As the system executes queries it continuously
augments the map by populating it while tokenizing the raw file for the current
query. For subsequent queries, the system uses the map to jump to the exact
position in the file for the tuple it is looking for, or as close to it as
possible.
NoDB also uses a
cache that temporarily holds previous accessed data, and accesses the cache
instead of reprocessing the raw data file when another query runs.
Another technique
the authors used is to calculate table statistics using sampled data instead of
the full data set. The authors modified the
PostgreSQL scan operator to create statistics on the fly, and the system
would invoke the native PostgreSQL statistics routines with a sample of the
data, and store and use them similar to the way the conventional DB does.
The authors share
the result of the experiments they've conducted on the modified system, and
compare it traditional DBMS performance for similar queries. The authors
implemented NoDB on top of PostgreSQL 9.0, and ran their experiments on a Sun
X4140 server with 2x Quad-Core AMD Opteron processor (64 bit), 2.7 GHz, 512 KB
L1 cache, 2 MB L2 cache and 6 MB L3 cache, 32 GB RAM, 4 x 250GB 10000 RPM SATA
disks (RAID-0), using Ubuntu 9.04. The experiments used a raw data file of 11GB
containing 7.5 million tuples, each containing 150 attributes with random
integers.
The authors
investigate the effect of the positional map with varying storage capacity. The
authors show that the positional map improved response times by a factor of 2.
The authors also observed linear scalability as the file size was increased
gradually from 2GB to 92GB.
The other
experiments show that the auxiliary structures (caching, maps) reduce the time
to access raw data files and amortize the overhead across subsequent queries.
The authors close
with the challenges with their approach including data type conversion, complex
data schemas, and integration with external tools, and the opportunities for
NoDB including flexible storage, adaptive indexing, and file system interfaces.
Comments
Post a Comment