Operating large
scale Internet services today is a challenge, and making sure that the services
run well with minimal customer disruptions is doubly so. The reason is that
both require good visibility into how the individual service components are
performing, which necessitates gathering and analyzing a lot of measurements
about the performance.   The measurements
vary from metrics annotated with labels or dimensions that can be used to
filter and group the results at query time, to exception stacks, log lines, and
trace events. Collecting and analyzing such a large amount of metrics is the
realm of time series databases, and the paper: "Gorilla, a fast,
scalable, in-memory time series database" presents such a system which
is in use at Facebook to handle monitoring and alerting their vast
infrastructure.
In the paper the
authors start by articulating the design principles for Gorilla: they wanted a
system that is always available for writes; they tolerated the loss of a small
amount of data; and they wanted a geo-replicated system that is partition tolerant
to network cuts, software updates, and configuration changes.
From their prior
experience with monitoring systems at Facebook, the authors made a couple of
insightful observations. First, they noticed that users of monitoring systems
are mostly interested in recent data points rather than historical data, with
almost 85% of the diagnostics queries being for data collected in the last 26
hours. Second they observed that users are interested in aggregate data rather
than individual data points for diagnostics scenarios. Third the authors
noticed a big disparity between write rates that exceeded millions per second,
and read rates that are orders of magnitude lower, and mostly from dashboards
and alerting systems.
These observations
steered the authors toward an in-memory system that ingests and computes over
the last 26 hours of data,  backed by an
HBase solution for the historical data. At Facebook's scale this presented a challenge,
and the authors shared some of the perf requirements in the paper:
- 700 million data points (time stamp and value) added per minute
- Data stored for 26 hours
- More than 40,000 queries per second at peak
- Reads succeed in under one millisecond.
- Support time series with 15 second granularity (4 points per minute per time series)
- Two in-memory, not co-located replicas (for disaster recovery capacity)
- Always serve reads even when a single server crashes
- Ability to quickly scan over all in memory data
- Support at least 2x growth per year.
The authors came up
with brilliant compression schemes that allowed them to compress the time
series data to roughly 1.37 bytes per point, or a 12x reduction in size, which
allowed them to store 26 hours' worth of data in 20 hosts with 1.3TB of memory.
Each time series
consists of monitoring data, modeled as a tuple: a string that serves as a key
to the time series, a 64 bit timestamp, and a double precision float value. The
time series are sharded by key, and each key goes to a Gorilla host. Every write
goes to two different geo locations, and on host failure, all reads failover to
the different region. 
The authors compress
the time series data using delta encoding using two ways. First is compressing
the timestamp for every monitoring data tuple. The series starts with an
uncompressed 64 bit timestamp. Further timestamps are compressed using
delta-of-deltas. For example, if the deltas are 60, 60, 50, and 61 the delta of
deltas are  0, -1, and 2. The delta of
deltas are further encoded using variable length encoding depending on the
range the delta falls into. For -63 to 64 the prefix is 10 followed by 7bits,
for -255 to 256 the prefix is 110 followed by 9bits, and for -2047 to 2048 the
prefix is 1110 followed by 12bits, otherwise the prefix is 1111 followed by 32
bits. 
For monitored
values, the floating points are compressed by xor'ing with the prior value, and
encoding the resulting integer with a header '11' in two bits, number of
leading zeros, how many meaningful bits exist, and the value of the meaningful
bits. Again, the first monitored value is stored without compression. The
subsequent values are XOR'ed with the previous values. If the result is 0, a
single 0 bit is stored. Otherwise the authors calculate the number of leading
and trailing zeros, and store a prefix 1 followed by:
0 if the block of
meaningful bits falls within the block of previous meaningful bits, and store
the meaningful block
Otherwise store 1,
the length of leading 0s, the length of meaningful bits in the XORed value, and
the meaningful bits
For Facebook's
monitored data, the authors found that roughly 51% of the values are encoded
with 1 bit,  about 30% compressed with
'10', and 19% compressed with '11'.
The authors use
in-memory data structures to store time series data with a lookup map from time
series name to the data for quick access. When time series are deleted, they
are tomb stoned.
The in-memory data
structures are backed by disk, and the data is stored in GlusterFS with 3x
replication. Each host owns multiple shards of data stored in a single
directory per shard. The directory contains key lists, append only logs,
complete block files, and checkpoints. The shard manager uses Paxos to assign
shards to nodes.
Data is streamed
between Gorilla instances without guaranteeing consistency between the geo
locations. If one geo fails,  the query
traffic is sent to the other geo center, and is only restored after the region
has been healthy for 26 hours.
The authors then
describe the tools that use Gorilla for monitoring and alerting. There is a low
latency query processing tool, a correlation engine that compares the time
series to millions of others to figure out what happened around the time a
service broke, a charting service to visually find where the outliers are, and
an aggregation service.
The authors end with
their experiences operating Gorilla over the prior 6 months including fault
tolerance, and help with diagnostic scenarios, and close with the lessons they
learned, namely:
- Prioritize recent data over historical data
- Read latency matters
- High availability trumps resource efficiency
Comments
Post a Comment