Tuesday, February 23, 2016

A paper a day keeps the doctor away: Gorilla: A Fast, Scalable, In-Memory Time Series Database

Operating large-scale Internet services today is a challenge, and making sure that the services run well with minimal customer disruption is doubly so. Both require good visibility into how the individual service components are performing, which means gathering and analyzing a lot of performance measurements. The measurements vary from metrics annotated with labels or dimensions, which can be used to filter and group results at query time, to exception stacks, log lines, and trace events. Collecting and analyzing such a large volume of measurements is the realm of time series databases, and the paper "Gorilla: A Fast, Scalable, In-Memory Time Series Database" presents such a system, which Facebook uses for monitoring and alerting across its vast infrastructure.

In the paper the authors start by articulating the design principles for Gorilla: they wanted a system that is always available for writes; they were willing to tolerate the loss of a small amount of data; and they wanted a geo-replicated system that tolerates partitions caused by network cuts, software updates, and configuration changes.

From their prior experience with monitoring systems at Facebook, the authors made a few insightful observations. First, they noticed that users of monitoring systems are mostly interested in recent data points rather than historical data, with almost 85% of the diagnostics queries being for data collected in the last 26 hours. Second, they observed that for diagnostics scenarios users are interested in aggregate data rather than individual data points. Third, they noticed a big disparity between write rates, which exceeded millions of data points per second, and read rates, which are orders of magnitude lower and come mostly from dashboards and alerting systems.

These observations steered the authors toward an in-memory system that ingests and computes over the last 26 hours of data, backed by an HBase-based store for the historical data. At Facebook's scale this presented a challenge, and the authors shared some of the performance requirements in the paper:

  • 700 million data points (time stamp and value) added per minute
  • Data stored for 26 hours
  • More than 40,000 queries per second at peak
  • Reads succeed in under one millisecond
  • Support time series with 15-second granularity (4 points per minute per time series)
  • Two in-memory, not co-located replicas (for disaster recovery capacity)
  • Always serve reads even when a single server crashes
  • Ability to quickly scan over all in-memory data
  • Support at least 2x growth per year


The authors came up with brilliant compression schemes that shrink the time series data to roughly 1.37 bytes per point, a 12x reduction in size, which let them store 26 hours' worth of data in 20 hosts with 1.3TB of memory.
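
To put the 12x figure in context: an uncompressed point is 16 bytes (an 8 byte timestamp plus an 8 byte double), and 16 / 1.37 ≈ 11.7, which rounds to the quoted 12x. At 700 million points per minute, 26 hours of data is roughly 1.1 trillion points, or about 1.5TB once compressed versus roughly 17.5TB raw; this is only a back-of-the-envelope figure that ignores replication and indexing overhead.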

Each time series consists of monitoring data modeled as a tuple: a string key that identifies the time series, a 64 bit timestamp, and a double precision floating point value. The time series are sharded by key, with each key owned by a single Gorilla host. Every write goes to two different geographic regions, and on host failure, reads fail over to the other region.
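
As a concrete illustration of the data model, each point can be thought of as a small tuple, with the key alone determining which host owns the series; the key format and hash function below are made up for the example and are not from the paper.

    import zlib
    from collections import namedtuple

    # One monitoring data point: string key, 64-bit timestamp, double value.
    Point = namedtuple("Point", ["key", "timestamp", "value"])

    def owning_host(key, num_hosts):
        # Sharding by key: every point of a given series lands on the same host.
        return zlib.crc32(key.encode()) % num_hosts

    p = Point("web.cluster1.host42.cpu_idle", 1456185600, 97.3)
    host = owning_host(p.key, 20)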

The authors compress the time series data in two ways: first the timestamps, then the values. A series starts with an uncompressed 64 bit timestamp; subsequent timestamps are compressed using delta-of-deltas. For example, if the deltas between consecutive timestamps are 60, 60, 59, and 61, the deltas of deltas are 0, -1, and 2. A delta of deltas of zero is stored as a single '0' bit; non-zero values are encoded with a variable-length scheme depending on the range they fall into: for -63 to 64 the prefix is '10' followed by 7 bits, for -255 to 256 the prefix is '110' followed by 9 bits, for -2047 to 2048 the prefix is '1110' followed by 12 bits, and otherwise the prefix is '1111' followed by 32 bits.
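
To make the timestamp scheme concrete, here is a minimal Python sketch of the encoder. The bucket boundaries follow the paper; treating the initial delta as zero and offsetting each delta of deltas by its bucket's lower bound to pack it into the payload bits are simplifying assumptions of this sketch, not details from the paper.

    # Minimal sketch of delta-of-delta timestamp compression. It returns the
    # encoded stream as a string of '0'/'1' characters for clarity; a real
    # implementation would pack bits into a byte buffer.
    def encode_timestamps(timestamps):
        bits = format(timestamps[0] & (2**64 - 1), "064b")  # first timestamp, uncompressed
        prev_ts, prev_delta = timestamps[0], 0              # assumption: initial delta is zero
        for ts in timestamps[1:]:
            delta = ts - prev_ts
            dod = delta - prev_delta                        # delta of deltas
            if dod == 0:
                bits += "0"                                 # unchanged interval: a single bit
            elif -63 <= dod <= 64:
                bits += "10" + format(dod + 63, "07b")      # offset into [0, 127]
            elif -255 <= dod <= 256:
                bits += "110" + format(dod + 255, "09b")
            elif -2047 <= dod <= 2048:
                bits += "1110" + format(dod + 2047, "012b")
            else:
                bits += "1111" + format(dod & (2**32 - 1), "032b")
            prev_ts, prev_delta = ts, delta
        return bits

    # Points arriving every ~60 seconds compress to a handful of bits each:
    # encode_timestamps([1456185600, 1456185660, 1456185720, 1456185780])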

For the monitored values, the first value in a series is stored uncompressed, and each subsequent value is XOR'ed with the previous one. If the result is zero (the value did not change), a single '0' bit is stored. Otherwise the authors compute the number of leading and trailing zeros in the XORed value and store a '1' control bit followed by:
  • '0' and the meaningful bits, if the block of meaningful bits falls within the previous value's block of meaningful bits
  • '1', the number of leading zeros, the length of the meaningful bits in the XORed value, and the meaningful bits themselves, otherwise
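
The value encoder can be sketched similarly. The control bits and the reuse of the previous value's window of meaningful bits follow the description above, and the 5-bit leading-zero field and 6-bit length field follow the paper; capping the leading-zero count at 31 and encoding a full 64-bit window as length 0 are conventions assumed here for the corner cases.

    import struct

    def float_to_bits(x):
        # Reinterpret a double as an unsigned 64-bit integer.
        return struct.unpack(">Q", struct.pack(">d", x))[0]

    # Minimal sketch of the XOR-based value compression, again emitting a
    # '0'/'1' string rather than packed bytes.
    def encode_values(values):
        prev = float_to_bits(values[0])
        bits = format(prev, "064b")                         # first value, uncompressed
        prev_lead, prev_trail = 65, 65                      # sentinel: no previous window yet
        for v in values[1:]:
            cur = float_to_bits(v)
            xor = prev ^ cur
            if xor == 0:
                bits += "0"                                 # identical value: a single bit
            else:
                lead = 64 - xor.bit_length()                # leading zeros in the 64-bit XOR
                trail = (xor & -xor).bit_length() - 1       # trailing zeros
                if lead >= prev_lead and trail >= prev_trail:
                    # Meaningful bits fit inside the previous window: reuse its size.
                    length = 64 - prev_lead - prev_trail
                    bits += "10" + format(xor >> prev_trail, "0{}b".format(length))
                else:
                    lead = min(lead, 31)                    # cap so the count fits in 5 bits (assumed convention)
                    length = 64 - lead - trail
                    bits += ("11" + format(lead, "05b")
                             + format(length % 64, "06b")   # a length of 64 stored as 0 (assumed convention)
                             + format(xor >> trail, "0{}b".format(length)))
                    prev_lead, prev_trail = lead, trail
            prev = cur
        return bits

For a slowly changing metric, most points hit the single-bit or '10' case, which lines up with the distribution the authors report next.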

For Facebook's monitored data, the authors found that roughly 51% of the values are encoded with a single bit (the value was identical to the previous one), about 30% are compressed with the '10' prefix, and 19% with the '11' prefix.

The authors use in-memory data structures to store the time series data, with a lookup map from time series name to its data for quick access. When time series are deleted, they are tombstoned.
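
A toy sketch of that lookup structure and of tombstoning is below; the real structure in the paper is sharded, designed for concurrent readers, and keeps points in the compressed blocks described earlier, none of which this sketch attempts.

    # Toy in-memory index: series name -> points, with tombstoned deletes.
    class TimeSeriesStore:
        def __init__(self):
            self.series = {}                        # name -> list of (timestamp, value)

        def append(self, name, timestamp, value):
            self.series.setdefault(name, []).append((timestamp, value))

        def delete(self, name):
            # Tombstone: keep the entry so in-flight readers and scans still
            # find a valid (but empty) series; memory is reclaimed later.
            self.series[name] = None

        def query(self, name):
            return self.series.get(name) or []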

The in-memory data structures are backed by disk, with the data stored in GlusterFS with 3x replication. Each host owns multiple shards of data, stored in a single directory per shard. Each directory contains key lists, append-only logs, completed block files, and checkpoints. A shard manager uses Paxos to assign shards to nodes.
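
For illustration only, a shard directory might look like the sketch below; the four kinds of files are the ones named above, but the path, file names, and layout are hypothetical.

    /gorilla/shard-0421/
        keys                     key list: series name -> integer id
        log                      append-only log of incoming (id, timestamp, value) writes
        block.0001, block.0002   completed, compressed block files
        checkpoint               checkpoint marker files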

Data is streamed between Gorilla instances without guaranteeing consistency between the regions. If one region fails, query traffic is sent to the other region, and is only moved back after the failed region has been healthy for 26 hours, i.e. long enough to hold a full window of data.

The authors then describe the tools that use Gorilla for monitoring and alerting. There is a low latency query processing tool, a correlation engine that compares the time series to millions of others to figure out what happened around the time a service broke, a charting service to visually find where the outliers are, and an aggregation service.

The authors end with their experiences operating Gorilla over the prior six months, including fault tolerance and help with diagnostic scenarios, and close with the lessons they learned:
  • Prioritize recent data over historical data
  • Read latency matters
  • High availability trumps resource efficiency