
A paper a day keeps the doctor away: Gorilla: A Fast, Scalable, In-Memory Time Series Database

Operating large scale Internet services today is a challenge, and making sure that the services run well with minimal customer disruption is doubly so. Both require good visibility into how the individual service components are performing, which necessitates gathering and analyzing a large number of measurements. The measurements vary from metrics annotated with labels or dimensions that can be used to filter and group the results at query time, to exception stacks, log lines, and trace events. Collecting and analyzing such a large volume of metrics is the realm of time series databases, and the paper "Gorilla: A Fast, Scalable, In-Memory Time Series Database" presents such a system, which is in use at Facebook for monitoring and alerting across its vast infrastructure.

In the paper the authors start by articulating the design principles for Gorilla: they wanted a system that is always available for writes, that tolerates the loss of a small amount of data, and that is geo-replicated and partition tolerant to network cuts, software updates, and configuration changes.

From their prior experience with monitoring systems at Facebook, the authors made a few insightful observations. First, they noticed that users of monitoring systems are mostly interested in recent data points rather than historical data, with almost 85% of diagnostics queries being for data collected in the last 26 hours. Second, they observed that users are interested in aggregate data rather than individual data points for diagnostics scenarios. Third, the authors noticed a big disparity between write rates, which exceeded millions of data points per second, and read rates, which are orders of magnitude lower and come mostly from dashboards and alerting systems.

These observations steered the authors toward an in-memory system that ingests and computes over the last 26 hours of data, backed by an HBase store for the historical data. At Facebook's scale this presented a challenge, and the authors shared some of the performance requirements in the paper:

  • 700 million data points (time stamp and value) added per minute
  • Data stored for 26 hours
  • More than 40,000 queries per second at peak
  • Reads succeed in under one millisecond
  • Support time series with 15-second granularity (4 points per minute per time series)
  • Two in-memory, not co-located replicas (for disaster recovery capacity)
  • Always serve reads even when a single server crashes
  • Ability to quickly scan over all in-memory data
  • Support at least 2x growth per year


The authors came up with brilliant compression schemes that allowed them to compress the time series data to roughly 1.37 bytes per point, a 12x reduction in size, which in turn made it possible to store 26 hours' worth of data on 20 hosts with 1.3TB of memory.
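For context, each uncompressed point is 16 bytes (a 64-bit timestamp plus a 64-bit double value), and 16 / 1.37 ≈ 11.7, which is where the roughly 12x figure comes from.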

Each time series consists of monitoring data, modeled as a tuple: a string that serves as the key of the time series, a 64-bit timestamp, and a double precision floating point value. Time series are sharded by key, and each key is assigned to a Gorilla host. Every write goes to two different geo locations, and on host failure, all reads fail over to the other region.
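To make the data model concrete, here is a minimal sketch in Python; the Point class, the example series name, and the hash-mod shard assignment are illustrative assumptions rather than the paper's actual implementation.

    from dataclasses import dataclass
    import zlib

    @dataclass
    class Point:
        key: str          # time series name (hypothetical example below)
        timestamp: int    # 64-bit timestamp
        value: float      # double precision measurement

    def shard_for(key, num_hosts):
        # all points for a given key land on the same Gorilla host
        return zlib.crc32(key.encode()) % num_hosts

    p = Point("web.frontend.errors", 1465839830, 12.0)
    print(shard_for(p.key, 20))   # which of 20 hosts owns this series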

The authors compress the time series data in two ways: delta-of-delta encoding for the timestamps, and XOR-based encoding for the values. For the timestamps, each series starts with an uncompressed 64-bit timestamp. Subsequent timestamps are compressed using the delta of deltas. For example, if the deltas are 60, 60, 59, and 61, the deltas of deltas are 0, -1, and 2. The deltas of deltas are then encoded with a variable-length code depending on the range they fall into: a delta of deltas of 0 is stored as a single '0' bit; for -63 to 64 the prefix is '10' followed by 7 bits; for -255 to 256 the prefix is '110' followed by 9 bits; for -2047 to 2048 the prefix is '1110' followed by 12 bits; otherwise the prefix is '1111' followed by 32 bits.
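A minimal sketch of this timestamp encoding in Python might look like the following; the BitWriter helper and the two's-complement packing of negative values are simplifying assumptions, not the paper's exact bit layout.

    class BitWriter:
        # toy bit sink: keeps individual bits in a list for illustration
        def __init__(self):
            self.bits = []

        def write(self, value, nbits):
            # append `value` as an nbits-wide field, most significant bit first
            for i in reversed(range(nbits)):
                self.bits.append((value >> i) & 1)

    def encode_timestamps(timestamps, out):
        out.write(timestamps[0], 64)                # first timestamp uncompressed
        prev, prev_delta = timestamps[0], 0
        for t in timestamps[1:]:
            delta = t - prev
            dod = delta - prev_delta                # delta of deltas
            if dod == 0:
                out.write(0b0, 1)                   # single '0' bit
            elif -63 <= dod <= 64:
                out.write(0b10, 2); out.write(dod & 0x7F, 7)
            elif -255 <= dod <= 256:
                out.write(0b110, 3); out.write(dod & 0x1FF, 9)
            elif -2047 <= dod <= 2048:
                out.write(0b1110, 4); out.write(dod & 0xFFF, 12)
            else:
                out.write(0b1111, 4); out.write(dod & 0xFFFFFFFF, 32)
            prev, prev_delta = t, delta

    w = BitWriter()
    encode_timestamps([1465839830, 1465839890, 1465839950, 1465840009, 1465840070], w)
    print(len(w.bits), "bits vs", 5 * 64, "bits uncompressed")

With this sketch, the five timestamps above (deltas of 60, 60, 59, and 61) take 92 bits instead of 320.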

For the monitored values, the floating point numbers are compressed by XOR'ing each value with the previous one and encoding only the meaningful bits of the result. The first value in a series is stored without compression. Each subsequent value is XOR'ed with the previous value: if the result is 0, a single '0' bit is stored. Otherwise the authors calculate the number of leading and trailing zeros in the XOR'ed value and store a control bit '1' followed by either:
  • '0' and the meaningful bits, if the block of meaningful bits falls within the block of the previous meaningful bits, or
  • '1', the number of leading zeros, the length of the meaningful bits in the XOR'ed value, and the meaningful bits themselves.
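A companion sketch of the value encoding, reusing the BitWriter from the timestamp example above; bits_of() reinterprets a double as its raw 64-bit pattern, and the handling of the "previous window" is a simplification of the paper's scheme.

    import struct

    def bits_of(value):
        # raw IEEE 754 bit pattern of a double
        return struct.unpack('>Q', struct.pack('>d', value))[0]

    def encode_values(values, out):
        prev_bits = bits_of(values[0])
        out.write(prev_bits, 64)                        # first value uncompressed
        prev_lead, prev_trail = 65, 65                  # sentinel: no previous window yet
        for v in values[1:]:
            cur = bits_of(v)
            xor = cur ^ prev_bits
            if xor == 0:
                out.write(0b0, 1)                       # value unchanged: single '0' bit
            else:
                lead = min(64 - xor.bit_length(), 31)   # leading zeros, capped to fit 5 bits
                trail = (xor & -xor).bit_length() - 1   # trailing zeros
                out.write(0b1, 1)
                if lead >= prev_lead and trail >= prev_trail:
                    # '10': meaningful bits fit inside the previous window
                    out.write(0b0, 1)
                    out.write(xor >> prev_trail, 64 - prev_lead - prev_trail)
                else:
                    # '11': new window, 5 bits of leading zeros, 6 bits of length
                    # (a full implementation also handles a 64-bit meaningful block)
                    out.write(0b1, 1)
                    meaningful = 64 - lead - trail
                    out.write(lead, 5)
                    out.write(meaningful, 6)
                    out.write(xor >> trail, meaningful)
                    prev_lead, prev_trail = lead, trail
            prev_bits = cur

    v = BitWriter()
    encode_values([12.0, 12.0, 24.0, 15.5, 12.0], v)
    print(len(v.bits), "bits vs", 5 * 64, "bits uncompressed")

Runs of identical values collapse to a single bit per point, which is what drives the 51% figure below.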

For Facebook's monitored data, the authors found that roughly 51% of the values are encoded with a single '0' bit (the value is unchanged), about 30% are compressed with the '10' control bits, and 19% with '11'.

The authors use in-memory data structures to store the time series data, with a lookup map from time series name to the data for quick access. When time series are deleted, they are tombstoned.
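A very simplified sketch of that lookup structure, assuming a plain dictionary keyed by series name and a tombstone flag; the series name is hypothetical, and the paper's actual in-memory layout is more elaborate (it also has to support the fast scans over all data listed in the requirements above).

    class Series:
        def __init__(self):
            self.points = []            # (timestamp, value) pairs; compressed blocks in reality
            self.tombstoned = False

    class TimeSeriesStore:
        def __init__(self):
            self.by_name = {}           # lookup map: series name -> Series

        def append(self, key, timestamp, value):
            s = self.by_name.setdefault(key, Series())
            if not s.tombstoned:
                s.points.append((timestamp, value))

        def delete(self, key):
            # deletes only mark the series dead; the entry is kept around
            if key in self.by_name:
                self.by_name[key].tombstoned = True

        def read(self, key):
            s = self.by_name.get(key)
            return None if s is None or s.tombstoned else s.points

    store = TimeSeriesStore()
    store.append("web.frontend.errors", 1465839830, 12.0)
    print(store.read("web.frontend.errors"))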

The in-memory data structures are backed by disk, and the data is stored in GlusterFS with 3x replication. Each host owns multiple shards of data stored in a single directory per shard. The directory contains key lists, append only logs, complete block files, and checkpoints. The shard manager uses Paxos to assign shards to nodes.

Data is streamed between Gorilla instances without guaranteeing consistency between the geo locations. If one region fails, query traffic is sent to the other region, and is only moved back after the failed region has been healthy for 26 hours.

The authors then describe the tools that use Gorilla for monitoring and alerting. There is a low latency query processing tool, a correlation engine that compares the time series to millions of others to figure out what happened around the time a service broke, a charting service to visually find where the outliers are, and an aggregation service.

The authors end with their experiences operating Gorilla over the prior six months, including fault tolerance incidents and how Gorilla helped in diagnostic scenarios, and close with the lessons they learned, namely:
  • Prioritize recent data over historical data
  • Read latency matters
  • High availability trumps resource efficiency
