
Posts

A paper a day keeps the doctor away: Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

Sixteen years ago, Eric Brewer introduced what is now known as the CAP theorem, which states that it is impossible for a web service to simultaneously guarantee consistency, availability, and partition tolerance. The conjecture was based on Brewer's experiences at Inktomi--a search engine company he cofounded--and was published without proof. Gilbert and Lynch presented one in their paper "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." The paper is a good theoretical read, and the proofs the authors present are very tractable. They begin by formalizing the concepts of consistency (the authors use atomic in the paper), availability, and partition tolerance. For a consistent service, there is a total order on all operations such that each operation looks as if it were completed at a single instant. For availability, every request received by a non-failing node in the system must result in a response. Finally f...
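The intuition behind the proof, which the excerpt cuts off before reaching, can be sketched with a toy simulation: two replicas separated by a partition cannot both answer every request and remain atomically consistent. The replica names and message model below are invented for illustration; this is a minimal sketch of the argument, not the paper's formalism.

```python
# Toy illustration of the CAP intuition: replicas G1 and G2 cannot exchange
# messages during a partition. If both keep answering requests (availability),
# a read served by G2 can miss a write acknowledged by G1, so the execution
# is not atomically consistent.

class Replica:
    def __init__(self, name):
        self.name = name
        self.value = 0              # the single register both replicas serve

    def write(self, value):
        self.value = value          # acknowledged locally; replication assumed blocked
        return "ack"

    def read(self):
        return self.value           # must respond to count as available

g1, g2 = Replica("G1"), Replica("G2")
partitioned = True                  # the network drops all messages between G1 and G2

assert g1.write(1) == "ack"         # a client writes 1 to G1
if not partitioned:                 # replication would happen here, but cannot
    g2.value = g1.value

stale = g2.read()                   # a client reads from G2, which stays available
print(f"G1 acknowledged the write of 1, but G2 returned {stale}")  # prints 0
```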

A paper a day keeps the doctor away: Medians and Beyond: New Aggregation Techniques for Sensor Networks

The Internet of Things has spurred renewed interest in sensors and sensor telemetry. In each IoT application, modern sensors--ones that support sensing, computation, and communication--collect data about their environment and relay it back to a "base station". The base station aggregates data from all the sensors to make sense of the telemetry. Since most sensors are deployed in settings with severe constraints on power consumption and battery life, it is usually expensive to send all the raw telemetry data to the base station without aggregation or sampling at each sensor. For certain classes of metrics, such as counts, sums, and averages, the base station can combine the aggregates from each sensor and come up with meaningful overall statistics. However, for aggregates such as the median, mode, and percentiles, the base station's job becomes harder. The authors of the paper "Medians and Beyond: New Aggregation Techn...
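The distinction the excerpt draws between combinable and non-combinable aggregates can be seen in a few lines: per-sensor counts and sums merge exactly into the global count, sum, and mean, while the median of per-sensor medians is generally not the global median. The readings and helper below are made up for illustration, not taken from the paper.

```python
# Per-sensor summaries suffice for the global count, sum, and mean, but
# merging per-sensor medians does not recover the global median.
sensor_readings = [
    [1, 2, 3, 100],      # sensor A
    [4, 5, 6],           # sensor B
    [7, 8, 9, 10, 11],   # sensor C
]

# Each sensor ships only a (count, sum) summary instead of its raw readings.
summaries = [(len(r), sum(r)) for r in sensor_readings]
total_count = sum(c for c, _ in summaries)
total_sum = sum(s for _, s in summaries)
print("mean from summaries:", total_sum / total_count)

def median(xs):
    xs = sorted(xs)
    n = len(xs)
    return xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2

all_readings = [x for r in sensor_readings for x in r]
print("true median:", median(all_readings))                 # 6.5
print("median of per-sensor medians:",
      median([median(r) for r in sensor_readings]))         # 5, not the same
```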

A paper a day keeps the doctor away: NoDB

In most database systems, the user defines the shape of the data that is stored and queried using concepts such as entities and relations. The database system takes care of translating that shape into physical storage and managing its lifecycle. Most systems store data in the form of tuples, either in row format or broken down into columns and stored in columnar format. The system also stores metadata associated with the data that helps with speedy retrieval and processing. Defining the shape of the data a priori, and transforming it from the raw or ingestion format to the storage format, is a cost that database systems incur to make queries faster. What if we could have fast queries without incurring that initial cost? In the paper "NoDB: Efficient Query Execution on Raw Data Files", the authors examine that question and advocate a system (NoDB) that answers it. The authors start with the motivation for such a system. With the recent explosion of data...
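The question the paper asks can be made concrete with a rough sketch: the traditional path parses and loads the raw file into an internal format before any query runs, while the in-situ path evaluates the query directly over the raw file. The CSV contents, column names, and query below are invented; this is only a sketch of the idea, not NoDB's implementation.

```python
# Contrast between paying an up-front load/transform cost and querying the
# raw file in place.
import csv
import io

raw_file = io.StringIO(
    "id,temperature\n"
    "1,20.5\n"
    "2,21.0\n"
    "3,19.8\n"
)

def load_then_query(f):
    # Traditional approach: transform every row into an internal format
    # (here a list of dicts) before the query can run.
    table = [{"id": int(r["id"]), "temperature": float(r["temperature"])}
             for r in csv.DictReader(f)]
    return [r for r in table if r["temperature"] > 20.0]

def query_raw(f):
    # In-situ approach: evaluate the query over the raw file directly,
    # parsing only what the query needs and skipping the load step.
    for row in csv.DictReader(f):
        if float(row["temperature"]) > 20.0:
            yield row

print(load_then_query(raw_file))
raw_file.seek(0)
print(list(query_raw(raw_file)))
```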

A paper a day keeps the doctor away: MillWheel: Fault-Tolerant Stream Processing at Internet Scale

The recent data explosion, and the increase in appetite for fast results, spurred a lot of interest in low-latency data processing systems. One such system is MillWheel, presented in the paper "MillWheel: Fault-Tolerant Stream Processing at Internet Scale", which is widely used at Google. In MillWheel, users specify a directed computation graph that describes what they would like to do, and write application code that runs on each individual node in the graph. The system takes care of managing the flow of data within the graph, persisting the state of the computation, and handling any failures that occur, relieving users of that burden. MillWheel exposes a record-processing API that handles each record in an idempotent fashion, with exactly-once delivery semantics. The system checkpoints progress at a fine granularity, removing the need to buffer pending data at external senders. The authors describe the system using the Zeitgeist produ...
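The shape of the per-node application code the excerpt describes can be sketched as follows. The class, method names, and deduplication-by-record-id scheme are hypothetical stand-ins, not MillWheel's actual API; the point is only that user code handles one record at a time while duplicate deliveries and checkpointing are handled around it.

```python
# A toy node in a directed computation graph: per-record user code, with
# state persisted and records deduplicated by id so that redelivery after a
# failure leaves the result unchanged (idempotent, effectively exactly-once).

class CountingNode:
    def __init__(self):
        self.state = {}     # computation state to persist (key -> count)
        self.seen = set()   # record ids already processed, for deduplication

    def process_record(self, record_id, key):
        if record_id in self.seen:   # duplicate delivery: ignore it
            return
        self.state[key] = self.state.get(key, 0) + 1
        self.seen.add(record_id)
        self.checkpoint()

    def checkpoint(self):
        # A real system would durably persist state and seen ids here; this
        # stand-in only marks where the fine-grained checkpoint would go.
        pass

node = CountingNode()
for rid, key in [(1, "gorilla"), (2, "cap"), (1, "gorilla")]:  # record 1 redelivered
    node.process_record(rid, key)
print(node.state)   # {'gorilla': 1, 'cap': 1} despite the duplicate delivery
```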

A paper a day keeps the doctor away: Gorilla: A Fast, Scalable, In-Memory Time Series Database

Operating large-scale Internet services today is a challenge, and making sure that those services run well with minimal customer disruption is doubly so. Both require good visibility into how the individual service components are performing, which necessitates gathering and analyzing a lot of performance measurements. The measurements vary from metrics annotated with labels or dimensions that can be used to filter and group the results at query time, to exception stacks, log lines, and trace events. Collecting and analyzing such a large volume of metrics is the realm of time series databases, and the paper "Gorilla: A Fast, Scalable, In-Memory Time Series Database" presents such a system, which is in use at Facebook to handle monitoring and alerting for their vast infrastructure. In the paper the authors start by articulating the design principles for Gorilla: they wanted a system that is always available for writes; they tolerated th...
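The kind of measurement the excerpt mentions first, metrics annotated with labels that are filtered and grouped at query time, looks roughly like the sketch below. The label names, values, and query helper are invented for illustration and are not Gorilla's data model or API.

```python
# Metric points carry labels/dimensions; a query filters on some labels and
# groups the results by another, aggregating the values per group.
from collections import defaultdict

points = [
    {"metric": "latency_ms", "labels": {"dc": "east", "service": "web"}, "ts": 1, "value": 120},
    {"metric": "latency_ms", "labels": {"dc": "west", "service": "web"}, "ts": 1, "value": 95},
    {"metric": "latency_ms", "labels": {"dc": "east", "service": "web"}, "ts": 2, "value": 130},
    {"metric": "latency_ms", "labels": {"dc": "east", "service": "api"}, "ts": 2, "value": 80},
]

def query(points, metric, where, group_by):
    """Filter points by label values, then average the values per group."""
    groups = defaultdict(list)
    for p in points:
        if p["metric"] == metric and all(p["labels"].get(k) == v for k, v in where.items()):
            groups[p["labels"][group_by]].append(p["value"])
    return {g: sum(vs) / len(vs) for g, vs in groups.items()}

print(query(points, "latency_ms", where={"service": "web"}, group_by="dc"))
# {'east': 125.0, 'west': 95.0}
```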

A paper a day keeps the doctor away: FIT: A Distributed Database Performance Tradeoff

In distributed systems, the CAP theorem provides a framework for thinking about the consistency, availability, and partition tolerance guarantees a system can provide. In their paper "FIT: A Distributed Database Performance Tradeoff", Faleiro and Abadi present a similar framework for thinking about distributed database performance. The authors start with some intuition about distributed transactions: ones that rely on data that sits on different nodes in a distributed system. For a distributed transaction to guarantee atomicity, coordination between the participating nodes is required. This coordination presents system designers with a tradeoff between throughput and strong isolation: guaranteeing strong isolation impacts the system's throughput, while increasing throughput implies allowing transactions to execute concurrently in spite of conflicts. The authors introduce another variable, fairness, that interplays with the tradeoffs between strong ...
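Why conflicting transactions need coordination at all can be shown with a toy read-modify-write example: run serially (the effect coordination buys, at the cost of throughput and extra round trips), both updates survive; interleaved without coordination, one update is lost. The item name and values below are invented, and the example is single-process, so it only illustrates the conflict rather than the paper's model.

```python
# Two read-modify-write transactions over the same item.
store = {"counter": 0}   # imagine this item lives on a remote node

def serial(n_txns):
    # Coordinated/serial execution: each transaction reads and writes before
    # the next one starts, so every increment is preserved.
    store["counter"] = 0
    for _ in range(n_txns):
        v = store["counter"]        # read
        store["counter"] = v + 1    # write
    return store["counter"]

def interleaved(n_txns):
    # Uncoordinated execution: every transaction reads the old value before
    # any of them writes, so all but one increment is lost.
    store["counter"] = 0
    reads = [store["counter"] for _ in range(n_txns)]
    for v in reads:
        store["counter"] = v + 1
    return store["counter"]

print("with coordination (serial):", serial(2))   # 2
print("without coordination:", interleaved(2))    # 1, one update lost
```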

A paper a day keeps the doctor away: The 8 Requirements of Real-Time Stream Processing

In recent years there has been an explosion of data all around us. The data comes from a variety of sources, such as real-time financial systems, cell phone networks, sensor networks--RFID and IoT--and GPS. Commensurate with this dramatic increase in data is a corresponding unquenchable thirst for analysis and insights. The natural question arises: how do we build systems that process and make sense of this vast amount of data in as close to real time as possible? What patterns of software and systems should we look at? Michael Stonebraker of database fame and his co-authors offer some advice on what to consider in their paper "The 8 Requirements of Real-Time Stream Processing", published a decade ago. In the paper, the authors list eight guiding principles that high-volume, low-latency systems should follow to be able to process vast amounts of data in near real time. First, systems have to keep the data moving and do straight-through processing with minimal to no writ...
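That first principle can be illustrated with a small sketch: in the straight-through version, events flow through the operator as they arrive and results are emitted per event, whereas the store-then-process version writes the whole stream somewhere before a batch pass scans it. The event source and filter below are invented for illustration.

```python
# "Keep the data moving": process events as they arrive instead of storing
# the stream first and querying it later.

def event_source(n):
    """Stand-in for a live feed that yields events one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * 10}

def store_then_process(events):
    # Store-then-process: the whole stream is written to a store (here just a
    # list), and only afterwards does a batch pass compute results.
    store = list(events)
    return [e for e in store if e["value"] >= 20]

def straight_through(events):
    # Straight-through: each event passes through the operator as it arrives,
    # and matching events are emitted immediately.
    for e in events:
        if e["value"] >= 20:
            yield e

print(store_then_process(event_source(5)))
print(list(straight_through(event_source(5))))
```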