Posts

A paper a day keeps the doctor away: NoDB

In most database systems, the user defines the shape of the data that is stored and queried using concepts such as entities and relations. The database system takes care of translating that shape into physical storage and managing its lifecycle. Most systems store data in the form of tuples, either in row format or broken down into columns and stored in columnar format. The system also stores metadata associated with the data, which helps with speedy retrieval and processing. Defining the shape of the data a priori, and transforming it from the raw or ingestion format to the storage format, is a cost that database systems incur to make queries faster. What if we could have fast queries without incurring that initial cost? In the paper "NoDB: Efficient Query Execution on Raw Data Files", the authors examine that question and advocate a system (NoDB) that answers it. The authors start with the motivation for such a system. With the recent explosion of data...

A paper a day keeps the doctor away: MillWheel: Fault-Tolerant Stream Processing at Internet Scale

The recent data explosion and the growing appetite for fast results have spurred a lot of interest in low-latency data processing systems. One such system is MillWheel, which is widely used at Google and is presented in the paper "MillWheel: Fault-Tolerant Stream Processing at Internet Scale". In MillWheel, users specify a directed computation graph that describes what they would like to do, and write application code that runs on each individual node in the graph. The system takes care of managing the flow of data within the graph, persisting the state of the computation, and handling any failures that occur, relieving users of that burden. MillWheel exposes an API for record processing that handles each record in an idempotent fashion, with exactly-once delivery semantics. The system checkpoints progress at a fine granularity, removing the need to buffer data between external senders. The authors describe the system using the Zeitgeist produ...

A paper a day keeps the doctor away: Gorilla: A Fast, Scalable, In-Memory Time Series Database

Operating large-scale Internet services today is a challenge, and making sure that the services run well with minimal customer disruptions is doubly so. The reason is that both require good visibility into how the individual service components are performing, which necessitates gathering and analyzing a lot of measurements about the performance. The measurements vary from metrics annotated with labels or dimensions that can be used to filter and group the results at query time, to exception stacks, log lines, and trace events. Collecting and analyzing such a large amount of metrics is the realm of time series databases, and the paper "Gorilla, a fast, scalable, in-memory time series database" presents such a system, which is in use at Facebook for monitoring and alerting on their vast infrastructure. In the paper, the authors start by articulating the design principles for Gorilla: they wanted a system that is always available for writes; they tolerated th...

A paper a day keeps the doctor away: FIT: A Distributed Database Performance Tradeoff

In distributed systems, the CAP theorem provides a framework for thinking about the consistency, availability, and partition tolerance guarantees a system can provide. In their paper "FIT, a distributed database performance tradeoff", Faleiro and Abadi present a similar framework for thinking about distributed database performance. The authors start with some intuition about distributed transactions: transactions that rely on data residing on different nodes in a distributed system. For a distributed transaction to guarantee atomicity, coordination between the participating nodes is required. That coordination presents system designers with a tradeoff between throughput and strong isolation. Guaranteeing strong isolation impacts the system throughput, and increasing throughput would imply allowing transactions to execute concurrently in spite of the presence of conflicts. The authors introduce another variable, fairness, that interplays with the tradeoffs between strong ...

A paper a day keeps the doctor away: The 8 Requirements of Real-Time Stream Processing

In recent years there has been an explosion of data all around us. The data comes in from a variety of sources, such as real-time financial systems, cell phone networks, sensor networks (RFID and IoT), and GPS. Commensurate with this dramatic increase in data is a corresponding unquenchable thirst for analysis and insights. The natural question arises: how do we build systems that process and make sense of this vast amount of data, in as close to real-time as possible? What patterns of software and systems should we look at? Michael Stonebraker, of database fame, and his co-authors offer some advice on what to consider in their paper "The 8 requirements of real-time stream processing", published a decade ago. In the paper, the authors list eight guiding principles that high-volume low-latency systems should follow to be able to process vast amounts of data in near real-time. First, the systems have to keep the data moving, and do straight-through processing with minimal to no writ...

Why do you need to warm diesel engines

In the Pacific Northwest, a lot of people use their trucks as everyday commute vehicles, which makes sense: the climate is wet, the terrain is hilly, and in wet and cold conditions people feel safer in their four-wheel-drive vehicles. Some of these trucks are the heavy-duty ones, with big diesel engines, and lately I have noticed a couple of them at work idling without a driver for at least 5 minutes. It got me curious about why you would need to idle a diesel engine, especially since modern gasoline engines do not require idling before driving off and putting load on the engine. A web search helped piece together the answer to this puzzle. Diesels operate differently than gasoline engines. Instead of relying on spark plugs to light up the air and fuel mixture inside of the engine cylinders, diesels rely on high compression ratios that cause the air and fuel mixture inside of the cylinders to ignite. Because of the high compression ratios, diesel engines are typically bulkier and more stu...

A week with Edge

I have been using Internet Explorer ever since I switched back to Windows, and have been satisfied with it. Apart from its end-of-life status, and a couple of annoying bugs when I have more than 10 tabs open, it has served me well. With the latest Windows 10 update, I wanted to try the next-generation browser: Edge. Going in, I knew that Edge was not a finished product, and that it had a long way to go before it could compete with the other established browsers on the market. Nevertheless, I decided to give it a try. My first experiences with it were positive: it is lightweight and very fast, and when I have many tabs open it does not suffer the same fate as IE does, where the browser hangs randomly and the abominable recover web page ribbon appears at the bottom of the screen. I was also surprised that I did not end up using the cool new features, such as the readability view and web notes, as much as I thought I would. I liked the integration with Cortana through the context menu, which I ca...