Friday, April 14, 2017

A paper a day keeps the dr away: FaRM -- Fast Remote Memory

Distributed systems have allowed applications to use more computation power, memory, and physical storage than is available on a single machine, enabling applications to tackle more complex problems. The capacity increase however comes at a cost: accessing remote resources is slower than accessing ones that are local to the machine. The paper “FaRM:Fast Remote Memory” tries to address the cost of accessing remote memory, and ways to make it faster.
The authors start by acknowledging that the major cost of accessing remote memory is the networking cost between machines through the TCP/IP stack, and that faster networks could do so much. They cite the case of MemC3—a state of the art key-value store—which performed 7x worse in a client-server setup than in a single machine setup, despite request batching. The authors ask the question if the TCP/IP stack overhead is that high, what if you bypass the complex protocol stacks, and use RDMA—remote direct memory access—to access memory on another machine? How would the performance look? The rest of the paper explores that question, and introduces FaRM: fast remote access memory.
The authors start with some background on RDMA. In RDMA, requests are sent between nodes over queue-pairs, and network failures are exposed as terminated connections. The requests go directly to the NIC, without involving the kernel, and are serviced by the remote NIC without the involvement of the CPU. Similar to DMA—direct memory access—a memory region is registered with the NIC before use, and the NIC driver pins the memory regions in physical memory, and stores virtual to physical page mappings in a page table in the NIC. When an RDMA request is received, the NIC gets the page table for the target, and uses DMA to access the memory.  Since NICs have limited memory for page tables, the tables are stored in system memory, and the NIC memory acts a cache. RDMA connects to remote machines typically over InfiniBand, but recently RoCE—RDMA over converged Ethernet—is becoming more attractive, with flow control, congestion notification, and a much cheaper price--$10/Gbps for 40 Gbps RoCE compared to $60/Gpbs for 10 Gbps Ethernet.
FaRM uses a circular buffer to implement a uni-directional channel. The buffer is stored on the receiver, and there is one buffer for each sender/receiver pair. The sender uses RDMA to write messages to the buffer tail, and advances the tail pointer on every send. It also maintains a local copy of the head pointer to prevent writing messages past the head. The receiver updates the head in the sender’s copy using RDMA as well, to create space in the circular buffer. The receiver polls for new items at the head of the buffer, and process them creating space as needed. The authors indicate that the polling overhead is negligible with 78 machines. They found that at that scale, the RDMA writes and polling significantly outperform the complex InfiniBand send and receive verbs. The authors ran a micro-benchmark to compare the performance of FaRM communication with TCP/IP on a cluster of 20 machines connected by a 40 Gbps RoCE network. The results show that FaRM’s RDMA based messaging achieves an 9x-11x higher request rate than TCP/IP for request sizes between 16 and 512 bytes. Another latency micro-benchmark showed that TCP/IP latency at peak request rate is 145x higher than that of RDMA based messaging for all request sizes.
To achieve that high performance, the authors had to do some optimizations. The first was using larger pages to reduce the number of entries in the NIC page tables by implementing a kernel driver for Windows and Linux that allocates large number of physically contiguous and aligned 2GB memory regions at boot time.
The authors also optimized the number of queue-pair data, by reusing a single connection between a thread and each remote machine, and sharing queue-pairs among many threads in a machine.
The authors introduce the FaRM API, which provides an event based programming model, with operations that require polling to complete a task taking a continuation argument—continuation function, and context pointer. The continuation function is called when the operation is done, and is passed the result of the operation and the context pointer. FaRM API also provides convenience functions to allocate and free objects, and support lock-free operations.
The authors use the APIs to implement a distributed hashtable, and a graph store similar to Facebook’s Tao, and evaluate the performance of both.
The experiments used an isolated cluster with 20 machines, with 40Gbps RoCE NIC. The machines ran Windows Server 2012RC, on 2.4GHz Xeon CPUs with 8 cores and two hyper-threads per core, and a total of 128GB of DRAM, and 240GB SSDs.
For the key-value store experiments, the authors used 120 million key-value pairs per machine, and configured the hash stores for 90% occupancy, and measured the performance for 1 min after a 20 second warm-up. The results show that FaRM achieves 146 million lookups per second with a latency of 35 microseconds. The throughput is an order of magnitude higher than MemC3, and the latency two orders of magnitude lower.
For the graph store, the authors implemented a store similar to Tao, and used Facebook’s LinkBench with its default parameters for degree and data size distributions, with a resulting graph with 1 billion nodes and 4.35 billion edges. On the 20 machine clusters, the authors got 126 million operations per second, with a per machine throughput of 6.3 million operations per second that is 10x that reported for Tao. The average latency at peak was 41 microseconds, which is 40-50x lower than the reported Tao latencies.
The authors end by describing some of the systems and libraries that use RDMA to improve performance.

Friday, March 24, 2017

Virtual machine could not be started because the hypervisor is not running


I wanted to experiment with TensorFlow, and decided to do that in a Linux VM, despite the fact that Windows Subsystem for Linux exists. In the past I used Sun’s, and then Oracle’s VirtualBox to manage virtual machines, but since my Windows install had Hyper-V, I decided to use that instead. The virtual machine configuration was easy, with disk, networking, and memory configurations non-eventful. However when I tried to start the virtual machine to setup Ubuntu from an ISO, I was greeted with the following error:

“Virtual machine could not be started because the hypervisor is not running”

A quick Internet search revealed that a lot of people have faced that problem, and most of the community board solutions did not make any sense. The hidden gem is this technet article, which included detailed steps to find if the Windows Hypervisor was running or not, and the error message if it failed to launch. In my case, the error was:

“Hyper-V launch failed; Either VMX not present or not enabled in BIOS."

The fix here is easy, and buried in another technet article. Simply reboot the machine entering BIOS setup mode, and disable VT-d and trusted execution settings. After a quick reboot, the hypervisor is happily humming along, and the setup of my Ubuntu VM is complete.

Monday, January 30, 2017

On brewing tea

I watched a video interview with the 10th heir of Twinings Tea Company, that has been merchandising tea for over 300 years. In the interview, among talking about the family history, and the story behind their bestselling tea flavor—Earl Grey—he talked about the best way to brew tea, whether using loose leaves, or a tea bag.

To extract the most flavor out of tea, he recommended bringing cold water to a boil, and removing the kettle off the stove once the water starts boiling. His theory is that the flavor is extracted through the air in the water, and continuing to boil the water further, will reduce the amount of air in it.

For green teas, he recommends letting the kettle set for 5 mins, then pouring the hot water over the tea, and for black teas, he recommends pouring the hot water immediately over the tea. The heir advised against removing the bag, or repeatedly dunking it in the water during brewing, because that only changes the color of the water, and makes the tea bitter without extracting flavor. On the contrary, he recommends leaving the tea bag still for 3 minutes, and then throwing it away, enjoying the flavorful tea, adding milk, or lemon, but never sugar, as it masks the tea flavor.


I followed his advice verbatim, and while I am not sure if the effect is psychological or real, I drank the best cup of tea in years. No bitterness, no sweetness, just great tea flavor.

Thursday, January 26, 2017

Random acts of kindness

When I have the chance, I like to walk to my meetings instead of using the shuttle service available on campus. When it is not raining, the walk is very refreshing: I get to clear out my thoughts on the walk, and get in some number of steps for my daily activity.  After one of my meetings ended, I started to head back to my building, only to see that it started to down pour. To my luck, there was a shuttle parked upfront. I asked the driver if she could take me back to my building, and she said she was on her lunch break. As I said no worries, I’ll just walk back, she insisted that she can drive me. I hopped in the shuttle, thanking her profusely for taking the time from her lunch break to drive me back, she insisted it was not a big deal. Such an act of kindness made my day, and it is a great reminder to continue doing good things to others, simply for the joy it brings them.

Friday, December 30, 2016

A hole in the wall


I am a big fan of good and delicious food, irrespective of where it is sold. That includes street vendors, and “holes in the wall,” which I have always associated with small nondescript places, with no signs on the venue, no place to sit, and a staff that exudes a slightly higher risk of contracting dysentery, typhoid, or other gastrointestinal diseases. That description might be a bit extreme, but I had some of the best meals in similar places, including the famous Hyderabadi Dum-Biryani in a place not so far from that description.

So where did the phrase a “hole in the wall” come from? On another historical tour of Florence, our tour guide and language enthusiast pointed out some of the palaces where Italian nobility such as the Medici family lived long time ago. Invariably at the entrance there was a slit or a hole in the wall, and the tour guide told us the story that after the nobility hosted lavish dinner parties, instead of throwing the remaining food away, they would give it to the unfortunate lining up in front of the palace through that small hole in the wall of the building. Since the food was delicious, eating at the hole in the wall was sought after during these times, and the tour guide surmised that this is the origin of the phrase. I could not verify that claim, however one site online lists a similar story:

              the hole made in the wall of the debtors' or other prisons, through which the poor prisoners received the money, broken meat, or other donations of the charitably inclined”

Regardless of the origin of the phrase, the story and the imagery were vivid, and they stuck with me.

Thursday, December 29, 2016

A paper a day keeps the doctor away: Efficient Commit Protocols for the Tree of Processes Model of Distributed Transactions

The two-phase commit protocol is widely known in the database community, yet despite its notoriety, finding the paper that first described it proved pretty difficult. The closest find is the paper about commit protocols for distributed transactions, which describes the protocol in detail, and introduces two additional modifications. The paper references the original publications of the protocol by Gray, Lamport, and others, however I could not find these notes  online.

The paper describes the 2PC protocol, both under normal operation, and when failures occur. For normal operation, when the user decides to commit a transaction, the coordinator initiates the first phase of the protocol--the prepare phase, by sending prepare messages to all its subordinates. Each subordinate that can commit the transaction, writes a prepare log record, and sends an ACK back to the coordinator. The subordinates that can't commit the transaction, write an abort log record, and  respond with an Abort message to the coordinator. An abort message vetoes the transaction, and so the subordinates that cannot commit the transaction can just abort it, and forget about it.

After the coordinator receives all the responses from its subordinates, it initiates the second phase of the protocol. If all responses were YES votes, the coordinator moves to the commit phase, where it writes a commit record, and sends a commit message to all the subordinates.  Upon receiving the commit message, the subordinates write a commit log, send an ACK to the coordinator, and commit the transaction.

On the other hand if one subordinate vetoed the transaction, the coordinator moves to the abort phase, writes an abort record, and sends abort messages to all the subordinates that are in the prepared state. Each subordinate writes an abort record, sends an ACK to the coordinator, and aborts the transaction. Once the coordinator receives all the ACK messages from its subordinates, it writes an end record, and forgets about the transaction.

In the protocol, all record writes happen before sending messages, which minimizes communications with the coordinator when recovering from failure.

With all these messages going around it is hard to envision that everything will go on smoothly. The authors then describe the 2PC protocol in the presence of failures due to site and networking issues. The authors assume that as part of the recovery, there is a process that reads the logs on stable storage and accumulates information about the executing transactions at the time of the crash. This information is used to respond to queries from other sites. The authors then present a comprehensive state machine of where the transaction failed during the 2PC protocol, and how to recover from it.  For example, if the transaction was in the prepared state, the recovery process tries to contact the coordinator to see how to proceed with the transaction. When the recovery site responds, the recovery process proceeds with handling the Commit/Abort response according to the 2PC in the absence of failures. If the recovery process finds a transaction without a commit log, it rolls back the transaction. If it finds a transaction in the committing/aborting states--when the node is acting as a coordinator, before the crash--the recovery process periodically tries to send Commit/Abort messages to the subordinates that have not acknowledged yet. Once all ACKs are received, the recovery process ends the transaction, and moves along.

The authors then present modifications of the 2PC commit that optimize the messages sent between the coordinators, and the subordinates. They observe that in the absence of any information in the crashed site about a transaction, the correct response is to abort the transaction. This observation leads to the presumed abort protocol. The protocol takes advantage of knowing that some subordinates might execute complete and partial read-only transactions: ones where there is no UNDO/REDO logs written. For these transactions, we can skip parts of the 2PC protocol. For example, if a subordinate during a prepare statement finds the transaction read-only, it issues a READ vote to the coordinator, and forgets the transaction.  The coordinator then does not include the subordinate in the COMMIT/ABORT phase of the protocol. The coordinator also skips the second phase of the 2PC protocol if it is READ only, and gets READ votes from its subordinates. The authors present other states of the presumed abort protocol, and what messages in the 2PC protocol are skipped.

The authors then examine what would happen if they eliminated the ACKs to the COMMIT messages. The observations lead to the presumed commit modification of the 2PC protocol. In presumed commit, the coordinator behaves similar to the presumed abort protocol, with minor modifications:
  • Before sending a prepare statement, the coordinator collects the names of all subordinates
  • It writes both abort and commit records
  • It requires ACKs for aborts, and not for commits
  • It writes an end record after aborts and not for commits
  • For read only transactions, it does not write any record

The authors end by comparing the log I/O performance, and messages sent of the 2PC protocol, vs Presumed Abort, and Presumed Commit protocols, and describing how to extend the protocols to multi-level tree of processes, where non-leaf nodes act as coordinators as well as subordinates, while leaf nodes act as subordinates, with the root node as a coordinator.

Thursday, December 22, 2016

The wrong end of the stick


One of my favorite activities while traveling is to take a walking tour of the city I am visiting. The tour usually consists of a small group led by a tour guide, who invariably is a student of art or history studying abroad, or an expat humanities graduate who is living abroad and is augmenting their income by giving tours. The tours are always enjoyable, combining stories about the city and its history, architecture, and cultural spots with frequent stops to coffee and dessert shops. Sometimes you get a special tour guide, who in addition to being a history buff, is also a linguistics enthusiast. When that happens, you hear special stories about the historical origin of phrases: something I am very interested in.
In Rome, I had such a tour guide, and the story stuck with me, although I could not verify its accuracy. I could find one website that has a similar reference to the story. It was hilarious and I remembered it to this day. It is the story of the origin of the phrase “the wrong end of the stick.” The tour guide explained that in the old Roman empire, before the advent of toilet paper and private sanitation, people used to go to public toilets to relieve themselves. When they were done, they would wipe themselves using a stick with a sponge at the end, and pass the sticks around after cleaning up. You can imagine how you’d feel if you grabbed the stick by the wrong end.