
A paper a day keeps the dr away: FaRM -- Fast Remote Memory

Distributed systems have allowed applications to use more computation power, memory, and physical storage than is available on a single machine, enabling applications to tackle more complex problems. The capacity increase, however, comes at a cost: accessing remote resources is slower than accessing ones that are local to the machine. The paper “FaRM: Fast Remote Memory” addresses the cost of accessing remote memory and explores ways to make it faster.
The authors start by acknowledging that the major cost of accessing remote memory is the networking cost of going between machines through the TCP/IP stack, and that faster networks alone can only do so much. They cite the case of MemC3—a state-of-the-art key-value store—which performed 7x worse in a client-server setup than in a single-machine setup, despite request batching. The authors then ask: if the TCP/IP stack overhead is that high, what if you bypass the complex protocol stacks and use RDMA—remote direct memory access—to access memory on another machine? How would the performance look? The rest of the paper explores that question and introduces FaRM: Fast Remote Memory.
The authors start with some background on RDMA. In RDMA, requests are sent between nodes over queue pairs, and network failures are exposed as terminated connections. The requests go directly to the NIC without involving the kernel, and are serviced by the remote NIC without involving the remote CPU. Similar to DMA—direct memory access—a memory region is registered with the NIC before use; the NIC driver pins the region in physical memory and stores the virtual-to-physical page mappings in a page table on the NIC. When an RDMA request arrives, the NIC looks up the page table for the target region and uses DMA to access the memory. Since NICs have limited memory for page tables, the tables are stored in system memory and the NIC memory acts as a cache. RDMA has typically been deployed over InfiniBand, but recently RoCE—RDMA over converged Ethernet—has become more attractive, offering flow control, congestion notification, and a much lower price: $10/Gbps for 40 Gbps RoCE compared to $60/Gbps for 10 Gbps Ethernet.
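To make the mechanics concrete, here is a minimal sketch using the standard libibverbs API (not FaRM's own code): it registers a local buffer with the NIC and posts a one-sided RDMA read against a remote region. It is a fragment, not a full program—it assumes the queue pair has already been connected, and that the remote side's address and rkey were exchanged out of band.

```cpp
// Illustrative libibverbs fragment: register memory and post a one-sided
// RDMA read. Connection setup and rkey exchange are assumed to have
// happened elsewhere.
#include <infiniband/verbs.h>
#include <cstdint>

// Register a local buffer with the NIC so it can be a source/target of DMA.
// The driver pins the pages and installs the virtual-to-physical mappings
// in the NIC page table, as described above.
ibv_mr* register_region(ibv_pd* pd, void* buf, size_t len) {
  return ibv_reg_mr(pd, buf, len,
                    IBV_ACCESS_LOCAL_WRITE |
                    IBV_ACCESS_REMOTE_READ |
                    IBV_ACCESS_REMOTE_WRITE);
}

// Post a one-sided RDMA read: the remote NIC services it with DMA,
// without involving the remote CPU, and the local kernel is bypassed.
int post_rdma_read(ibv_qp* qp, ibv_mr* local_mr, void* local_buf,
                   uint32_t len, uint64_t remote_addr, uint32_t remote_rkey) {
  ibv_sge sge{};
  sge.addr   = reinterpret_cast<uint64_t>(local_buf);
  sge.length = len;
  sge.lkey   = local_mr->lkey;

  ibv_send_wr wr{};
  wr.opcode              = IBV_WR_RDMA_READ;
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.send_flags          = IBV_SEND_SIGNALED;   // completion reported on the CQ
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey        = remote_rkey;

  ibv_send_wr* bad = nullptr;
  return ibv_post_send(qp, &wr, &bad);
}
```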
FaRM uses a circular buffer to implement a uni-directional message channel. The buffer is stored on the receiver, and there is one buffer for each sender/receiver pair. The sender uses RDMA to write messages to the buffer tail and advances the tail pointer on every send. It also maintains a local copy of the head pointer to prevent writing messages past the head; the receiver updates this copy, also using RDMA, as it frees space in the circular buffer. The receiver polls for new items at the head of the buffer and processes them, creating space as needed. The authors indicate that the polling overhead is negligible even with 78 machines, and that at that scale the RDMA writes and polling significantly outperform the more complex InfiniBand send and receive verbs. The authors ran a micro-benchmark to compare the performance of FaRM communication with TCP/IP on a cluster of 20 machines connected by a 40 Gbps RoCE network. The results show that FaRM’s RDMA-based messaging achieves a 9x-11x higher request rate than TCP/IP for request sizes between 16 and 512 bytes. Another latency micro-benchmark showed that TCP/IP latency at peak request rate is 145x higher than that of RDMA-based messaging for all request sizes.
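The head/tail bookkeeping is easy to see in a small sketch. The following is illustrative only, with made-up names and fixed-size message slots: a plain memcpy stands in for the sender's RDMA write into the receiver's buffer, and in real FaRM the receiver detects a new message by polling a header word the sender writes, not by reading the sender's tail.

```cpp
// Sketch of a one-way ring-buffer channel in the style described above.
#include <array>
#include <cstdint>
#include <cstdio>
#include <cstring>

constexpr size_t kSlots = 64;
constexpr size_t kSlotSize = 128;

struct Channel {
  // One buffer per sender/receiver pair; in FaRM it lives in the receiver's
  // RDMA-registered memory and the sender fills it with RDMA writes.
  std::array<std::array<uint8_t, kSlotSize>, kSlots> slots{};
  uint64_t tail = 0;              // sender-owned: next slot to write
  uint64_t head = 0;              // receiver-owned: next slot to read
  uint64_t sender_head_copy = 0;  // sender's local copy of head, updated by the receiver
};

// Sender: write into the slot at the tail unless the ring is full.
bool send(Channel& ch, const uint8_t* msg, size_t len) {
  if (ch.tail - ch.sender_head_copy == kSlots) return false;  // would pass the head
  std::memcpy(ch.slots[ch.tail % kSlots].data(), msg, len);   // an RDMA write in FaRM
  ch.tail++;
  return true;
}

// Receiver: poll the slot at the head and free the space once processed.
bool poll(Channel& ch, uint8_t* out, size_t len) {
  if (ch.head == ch.tail) return false;          // nothing new yet
  std::memcpy(out, ch.slots[ch.head % kSlots].data(), len);
  ch.head++;
  ch.sender_head_copy = ch.head;                 // an RDMA write back to the sender in FaRM
  return true;
}

int main() {
  Channel ch;
  const uint8_t msg[kSlotSize] = {42};
  uint8_t out[kSlotSize] = {};
  send(ch, msg, sizeof(msg));
  if (poll(ch, out, sizeof(out))) std::printf("received %u\n", out[0]);
  return 0;
}
```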
To achieve that high performance, the authors had to apply a few optimizations. The first was using larger pages to reduce the number of entries in the NIC page tables: they implemented a kernel driver for Windows and Linux that allocates a large number of physically contiguous, aligned 2GB memory regions at boot time.
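The paper's custom driver reserves those regions at boot; as a rough illustration of the same idea—fewer, larger page-table entries—here is a sketch that maps a 2GB region backed by Linux huge pages with mmap. The numbers in the comments are simple arithmetic, not figures from the paper.

```cpp
// Illustrative only: map a 2GB region backed by huge pages (MAP_HUGETLB uses
// the system's default huge page size, commonly 2MB). With 2MB pages the
// region needs ~1,024 page-table entries instead of ~524,288 4KB entries;
// FaRM's physically contiguous 2GB regions reduce the count even further.
#include <sys/mman.h>
#include <cstdio>

int main() {
  const size_t kRegionSize = 2UL << 30;  // 2 GB
  void* region = mmap(nullptr, kRegionSize,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                      -1, 0);
  if (region == MAP_FAILED) {
    perror("mmap");                      // requires huge pages to be reserved
    return 1;
  }
  std::printf("mapped %zu bytes at %p\n", kRegionSize, region);
  munmap(region, kRegionSize);
  return 0;
}
```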
The authors also reduced the amount of queue-pair state cached on the NIC by using a single connection between a thread and each remote machine, and by sharing queue pairs among multiple threads on a machine.
The authors then introduce the FaRM API, which provides an event-based programming model: operations that complete asynchronously take a continuation argument—a continuation function and a context pointer. The continuation function is called when the operation is done, and is passed the result of the operation along with the context pointer. The FaRM API also provides convenience functions to allocate and free objects, and supports lock-free operations.
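The following sketch shows what such an event-based, continuation-passing read might look like. The types, names, and signatures are made up for illustration—they are not FaRM's actual API—and the "remote read" completes immediately instead of being a real RDMA operation.

```cpp
// Sketch of a continuation-passing, event-based API in the style described above.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>

struct Result { bool ok; const void* data; };
using Continuation = std::function<void(const Result&, void* context)>;

struct PendingOp { Result result; Continuation cont; void* context; };

class EventLoop {
 public:
  // Issue an asynchronous read of a remote object; the continuation runs
  // later, when polling observes that the (hypothetical) RDMA read completed.
  void read_async(uint64_t remote_addr, Continuation cont, void* context) {
    static const char payload[] = "remote object bytes";  // stand-in for real data
    pending_.push({{true, payload}, std::move(cont), context});
    std::printf("posted read of 0x%llx\n", (unsigned long long)remote_addr);
  }

  // Poll for completed operations and invoke their continuations.
  void poll() {
    while (!pending_.empty()) {
      PendingOp op = std::move(pending_.front());
      pending_.pop();
      op.cont(op.result, op.context);
    }
  }

 private:
  std::queue<PendingOp> pending_;
};

int main() {
  EventLoop loop;
  int request_id = 42;  // context pointer handed back to the continuation
  loop.read_async(0x1000,
                  [](const Result& r, void* ctx) {
                    std::printf("request %d done, ok=%d\n",
                                *static_cast<int*>(ctx), r.ok);
                  },
                  &request_id);
  loop.poll();  // the continuation fires here
  return 0;
}
```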
The authors use the API to implement a distributed hash table and a graph store similar to Facebook’s Tao, and evaluate the performance of both.
The experiments used an isolated cluster of 20 machines connected by 40 Gbps RoCE NICs. The machines ran Windows Server 2012 RC on 2.4GHz Xeon CPUs with 8 cores and two hyper-threads per core, with 128GB of DRAM and 240GB SSDs.
For the key-value store experiments, the authors used 120 million key-value pairs per machine, configured the hash stores for 90% occupancy, and measured performance for one minute after a 20-second warm-up. The results show that FaRM achieves 146 million lookups per second with a latency of 35 microseconds. The throughput is an order of magnitude higher than MemC3’s, and the latency two orders of magnitude lower.
For the graph store, the authors implemented a store similar to Tao and used Facebook’s LinkBench with its default parameters for degree and data size distributions, resulting in a graph with 1 billion nodes and 4.35 billion edges. On the 20-machine cluster, the authors measured 126 million operations per second, with a per-machine throughput of 6.3 million operations per second, which is 10x that reported for Tao. The average latency at peak throughput was 41 microseconds, which is 40-50x lower than the reported Tao latencies.
The authors end by describing some of the systems and libraries that use RDMA to improve performance.
