Distributed systems have allowed applications to use more computation power, memory, and physical storage than is available on a single machine, enabling them to tackle more complex problems. The capacity increase, however, comes at a cost: accessing remote resources is slower than accessing ones that are local to the machine. The paper "FaRM: Fast Remote Memory" addresses the cost of accessing remote memory and explores ways to make it faster.
The authors start by acknowledging that the major cost of accessing remote memory is the networking cost between machines through the TCP/IP stack, and that faster networks alone can only do so much. They cite the case of MemC3, a state-of-the-art key-value store, which performed 7x worse in a client-server setup than in a single-machine setup, despite request batching. The authors ask: if the TCP/IP stack overhead is that high, what happens if you bypass the complex protocol stack and use RDMA (remote direct memory access) to access memory on another machine directly? How would the performance look? The rest of the paper explores that question and introduces FaRM, Fast Remote Memory.
The authors start with some background on RDMA. In RDMA, requests are sent between nodes over queue pairs, and network failures are exposed as terminated connections. Requests go directly to the NIC, without involving the kernel, and are serviced by the remote NIC without involving the remote CPU. As with DMA (direct memory access), a memory region is registered with the NIC before use; the NIC driver pins the region in physical memory and stores the virtual-to-physical page mappings in a page table on the NIC. When an RDMA request is received, the NIC looks up the page table entries for the target region and uses DMA to access the memory. Since NICs have limited memory for page tables, the tables are stored in system memory and the NIC memory acts as a cache. RDMA typically connects remote machines over InfiniBand, but recently RoCE (RDMA over Converged Ethernet) is becoming more attractive, with flow control, congestion notification, and a much lower price: $10/Gbps for 40 Gbps RoCE compared to $60/Gbps for 10 Gbps Ethernet.
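To make the registration step concrete, here is a minimal sketch using the ibverbs API that underlies RDMA on InfiniBand and RoCE. This is my own illustration, not code from the paper; FaRM registers its regions through its own kernel driver, and real code would add error handling.

```cpp
// Minimal sketch of RDMA memory registration with libibverbs (not FaRM's code).
// Registering a buffer pins it in physical memory and installs its
// virtual-to-physical mappings in the NIC's page table, so a remote NIC
// can read or write it without involving the local CPU.
#include <infiniband/verbs.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int num_devices = 0;
    ibv_device **devices = ibv_get_device_list(&num_devices);
    if (devices == nullptr || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    ibv_context *ctx = ibv_open_device(devices[0]);
    ibv_pd *pd = ibv_alloc_pd(ctx);             // protection domain

    const size_t region_size = 1 << 20;         // 1 MB, for illustration
    void *buf = aligned_alloc(4096, region_size);

    // Pin the region and expose it for remote reads and writes.
    ibv_mr *mr = ibv_reg_mr(pd, buf, region_size,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);

    // mr->rkey is the key a remote sender includes in its RDMA requests.
    printf("registered region, rkey=%u\n", mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devices);
    free(buf);
    return 0;
}
```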
FaRM uses a circular buffer to implement a unidirectional channel. The buffer is stored on the receiver, and there is one buffer for each sender/receiver pair. The sender uses RDMA to write messages at the buffer tail and advances the tail pointer on every send. It also maintains a local copy of the head pointer to avoid writing messages past the head. The receiver updates the head in the sender's copy, also using RDMA, to free space in the circular buffer. The receiver polls for new items at the head of the buffer and processes them, freeing space as it goes. The authors indicate that the polling overhead is negligible even with 78 machines, and that at that scale RDMA writes and polling significantly outperform the more complex InfiniBand send and receive verbs. The authors ran a micro-benchmark comparing the performance of FaRM communication with TCP/IP on a cluster of 20 machines connected by a 40 Gbps RoCE network. The results show that FaRM's RDMA-based messaging achieves a 9x-11x higher request rate than TCP/IP for request sizes between 16 and 512 bytes. Another latency micro-benchmark showed that TCP/IP latency at the peak request rate is 145x higher than that of RDMA-based messaging for all request sizes.
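To make the messaging mechanism concrete, here is a rough sketch of the data layout and the send/poll logic. The structures and names are my own, the one-sided RDMA writes are simulated with plain memory copies, and details such as batched head updates are simplified; it is not FaRM's implementation.

```cpp
// Illustrative sketch of FaRM-style circular-buffer messaging.
// The ring lives in the receiver's registered memory; the sender writes
// into it with one-sided RDMA writes, and the receiver discovers new
// messages by polling.
#include <atomic>
#include <cstdint>
#include <cstring>

constexpr size_t kSlots = 1024;
constexpr size_t kSlotSize = 256;

struct Slot {
    std::atomic<uint32_t> len{0};   // 0 means "empty"; written last by the sender
    uint8_t payload[kSlotSize];
};

struct RingBuffer {                 // resides on the receiver, registered with the NIC
    Slot slots[kSlots];
};

struct SenderState {
    uint64_t tail = 0;              // next slot to write (sender-local)
    uint64_t head_copy = 0;         // receiver's head, pushed here by RDMA writes
};

// Sender side: returns false if the ring is full.
bool send(SenderState &s, RingBuffer &remote, const void *msg, uint32_t len) {
    if (s.tail - s.head_copy >= kSlots) return false;      // would overwrite unread data
    Slot &slot = remote.slots[s.tail % kSlots];
    // In FaRM this write is a one-sided RDMA write into the receiver's memory.
    std::memcpy(slot.payload, msg, len);
    slot.len.store(len, std::memory_order_release);          // publish the message
    ++s.tail;
    return true;
}

// Receiver side: poll the head slot, process anything new, and push the
// advanced head back to the sender (an RDMA write in FaRM).
template <typename Handler>
void poll(RingBuffer &ring, uint64_t &head, SenderState &sender, Handler handle) {
    Slot &slot = ring.slots[head % kSlots];
    uint32_t len = slot.len.load(std::memory_order_acquire);
    if (len == 0) return;                                    // nothing new
    handle(slot.payload, len);
    slot.len.store(0, std::memory_order_release);            // mark slot free
    ++head;
    sender.head_copy = head;  // in FaRM: RDMA write of the head into the sender's memory
}
```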
To achieve that high performance, the authors applied several optimizations. The first was using larger pages to reduce the number of entries in the NIC page tables: they implemented a kernel driver for Windows and Linux that allocates a large number of physically contiguous, aligned 2GB memory regions at boot time.
The authors also reduced the number of queue pairs by using a single connection between a thread and each remote machine, and by sharing queue pairs among multiple threads on a machine.
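The queue-pair sharing idea can be sketched roughly as follows. The pool structure, names, and locking scheme here are illustrative stand-ins, not FaRM's implementation; the point is that threads on a machine take turns on a small set of connections per remote machine instead of each thread owning its own.

```cpp
// Sketch of queue-pair sharing: a small pool of connections per remote
// machine, shared by all local threads under a mutex. Names are mine,
// not FaRM's.
#include <memory>
#include <mutex>
#include <vector>

struct QueuePair {
    int remote_machine = -1;   // stand-in for an RDMA connection to that machine
};

struct SharedQueuePair {
    QueuePair qp;
    std::mutex lock;           // serializes threads that share this connection
};

class ConnectionPool {
public:
    // qps_per_remote trades NIC cache footprint against lock contention:
    // fewer queue pairs fit better in the NIC's cache, more reduce contention.
    ConnectionPool(int num_remotes, int qps_per_remote)
        : qps_per_remote_(qps_per_remote) {
        pool_.reserve(static_cast<size_t>(num_remotes) * qps_per_remote);
        for (int r = 0; r < num_remotes; ++r)
            for (int i = 0; i < qps_per_remote; ++i) {
                auto sqp = std::make_unique<SharedQueuePair>();
                sqp->qp.remote_machine = r;
                pool_.push_back(std::move(sqp));
            }
    }

    // A thread borrows one of the shared connections to a remote machine,
    // holding its lock only while posting the RDMA request.
    template <typename Fn>
    void with_qp(int remote, int thread_id, Fn issue_rdma) {
        SharedQueuePair &sqp =
            *pool_[remote * qps_per_remote_ + thread_id % qps_per_remote_];
        std::lock_guard<std::mutex> guard(sqp.lock);
        issue_rdma(sqp.qp);
    }

private:
    int qps_per_remote_;
    std::vector<std::unique_ptr<SharedQueuePair>> pool_;
};
```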
The authors introduce the FaRM API, which provides an event-based programming model: operations that cannot complete immediately take a continuation argument, consisting of a continuation function and a context pointer. The continuation function is called when the operation completes, and is passed the result of the operation and the context pointer. The FaRM API also provides convenience functions to allocate and free objects, and supports lock-free operations.
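To give a feel for the continuation style, here is a small sketch of what application code against such an API might look like. The type and function names (farm_read, ReadResult, and so on) are my own stand-ins, not FaRM's actual API, and the stub completes synchronously instead of from a real polling loop.

```cpp
// Illustrative sketch of the continuation-based programming model.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct ReadResult {
    const void *data;   // pointer to the object's bytes
    bool ok;            // whether the read completed successfully
};

// A continuation is a function plus an opaque context pointer; it is
// invoked when the asynchronous operation completes.
using Continuation = void (*)(const ReadResult &result, void *context);

// Hypothetical asynchronous read of a remote object (stand-in, not FaRM's API).
void farm_read(uint64_t object_addr, size_t size, Continuation cont, void *context);

// Application code: request an object now, handle it when it arrives.
struct LookupState { uint64_t key; };

void on_read_done(const ReadResult &result, void *context) {
    auto *state = static_cast<LookupState *>(context);
    if (result.ok)
        printf("key %llu resolved\n", (unsigned long long)state->key);
    delete state;
}

void lookup(uint64_t key, uint64_t remote_addr) {
    auto *state = new LookupState{key};
    farm_read(remote_addr, 64, on_read_done, state);   // returns immediately
}

// Stub implementation for illustration only: completes at once with a dummy
// result. A real implementation would post an RDMA read and invoke the
// continuation later from the event/polling loop.
static char dummy_object[64];
void farm_read(uint64_t /*object_addr*/, size_t /*size*/,
               Continuation cont, void *context) {
    cont(ReadResult{dummy_object, true}, context);
}

int main() {
    lookup(42, /*remote_addr=*/0x1000);
    return 0;
}
```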
The authors use the API to implement a distributed hashtable and a graph store similar to Facebook's Tao, and evaluate the performance of both.
The experiments used an isolated cluster of 20 machines, each with a 40 Gbps RoCE NIC. The machines ran Windows Server 2012 RC on 2.4 GHz Xeon CPUs with 8 cores and two hyper-threads per core, with 128 GB of DRAM and 240 GB SSDs.
For the key-value store experiments, the authors used 120 million key-value pairs per machine, configured the hash stores for 90% occupancy, and measured performance for one minute after a 20-second warm-up. The results show that FaRM achieves 146 million lookups per second with a latency of 35 microseconds. The throughput is an order of magnitude higher than MemC3, and the latency two orders of magnitude lower.
For the graph store, the authors implemented a store similar to Tao and used Facebook's LinkBench with its default parameters for degree and data-size distributions, resulting in a graph with 1 billion nodes and 4.35 billion edges. On the 20-machine cluster, the authors measured 126 million operations per second, a per-machine throughput of 6.3 million operations per second, which is 10x that reported for Tao. The average latency at peak throughput was 41 microseconds, 40-50x lower than the reported Tao latencies.
The authors end
by describing some of the systems and libraries that use RDMA to improve
performance.