Modern Internet-scale applications are a challenge to monitor and diagnose. They are
usually composed of complex distributed systems built by multiple teams, sometimes
using different languages and technologies. When one component fails or misbehaves,
it becomes a nightmare to figure out what went wrong and where. Monitoring and
tracing systems aim to make that problem more tractable, and Dapper, Google's
tracing infrastructure for large-scale distributed systems, is one such system.
The paper starts by setting the context for Dapper through a real service:
"universal search". In universal search, the user types a query that gets federated
to multiple search backends such as web search, image search, local search, video
search, and news search, as well as advertising systems to display ads. The results
are then combined and presented back to the user. Thousands of machines could be
involved in answering a single query, and poor performance in any one of them
inflates the end-user latency. For services such as search, latency matters a great
deal, since end-users are very sensitive to it. How can one diagnose such latency
problems and pinpoint the offending sub-service?
Enter Dapper, Google's distributed tracing infrastructure. The authors start by
listing the system's requirements and design goals: low monitoring overhead,
application-level transparency, scalability, and low-latency availability of the
data (roughly within a minute of generation). The authors explain that Dapper
chooses application-level transparency over cooperative monitoring, where developers
write code to instrument their own components, because the latter is fragile due to
instrumentation omissions and bugs. To achieve transparency, Dapper restricts
tracing instrumentation to a small corpus of ubiquitous threading, control flow, and
RPC libraries. Dapper also uses adaptive sampling to scale the system to the vast
amount of telemetry generated and to reduce the overhead of collecting data. The
authors compare how Dapper differs from other distributed tracing systems such as
Pinpoint, Magpie, and X-Trace.
The authors then explain how Dapper stitches federated requests together, as in the
example of universal search, where a single query fans out to multiple services that
in turn may fan the query out to another tier of sub-services. They describe the two
approaches commonly used to establish the causal relationship between requests:
black-box schemes, which rely on statistical inference to reconstruct sub-request
relationships, and annotation-based schemes, where each request is annotated to help
form these relations. Dapper implements an annotation-based scheme, which is made
practical because most services at Google communicate uniformly over RPC. The
approach is not restrictive though, since one can instrument other protocols such as
HTTP, SMTP, etc. to the same effect.
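To make the annotation-based scheme concrete, here is a minimal sketch (in Python,
with invented names; Dapper itself lives inside Google's C++ and Java RPC libraries)
of how a trace id and span ids might be propagated alongside fanned-out sub-requests:

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class TraceContext:
    """Identifiers an annotation-based scheme attaches to every outgoing request."""
    trace_id: str             # shared by every span in the same trace
    span_id: str              # identifies the current unit of work
    parent_id: Optional[str]  # span that caused this one (None for the root)

def child_context(parent: TraceContext) -> TraceContext:
    """Derive the context to send with a sub-request (a fan-out to a backend)."""
    return TraceContext(trace_id=parent.trace_id,
                        span_id=uuid.uuid4().hex,
                        parent_id=parent.span_id)

# A front end handling "universal search" would create a root context ...
root = TraceContext(trace_id=uuid.uuid4().hex, span_id=uuid.uuid4().hex, parent_id=None)

# ... and pass a derived child context to each backend, e.g. as RPC metadata.
for backend in ("web", "image", "video", "news"):
    ctx = child_context(root)
    print(backend, ctx.trace_id, ctx.span_id, ctx.parent_id)
```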
Dapper models the
relationship between requests using concepts such as trees, spans, and
annotations.
In a trace, the basic unit of work is the span, identified by a name, a span id, and
a parent id. A single span models an RPC call, and spans are organized into a trace
tree through the causal relationships between the spans that fulfill the request.
For example, every call into an additional infrastructure layer adds another span at
a lower depth in the trace tree. A span contains information from each RPC, which
usually involves a client-server pair, along with the corresponding annotations
(client send/receive, server send/receive, and application-specific annotations).
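Here is a rough sketch of how spans with parent ids can be reassembled into a trace
tree; the field names and annotation labels are my own shorthand for the
client/server send/receive events the paper describes, not Dapper's actual format:

```python
import time
from collections import defaultdict

def make_span(trace_id, span_id, parent_id, name):
    """A hypothetical span record: identity, position in the tree, annotations."""
    return {"trace_id": trace_id, "span_id": span_id,
            "parent_id": parent_id, "name": name, "annotations": []}

def annotate(span, label):
    span["annotations"].append((label, time.time()))  # e.g. "client send"

# Build a tiny trace: frontend -> {web search, image search}
spans = [
    make_span("t1", "s1", None, "frontend.Search"),
    make_span("t1", "s2", "s1", "websearch.Query"),
    make_span("t1", "s3", "s1", "imagesearch.Query"),
]
for s in spans:
    for label in ("client send", "server recv", "server send", "client recv"):
        annotate(s, label)

# Reassemble the trace tree by following parent ids.
children = defaultdict(list)
for s in spans:
    children[s["parent_id"]].append(s)

def print_tree(parent_id=None, depth=0):
    for s in children[parent_id]:
        print("  " * depth + s["name"])
        print_tree(s["span_id"], depth + 1)

print_tree()
```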
Dapper
auto-instruments applications to build trace trees, with spans and annotations
at the following points:
- When a thread handles a traced control path
- Asynchronous calls through Google's callback libraries
- Communication through Google's RPC libraries
The trace data itself is language independent, and traces can combine spans from
processes written in C++ and Java.
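The key trick that makes the instrumentation transparent is keeping the trace
context in thread-local storage and copying it into deferred callbacks. A
hypothetical Python sketch of that idea (Dapper does this inside Google's threading,
callback, and RPC libraries, so application code never touches it):

```python
import threading

# Trace context kept in thread-local storage, so application code
# never has to pass it around explicitly.
_local = threading.local()

def current_context():
    return getattr(_local, "context", None)

def set_context(ctx):
    _local.context = ctx

def traced_callback(fn):
    """Wrap a callback so it runs with the context of the thread that created it."""
    creator_ctx = current_context()
    def wrapper(*args, **kwargs):
        previous = current_context()
        set_context(creator_ctx)   # restore the creator's trace context
        try:
            return fn(*args, **kwargs)
        finally:
            set_context(previous)
    return wrapper

# Usage: a callback created while tracing stays attributed to that trace,
# even if it later runs on a worker thread with no context of its own.
set_context({"trace_id": "t1", "span_id": "s1"})
cb = traced_callback(lambda: print("callback sees", current_context()))
set_context(None)
cb()
```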
The authors present the Dapper architecture, which implements a three-stage process:
- Instrumented binaries write span data to local disk
- Daemons pull the trace data from all production machines to Dapper collectors
- Collectors write traces to Bigtable, with trace ids as row keys and span ids as column keys
The median latency for this process, from when data is written locally to when it is
available in Bigtable, is 15 seconds.
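The storage layout is easy to picture with a toy stand-in for Bigtable, where a
whole trace is one sparse row keyed by trace id and each span occupies a column:

```python
# Toy stand-in for the Bigtable layout described in the paper:
# one row per trace id, one column per span id.
table = {}

def collector_write(span):
    """Roughly what a collector does after daemons pull span data from hosts."""
    row = table.setdefault(span["trace_id"], {})
    row[span["span_id"]] = span   # sparse columns: only spans that exist

collector_write({"trace_id": "t1", "span_id": "s1", "name": "frontend.Search"})
collector_write({"trace_id": "t1", "span_id": "s2", "name": "websearch.Query"})

# Reading a whole trace is then a single-row lookup by trace id.
print(sorted(table["t1"].keys()))   # ['s1', 's2']
```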
Dapper exposes an API that makes accessing trace data in Bigtable easy. For security
and privacy reasons, Dapper stores only the names of RPC methods, not their
payloads. The annotations API enables application developers to add payload
information on an opt-in basis if needed. The authors share some statistics on
Dapper's usage within Google, including usage of the annotations API.
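The paper shows the annotation API in C++ and Java; a rough Python analogue of the
opt-in pattern, with invented helper names, might look like this:

```python
# Hypothetical stand-in for Dapper's opt-in annotation API; the names here
# are invented for illustration.
_current_span = {"name": "frontend.Search", "annotations": []}

def record(message):
    """Attach an application-level note to the currently traced span, if any."""
    if _current_span is not None:   # no-op when the request isn't sampled
        _current_span["annotations"].append(message)

def lookup(request, cache):
    # The developer opts in to logging whatever payload data they consider safe.
    if request in cache:
        record("cache hit for " + request)
        return cache[request]
    record("cache miss for " + request)
    return None

lookup("dapper paper", cache={"dapper paper": "pdf"})
print(_current_span["annotations"])   # ['cache hit for dapper paper']
```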
The authors evaluate the telemetry overhead of the generation and collection stages,
as well as the effect on production workloads. The generation overhead comes from
creating and destroying spans and annotations, and persisting them to disk. The
authors share that root spans add roughly 200ns, and that span annotations add
negligible overhead (9ns-40ns) on a 2.2 GHz machine. The CPU overhead is roughly
0.3% in the worst-case scenario, and the networking overhead amounts to 0.01% of the
total network traffic. The latency overhead depends on the sampling rate: full
collection adds 16% to request latency, while sampling at 1/16 and below adds
negligible overhead. The authors found that in high-volume applications, a sampling
rate of 1/1024 still captures enough information for diagnostics.
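The reason sampling keeps the overhead this low is that the sampling decision is
made once per trace, at the root, and travels with the trace context, so unsampled
requests pay almost nothing. A back-of-the-envelope sketch (not Dapper's actual
implementation):

```python
import random

def start_trace(sampling_rate=1.0 / 1024):
    """Decide once at the root whether this trace is sampled; the decision is
    carried in the trace context so every downstream span agrees."""
    return {"trace_id": random.getrandbits(64),
            "sampled": random.random() < sampling_rate,
            "sampling_probability": sampling_rate}

# Spans are only created and persisted for sampled traces, which is why a
# 1/1024 rate keeps the CPU, network, and latency overhead negligible.
traces = [start_trace() for _ in range(100000)]
print(sum(t["sampled"] for t in traces))   # roughly 100000/1024, about 98
```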
For lower-traffic workloads, Dapper employs adaptive sampling, parameterized by the
desired rate of traces per unit time. With sampling, Dapper users generate about
1 TB of trace data per day, and the data is stored for two weeks. Traces record the
sampling probability that was used, which helps with analysis later.
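Recording the sampling probability in the trace is what makes that analysis work:
each sampled trace can be weighted by 1/p to estimate totals, even when different
workloads used different (adaptive) rates. A small illustration:

```python
# Each trace records the probability it was sampled with, so analysis can
# weight it by 1/p to estimate how many requests it represents.
sampled_traces = [
    {"service": "websearch",    "sampling_probability": 1.0 / 1024},
    {"service": "websearch",    "sampling_probability": 1.0 / 1024},
    {"service": "smallservice", "sampling_probability": 1.0 / 16},
]

estimated_requests = sum(1.0 / t["sampling_probability"] for t in sampled_traces)
print(int(estimated_requests))   # 1024 + 1024 + 16 = 2064 requests represented
```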
In addition to the collection infrastructure, the Dapper team built an ecosystem of
tools that makes accessing and analyzing the data a lot easier, including a Depot
API that provides trace access by id, bulk access through MapReduce operations, and
indexed access. Dapper also provides a web interface for users to interact with the
data.
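A toy illustration of those three access patterns, with invented names and a dict
standing in for the trace repository:

```python
# Sketch of the access patterns the authors describe for the Depot API:
# access by trace id, bulk scans, and index lookups.
traces = {
    "t1": {"service": "websearch", "root": "frontend.Search"},
    "t2": {"service": "ads",       "root": "adsreview.Check"},
}
index_by_service = {"websearch": ["t1"], "ads": ["t2"]}

def get_trace(trace_id):
    """Access by id: a single-row lookup."""
    return traces.get(trace_id)

def bulk_scan():
    """Bulk access: iterate over all traces, as a MapReduce job would."""
    for trace_id, trace in traces.items():
        yield trace_id, trace

def traces_for_service(service):
    """Indexed access: a service index avoids scanning the whole repository."""
    return [traces[t] for t in index_by_service.get(service, [])]

print(get_trace("t1")["root"])
print(len(list(bulk_scan())))
print(traces_for_service("ads"))
```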
The authors end by cataloguing Dapper usage within Google, from its use during the
development phase of the Ads Review service to improve performance and discover
bottlenecks, to addressing long-tail latency, inferring service dependencies, and
understanding the network usage of various services.