Measure uprobe overhead to support safer instrumentation decisions
Purpose
We often use timer-based profiling at safe sampling rates such as 99 or 497 samples/second. At these low sampling frequencies, common actions like counting, latency measurement, and stack-trace capture add a generally acceptable amount of overhead.
However, sometimes we need to ask questions that timer-based profiling has a harder time answering. Instrumenting every call to a specific function can produce a much higher and more volatile event rate than timer-based profiling, which brings additional safety considerations around the instrumentation's performance overhead.
To support well-informed safety decisions when choosing instrumentation points, here we measure the overhead of some common use cases for instrumenting an arbitrary function in a userspace program via the uprobe/uretprobe interface.
Knowing the overhead lets us:
- avoid causing a new performance problem while studying an existing one
- avoid skewing our measurements by adding excessive overhead
- choose instrumentation points based on workload-specific call frequencies
Note: This benchmark focuses on uprobes only. Kernel instrumentation via kprobes and tracepoints is generally cheaper than uprobes, by up to an order of magnitude.
Background
This week brought another case where answering an important behavioral question about our system required instrumenting a function that is called often enough that taking the measurement could itself impact performance.
In that case the overhead turned out to be acceptable (roughly 1% for 30 seconds), but quantifying the overhead lets us make more confident decisions about the safety and impact of these measurements.
Results
General guidelines
Avoid instrumenting functions that are called more often than:
- 10K calls/second from any single thread
- 10K calls/second * [num_cpus] from all instrumented threads system-wide (a rough call-rate check is sketched just below this list)
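If the current call rate of a candidate function is unknown, it can be estimated with a short counting run before attaching anything heavier. A minimal sketch using BCC's funccount, where the binary path and function name are hypothetical placeholders and flags may vary slightly across BCC versions:

    # Count calls to the candidate function, printing a per-second rate for 10 seconds.
    # Note the count itself attaches a uprobe, so keep the run short.
    sudo funccount -i 1 -d 10 '/usr/local/bin/myapp:parse_request'

If the observed rate is comfortably below the thresholds above, the heavier use cases measured below are unlikely to cause a problem.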
For stack traces, prefer framepointer over DWARF. DWARF costs over twice as much CPU overhead per call as framepointer, and its captures are also much larger, since DWARF copies 8 KB of stack with each sample.
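As an illustration, the two stack-trace modes differ only in the --call-graph argument passed to perf-record; the binary, function, and PID below are hypothetical placeholders:

    # Create a uprobe on the target function (defines the event probe_myapp:parse_request).
    sudo perf probe -x /usr/local/bin/myapp parse_request

    # Framepointer stack walks: cheaper, small samples (requires frame pointers compiled in).
    sudo perf record -e probe_myapp:parse_request --call-graph fp -p "$target_pid" -- sleep 30

    # DWARF stack walks: copies up to 8 KB of user stack per sample by default.
    sudo perf record -e probe_myapp:parse_request --call-graph dwarf -p "$target_pid" -- sleep 30

    # Clean up the uprobe when finished.
    sudo perf probe -d 'probe_myapp:*'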
Overhead measurements
The above guidelines are based on the following results summary.
Use cases:
1.05 us/call for counter via perf-stat
1.00 us/call for counter via BPF funccount
2.45 us/call for latency measurement via BPF funclatency
1.47 us/call for stack trace via perf-record using framepointers
3.55 us/call for stack trace via perf-record using DWARF
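For reference, a rough sketch of what the counter and latency use cases look like on the command line, reusing the hypothetical probe_myapp:parse_request event and target from the stack-trace sketch above (these are illustrative invocations, not the exact benchmark setup, and BCC flags may differ across versions):

    # Counter via perf-stat: count calls to the uprobe event for 30 seconds.
    sudo perf stat -e probe_myapp:parse_request -p "$target_pid" -- sleep 30

    # Counter via BPF funccount (BCC).
    sudo funccount -p "$target_pid" -d 30 '/usr/local/bin/myapp:parse_request'

    # Latency measurement via BPF funclatency (BCC); this attaches a uprobe plus a
    # uretprobe for the return, which is why its per-call cost is higher.
    sudo funclatency -p "$target_pid" -u -d 30 '/usr/local/bin/myapp:parse_request'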
For any single thread, the overhead therefore exceeds 1% of one CPU's time once the event rate exceeds 10^4 / [overhead per call in us] calls/second (1% of the 10^6 us available in each CPU-second):
9523 calls/s for counter via perf-stat
10000 calls/s for counter via BPF funccount
4081 calls/s for latency measurement via BPF funclatency
6802 calls/s for stack trace via perf-record using framepointers
2816 calls/s for stack trace via perf-record using DWARF
When the instrumented function is called by a single thread, that thread can use at most 1 CPU, so in that case the overhead estimate ignores the CPU count and is simply (with [call frequency] in calls/second and [overhead per call] in seconds):
[percent overhead] = 100 * [call frequency] * [overhead per call]
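For example, a counter via BPF funccount (1.00 us/call, i.e. 0.000001 s/call) on a function that one thread calls 10,000 times/second costs about 100 * 10000 * 0.000001 = 1% of that thread's CPU.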
When the instrumented function is called by many threads or processes concurrently, such that the number of callers approaches or exceeds the number of CPUs, the overhead estimate, as a share of total system CPU capacity and with [call frequency] now the aggregate rate across all callers, is:
[percent overhead] = 100 * [call frequency] * [overhead per call] / [number of cpus]
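For example, the same funccount counter on a function called 10,000 times/second in each of 8 threads on a 16-CPU host sees an aggregate rate of 80,000 calls/second, giving 100 * 80000 * 0.000001 / 16 = 0.5% of total system CPU capacity (equivalently, 8% of one CPU's worth of time, spread across the calling threads).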