Measure uprobe overhead to support safer instrumentation decisions
Purpose
We often use timer-based profiling at safe sampling rates such as 99 or 497 samples/second. At these low sampling frequencies, common actions like counting, latency measurement, and stack-trace capture add a generally acceptable amount of overhead.
However, sometimes we need to ask questions that timer-based profiling has a harder time answering. Instrumenting every call to a specific function can produce a much higher and more volatile event rate than timer-based profiling, which brings additional safety considerations around the instrumentation's performance overhead.
To support well-informed safety decisions when choosing instrumentation points, here we measure the overhead of some common use cases for instrumenting an arbitrary function in a userspace program via the uprobe/uretprobe interface.
Knowing the overhead lets us:
- avoid causing a new performance problem while studying an existing one
- avoid skewing our measurements by adding excessive overhead
- choose instrumentation points based on workload-specific call frequencies
Note: This benchmark focuses on uprobes only. Kernel instrumentation via kprobes and tracepoints is generally cheaper than uprobes, by up to an order of magnitude.
Background
This week brought another case where answering an important behavioral question about our system required instrumenting a function that is called often enough that taking the measurement could itself impact performance.
In that case the overhead turned out to be acceptable (roughly 1% for 30 seconds), but quantifying the overhead lets us make more confident decisions about the safety and impact of these measurements.
Results
General guidelines
Avoid instrumenting functions that are called more often than:
- 10K calls/second from any single thread
- 10K calls/second * [num_cpus] from all instrumented threads system-wide (a rough call-rate check is sketched just below this list)
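If the current call rate of a candidate function is unknown, it can be estimated with a short counting run before attaching anything heavier. A minimal sketch using BCC's funccount, where the binary path and function name are hypothetical placeholders and flags may vary slightly across BCC versions:

    # Count calls to the candidate function, printing a per-second rate for 10 seconds.
    # Note the count itself attaches a uprobe, so keep the run short.
    sudo funccount -i 1 -d 10 '/usr/local/bin/myapp:parse_request'

If the observed rate is comfortably below the thresholds above, the heavier use cases measured below are unlikely to cause a problem.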
For stack traces, prefer framepointer over DWARF. DWARF costs over twice as much CPU overhead per call as framepointer, and its captures are also much larger, since DWARF copies 8 KB of stack with each sample.
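As an illustration, the two stack-trace modes differ only in the --call-graph argument passed to perf-record; the binary, function, and PID below are hypothetical placeholders:

    # Create a uprobe on the target function (defines the event probe_myapp:parse_request).
    sudo perf probe -x /usr/local/bin/myapp parse_request

    # Framepointer stack walks: cheaper, small samples (requires frame pointers compiled in).
    sudo perf record -e probe_myapp:parse_request --call-graph fp -p "$target_pid" -- sleep 30

    # DWARF stack walks: copies up to 8 KB of user stack per sample by default.
    sudo perf record -e probe_myapp:parse_request --call-graph dwarf -p "$target_pid" -- sleep 30

    # Clean up the uprobe when finished.
    sudo perf probe -d 'probe_myapp:*'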
Overhead measurements
The above guidelines are based on the following results summary.
Use cases:
1.05 us/call for counter via perf-stat
1.00 us/call for counter via BPF funccount
2.45 us/call for latency measurement via BPF funclatency
1.47 us/call for stack trace via perf-record using framepointers
3.55 us/call for stack trace via perf-record using DWARF
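For reference, a rough sketch of what the counter and latency use cases look like on the command line, reusing the hypothetical probe_myapp:parse_request event and target from the stack-trace sketch above (these are illustrative invocations, not the exact benchmark setup, and BCC flags may differ across versions):

    # Counter via perf-stat: count calls to the uprobe event for 30 seconds.
    sudo perf stat -e probe_myapp:parse_request -p "$target_pid" -- sleep 30

    # Counter via BPF funccount (BCC).
    sudo funccount -p "$target_pid" -d 30 '/usr/local/bin/myapp:parse_request'

    # Latency measurement via BPF funclatency (BCC); this attaches a uprobe plus a
    # uretprobe for the return, which is why its per-call cost is higher.
    sudo funclatency -p "$target_pid" -u -d 30 '/usr/local/bin/myapp:parse_request'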
For any single thread, the overhead therefore exceeds 1% of one CPU's time once the event rate exceeds 10^4 / [overhead per call in us] calls/second (1% of the 10^6 us available in each CPU-second):
9523 calls/s for counter via perf-stat
10000 calls/s for counter via BPF funccount
4081 calls/s for latency measurement via BPF funclatency
6802 calls/s for stack trace via perf-record using framepointers
2816 calls/s for stack trace via perf-record using DWARF
When the instrumented function is called by a single thread, that thread can use at most 1 CPU, so in that case the overhead estimate ignores the CPU count and is simply (with [call frequency] in calls/second and [overhead per call] in seconds):
[percent overhead] = 100 * [call frequency] * [overhead per call]
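For example, a counter via BPF funccount (1.00 us/call, i.e. 0.000001 s/call) on a function that one thread calls 10,000 times/second costs about 100 * 10000 * 0.000001 = 1% of that thread's CPU.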
When the instrumented function is called by many threads or processes concurrently, such that the number of callers approaches or exceeds the number of CPUs, the overhead estimate, as a share of total system CPU capacity and with [call frequency] now the aggregate rate across all callers, is:
[percent overhead] = 100 * [call frequency] * [overhead per call] / [number of cpus]
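For example, the same funccount counter on a function called 10,000 times/second in each of 8 threads on a 16-CPU host sees an aggregate rate of 80,000 calls/second, giving 100 * 80000 * 0.000001 / 16 = 0.5% of total system CPU capacity (equivalently, 8% of one CPU's worth of time, spread across the calling threads).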