Ruby CPU profiling (and flamegraphs) in production
Problem statement
It's currently quite difficult to see which ruby code is spending a lot of time on CPU, and to do so safely in production (e.g. gitlab.com).
Proposed solution
It would be great to have a low-overhead sampling profiler such as stackprof available in production, possibly with the ability to trigger it via a process signal. This way it can be triggered for sidekiq processes as well as web ones.
Background
As traffic on gitlab.com grows, so does the size of our fleet. One of the resources we are constrained by is CPU. In order to manage the trajectory of our cost, it's a good idea to profile our production workload and find opportunities to optimize, as that translates directly into infrastructure cost savings.
We do have some mechanisms in place to profile such CPU usage:
- As part of our logging, we log per-request durations, as well as per-thread time spent on-cpu via CLOCK_THREAD_CPUTIME_ID. This gives an idea of CPU time spent on a certain endpoint (very useful information to have!), but not which code paths were hot.
- Request profiling allows enabling ruby-prof to profile a single request. However, because ruby-prof is a tracing profiler, it is not suitable for production. See also: How do Ruby & Python profilers work?.
- rbspy is a non-invasive profiler that can be run from outside of the process. The main limitation at the time of writing, however, is that AFAIK it samples all stacks, not only on-CPU ones. So while it can indicate where threads were busy or blocked, it doesn't tell us whether the thread was actually running. See also: Measure how much time is spent on CPU.
- perf record allows us to profile CPU usage at the ruby VM (C) level -- but it's not really possible to correlate this with application-level stacks. We may get an indication of which types of ruby operations are expensive, but not where they originate in the ruby code.
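To illustrate the difference between wall-clock time and on-CPU time that the first and third bullets rely on, here is a minimal sketch using Ruby's Process.clock_gettime. The helper name measure is made up for this example and is not how GitLab's logging is implemented; it also assumes a platform where Process::CLOCK_THREAD_CPUTIME_ID is defined (e.g. Linux).

```ruby
# Hypothetical helper: measures wall-clock and per-thread CPU time around a block.
def measure
  wall_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  cpu_start  = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
  yield
  {
    wall_s: Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_start,
    cpu_s:  Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID) - cpu_start
  }
end

# A blocked thread accumulates wall time but almost no CPU time.
blocked = measure { sleep 0.2 }

# A busy thread accumulates CPU time roughly in step with wall time.
busy = measure { 200_000.times { |i| Math.sqrt(i) } }

puts format('blocked: wall=%.3fs cpu=%.3fs', blocked[:wall_s], blocked[:cpu_s])
puts format('busy:    wall=%.3fs cpu=%.3fs', busy[:wall_s], busy[:cpu_s])
```

A profiler that samples all stacks would attribute time to the sleep call as well, whereas a CPU-time measurement only counts the busy loop. Neither number says which code path was hot, which is the gap a sampling profiler fills.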
One solution that fills those gaps is stackprof. It runs within the ruby process. It's a sampling profiler, which means the overhead is very low, making it suitable for production profiling.
The collected data can also be visualized as a flamegraph, allowing us to see where we can potentially optimize the code.
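As a rough sketch of what block-scoped CPU profiling with stackprof could look like: the wrapper profile_cpu, the fib workload, the dump path and the 1000 microsecond interval are all illustrative assumptions, not part of this proposal. The wrapper falls back to running the block unprofiled if the gem is not installed.

```ruby
require 'tmpdir'

begin
  require 'stackprof'
rescue LoadError
  # stackprof gem not installed; profile_cpu below just runs the block.
end

# Hypothetical wrapper: runs the block under stackprof's CPU mode when available.
def profile_cpu(out_path)
  return yield unless defined?(StackProf)

  result = nil
  # raw: true keeps the raw samples, which flamegraph output needs.
  StackProf.run(mode: :cpu, interval: 1000, raw: true, out: out_path) do
    result = yield
  end
  result
end

# Illustrative CPU-bound workload.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

dump_path = File.join(Dir.tmpdir, "stackprof-cpu-#{Process.pid}.dump")
profile_cpu(dump_path) { fib(24) }
```

With the gem installed, the resulting dump can be inspected with the stackprof command-line tool. To my knowledge something like `stackprof --d3-flamegraph <dump> > flamegraph.html` renders it as a flamegraph, but check the flags against the installed version.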
It would be great if we were able to trigger these profiles on demand. gperftools, which is the inspiration for stackprof, has a mechanism for triggering profiles via a process signal.
We could adopt a similar strategy: send a signal to start the profiler, send the same signal again to stop it, and perhaps have some default limits on duration and file size that stop profiling automatically.
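A minimal sketch of such a toggle, under stated assumptions: the class ProfileToggle, the method install_signal_toggle, the choice of USR2 and the one-second polling interval are all hypothetical. The start and stop callbacks are placeholders where calls such as StackProf.start and StackProf.stop would go. Only the duration limit is covered here, not the file size limit.

```ruby
# Hypothetical helper: tracks whether profiling is on and enforces a duration limit.
class ProfileToggle
  def initialize(max_seconds:, on_start:, on_stop:)
    @max_seconds = max_seconds
    @on_start = on_start
    @on_stop = on_stop
    @started_at = nil
  end

  def running?
    !@started_at.nil?
  end

  # Start profiling if idle, stop it if running.
  def toggle(now: monotonic_now)
    if running?
      stop
    else
      start(now)
    end
  end

  # Stop automatically once the duration limit has been reached.
  def expire(now: monotonic_now)
    stop if running? && now - @started_at >= @max_seconds
  end

  private

  def start(now)
    @started_at = now
    @on_start.call
  end

  def stop
    @started_at = nil
    @on_stop.call
  end

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Self-pipe pattern: the trap handler only writes a byte, and a background
# thread does the actual work outside of trap context.
def install_signal_toggle(toggle, signal: 'USR2')
  reader, writer = IO.pipe
  Signal.trap(signal) { writer.write_nonblock('.', exception: false) }

  Thread.new do
    loop do
      if IO.select([reader], nil, nil, 1)
        # Drain pending bytes; signals arriving close together toggle once.
        reader.read_nonblock(64, exception: false)
        toggle.toggle
      else
        toggle.expire
      end
    end
  end
end
```

Usage would be roughly `install_signal_toggle(ProfileToggle.new(max_seconds: 30, on_start: -> { ... }, on_stop: -> { ... }))` at process boot, followed by `kill -USR2 <pid>` to start and stop. The signal would need to be one that the web server and sidekiq do not already use for something else.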
A longer-term goal would be to get continuous profiling for ruby, similar to how we have it set up for go services (gitlab-com/gl-infra/scalability#257 (closed), gitlab-com/gl-infra/scalability#334 (closed)). But I'd consider that out of scope for this issue.