Trigger stackprof by sending a SIGUSR2 signal

Igor requested to merge stackprof into master

What does this MR do?

This is a proposed implementation of #225473 (closed).

Problem statement

It is currently quite difficult to see which Ruby code is spending a lot of time on CPU, and to do so safely in production (e.g. gitlab.com).

Proposed solution

It would be great to have a low-overhead sampling profiler such as stackprof available in production.

Implementation notes

The main design decision is to use a SIGUSR2 signal to trigger profiling. This was chosen for a few reasons:

  • By using a signal instead of an endpoint, it can be applied to both Rails and Sidekiq processes.
  • Since we run many web processes via a Puma cluster, this allows targeting specific worker processes, or easily targeting all of them via pkill -USR2 puma:.
  • Synthetic per-endpoint profiling (possibly even using the StackProf.run API) is very misleading in a production environment, because it implies that only a single "request" is being profiled, whereas stacks from the entire process are being sampled. By modelling profiling as a process-wide time-based (or manually stoppable) operation, the UI is aligned with the implementation.

SIGUSR2 was chosen because it is the same signal recommended by gperftools, the ancestor of stackprof. There is a minor clash with Puma's own signal handling, but I believe this to be acceptable, as SIGUSR1 has the same behaviour in our configuration.

As a consequence of using a signal, we need to make the code both interrupt-safe and thread-safe. We can use a pipe as a signalling mechanism and handle the profiling in a separate thread.
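To illustrate the pipe-based approach, here is a minimal sketch of the self-pipe pattern in Ruby. It is not the code in this MR: SignalListener and its callback are hypothetical names chosen for the example. The signal handler only writes one byte to a pipe, and a regular thread does the actual work when it reads that byte.

```ruby
# Minimal self-pipe sketch. SignalListener and its methods are hypothetical
# names for illustration, not the code in this MR.
class SignalListener
  def initialize(signal, &on_signal)
    @signal = signal
    @on_signal = on_signal
    @reader, @writer = IO.pipe
  end

  def start
    # Trap context is restricted (for example, Mutex#lock raises there),
    # so the handler only performs a non-blocking one-byte write.
    Signal.trap(@signal) do
      begin
        @writer.write_nonblock('.')
      rescue IO::WaitWritable, Errno::EPIPE
        # Pipe is full (wake-ups already pending) or closed; nothing to do.
      end
    end

    # All real work happens on an ordinary thread, outside trap context.
    @thread = Thread.new do
      loop do
        @reader.readpartial(1) # blocks until the handler writes a byte
        @on_signal.call
      end
    end
    self
  end
end
```

Under these assumptions, something like SignalListener.new('USR2') { toggle_profiling }.start would run the (hypothetical) toggle_profiling method once per received signal.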

Stackprof works by setting a timer that collects stacks at a given frequency. The default frequency is 1 kHz (1000 samples per second). I lowered it to 100 Hz (100 samples per second), but this can be overridden via an environment variable.
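As a rough sketch of the frequency handling: stackprof takes its sampling interval in microseconds, so 100 Hz corresponds to an interval of 10,000 µs. The helper and the PROFILE_FREQUENCY_HZ variable below are hypothetical names for illustration; the actual variable name used in this MR is not shown here.

```ruby
# Hypothetical default and env variable name, for illustration only.
DEFAULT_FREQUENCY_HZ = 100

# Converts a sampling frequency in Hz into an interval in microseconds.
def sampling_interval_us(env = ENV)
  hz = Integer(env.fetch('PROFILE_FREQUENCY_HZ', DEFAULT_FREQUENCY_HZ))
  raise ArgumentError, 'frequency must be positive' unless hz.positive?

  1_000_000 / hz
end

# With the stackprof gem loaded, the interval would be passed along these
# lines (raw: true is assumed so that raw stacks are kept for flamegraphs):
#   StackProf.start(mode: :cpu, interval: sampling_interval_us, raw: true)
```

With no override set, sampling_interval_us returns 10000; with the variable set to 1000 it returns 1000, which matches stackprof's default.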

The sampled stacks are held in memory until StackProf.results is called. At that point they are written out to disk, and can be garbage collected from memory.

The first SIGUSR2 starts profiling; the second stops it and writes the samples to disk. These samples can potentially use a lot of memory. To avoid unbounded growth, the profiler times out after 30 seconds and stops automatically. This safeguards against forgetting to stop the profile.
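The start, stop and timeout behaviour can be sketched as one cycle on the thread that reads the wake-up pipe. This is an illustration under assumptions, not the code in this MR: run_profile_cycle, on_start and on_stop are hypothetical names, and the callbacks stand in for starting the profiler and writing the results.

```ruby
# One start/stop cycle. `reader` is the read end of the wake-up pipe that the
# signal handler writes to. All names here are hypothetical.
def run_profile_cycle(reader, timeout_s:, on_start:, on_stop:)
  reader.readpartial(1) # first wake-up: start profiling
  on_start.call

  # Wait for a second wake-up, but give up after timeout_s seconds.
  # IO.select returns nil when the timeout expires.
  ready = IO.select([reader], nil, nil, timeout_s)
  reader.readpartial(1) if ready # consume the second wake-up

  on_stop.call(ready ? :signal : :timeout)
end
```

Calling this in a loop with timeout_s: 30 gives the behaviour described above: profiling ends on the second signal or after 30 seconds, whichever comes first.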

The puma master has a process name of the form puma 4.3.3.gitlab.2 (unix:///Users/igor/code/gitlab-development-kit/gitlab.socket) [gitlab-puma-worker], whereas the workers are named puma: cluster worker 0: 61472 [gitlab-puma-worker]. We can therefore use the pattern puma: to select only the workers.

To initiate profile capture on all puma workers, run:

$ pkill -USR2 puma:

This will profile for 30 seconds (or until a second SIGUSR2 is sent) and then write the samples out to $TMPDIR/stackprof.$PID.$RANDOM.profile.
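For illustration, a path of that shape could be built as follows. This is a sketch, not the code in this MR: profile_path is a hypothetical helper, and the 10-character hex suffix is an assumption based on the example filename further down.

```ruby
require 'securerandom'
require 'tmpdir'

# Builds a path like <tmpdir>/stackprof.<pid>.<random hex>.profile.
# Dir.tmpdir honours $TMPDIR when it is set and usable.
def profile_path(pid = Process.pid, dir = Dir.tmpdir)
  File.join(dir, "stackprof.#{pid}.#{SecureRandom.hex(5)}.profile")
end
```

The random component keeps successive profiles from the same process from overwriting each other.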

These profiles can then be processed via the stackprof CLI and flamegraph.pl:

$ bundle exec stackprof --stackcollapse /tmp/stackprof.55769.c6c3906452.profile | flamegraph.pl > flamegraph.svg

This produces a flamegraph like the one below. Unlike rbspy output, it represents stacks that were on-CPU.

Screenshots

A sample flamegraph from profiling gdk locally.

flamegraph.svg

cc @mkaeppler @ayufan @andrewn @stanhu @smcgivern @cmiskell @msmiley

Edited by Igor
