Continuous Profiling of Go Services
Problem Statement
Labkit (Go) provides a monitoring endpoint. This means that most GitLab Go services (including Pages, Workhorse, Gitaly and Praefect) support pprof monitoring out of the box.
This has been used in two recent and relatively serious events:
- Performance of GitLab.com for the period of Sept-Nov 2019 #64 (closed)
  - @stanhu used pprof to find the cause of the Gitaly slowdowns: production#1501 (comment 262051339)
- Possible new memory usage/leak in gitaly: https://gitlab.com/gitlab-com/gl-infra/production/issues/1552
Using pprof in both these cases proved to be very effective.
However, there are several problems with this approach at present:
- Developers, who are best placed to interpret pprof results, do not have access to pprof information
- An operator with node access needs to take the snapshot, save it somewhere and provide access to developers
- Since pprof snapshots are taken on an ad hoc basis, it's not possible to compare the current state of the system to a known good state (say, from before the last deploy)
Proposed Solution
We institute continuous pprof profiling across the fleet. A blog post documenting this approach is here: https://medium.com/@tvii/continuous-profiling-and-go-6c0ab4d2504b
Implementation
- On a regular schedule, a small script collects heap and CPU pprof profiles from each Go service in the fleet. Examples of how this is done: https://gitlab.com/gitlab-com/runbooks/blob/master/howto/gitaly-profiling.md, https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/
- Using go tool pprof, or a tiny custom Go program, we generate a set of reports from the pprof profiles
- These reports are saved to a GCS bucket, using the path convention /<program_name>/<yyyy>/<mm>/<dd>/<hh:mm:ss>/<fqdn>/<port>/
- The GCS bucket is set to automatically delete reports after a week (or another period)
Usage
- Reviewing profiles is now fairly straightforward: access to the GCS bucket is granted to the engineering organisation
- During an incident, a developer or operator can quickly find a recent profile, and compare it to an older one.
This may have helped reduce the MTTD in both #64 (closed) and https://gitlab.com/gitlab-com/gl-infra/production/issues/1552.
Security
The reports generated from the profiles do not contain security-sensitive information and should therefore be safe to share throughout the organisation. Read access to the bucket could be granted to the entire organisation.
Questions
Wouldn't running continuous profiling slow things down?
When pprof was added to Gitaly (about two years ago: gitlab-org/gitaly#776 (closed)) we did extensive performance testing and found the effects of running pprof were negligible. If we were concerned, we could start with a daily profile during a quiet period and ramp it up from there as we better understand the value and risks.