Gather data on Gitaly CPU/Memory usage for cgroups
Goals
Approach (from &344 (comment 1077369645)):
- Do the long-tail analysis for outliers in the memory usage of any single gRPC call. This will mainly rely on the rusage measurements exposed by the Gitaly logs over the last week or so. Ensure these outliers fit within the planned per-cgroup burst ceiling.
- Do the 50th/95th/99th percentile analysis for anonymous memory usage per Gitaly node. Ideally we would exclude Gitaly itself and its Ruby helpers, but for a rough approximation, it's easier to include them. Including them also helps compensate for the fact that the cgroups need some room for file-backed pages too. Ensure that this anonymous memory usage distribution can still be satisfied if any one cgroup consumes its entire limit. (Example: If each cgroup's limit is 60% of the parent cgroup's limit, then the remaining 40% should be enough to cover the workload's typical usage. Otherwise, the oversubscription ratio would not adequately insulate the other cgroups from a single greedy project.)
- This calibration may differ for each Gitaly shard: default, hdd, marquee, praefect
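The headroom check described in the approach can be sketched as follows. This is a minimal illustration, not the analysis actually run for this issue: the function names, the nearest-rank percentile method, and all sample numbers are hypothetical, and real inputs would come from host-level anonymous memory metrics for each shard.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers; pct is an integer in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Ceiling of pct * n / 100 using integer arithmetic, clamped to at least 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def headroom_is_sufficient(parent_limit, per_cgroup_limit, typical_usage):
    """True if typical usage still fits in the parent cgroup after one
    child cgroup has consumed its entire limit."""
    remaining = parent_limit - per_cgroup_limit
    return typical_usage <= remaining


# Hypothetical anonymous memory samples for one node, in GiB.
samples_gib = [30, 32, 35, 38, 41]

# Hypothetical sizing: each per-repo cgroup may use 60% of the parent limit.
parent_limit_gib = 100
per_cgroup_limit_gib = 60

for pct in (50, 95, 99):
    usage = percentile(samples_gib, pct)
    ok = headroom_is_sufficient(parent_limit_gib, per_cgroup_limit_gib, usage)
    print(f"p{pct}: {usage} GiB, fits in remaining headroom: {ok}")
```

With these made-up numbers the median fits in the remaining 40 GiB but the 95th and 99th percentiles do not, which would suggest lowering the per-cgroup limit or raising the parent limit for that shard.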
Results
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16239#note_1102804732 - Outlier analysis summary
  - Judging from the last 7 days, no legitimate single git command required more than 20 GB of memory.
  - There were 3 gRPC calls whose git command did exceed that threshold, but they represent abuse cases that this cgroups implementation aims to mitigate.
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16239#note_1102870133 - Host-level anonymous memory usage summary
  - Covers capacity and 7-day 95th percentile utilization for the busiest host in each shard and stage.
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16239#note_1102877054 - Conclusions for initial calibration of cgroup sizing
  - Covers per-repo and parent cgroup sizing.
Edited by Matt Smiley