Gather data on Gitaly CPU/Memory usage for cgroups
### Goals

Approach (from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/344#note_1077369645):

* Do the long-tail analysis for outliers in memory usage by any single gRPC. This will mainly rely on the `rusage` measurements exposed by the gitaly logs over the last week or so. Ensure these outliers fit in the planned per-cgroup burst ceiling.
* Do the 50/95/99th percentile analysis for anonymous memory usage per gitaly node. Ideally we would exclude gitaly itself and its ruby helpers, but for a rough approximation it's easier to include them. Including them also helps compensate for the fact that the cgroups need some room for file-backed pages too. Ensure that this anonymous memory usage distribution can still be satisfied if any one cgroup consumes its entire limit. (Example: if each cgroup's limit is 60% of the parent cgroup's limit, then the remaining 40% should be enough to cover the workload's typical usage. Otherwise, the oversubscription ratio would not adequately insulate the other cgroups from a single greedy project.)
* This calibration may differ for each gitaly shard: `default`, `hdd`, `marquee`, `praefect`.

### Results

* https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16239#note_1102804732 - Outlier analysis summary
  * Judging from the last 7 days, no legitimate single git command required more than 20 GB of memory.
  * There were 3 gRPC calls whose git command did exceed that threshold, but they represent abuse cases that this cgroups implementation aims to mitigate.
* https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16239#note_1102870133 - Host-level anonymous memory usage summary
  * Covers capacity and 7-day 95th percentile utilization for the busiest host in each shard and stage.
* https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16239#note_1102877054 - Conclusions for initial calibration of cgroup sizing
  * Covers per-repo and parent cgroup sizing.
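The outlier and percentile analysis described in the goals could be sketched roughly like this. This is a minimal sketch, not the actual analysis: it assumes each gitaly log line is a JSON object carrying a `command.maxrss` field in kilobytes (the field name and unit are assumptions about the log schema), and uses the 20 GB threshold from the outlier summary below.

```python
import json
import statistics

# Threshold from the outlier analysis: no legitimate single git command
# required more than 20 GB of memory over the 7-day window.
THRESHOLD_KB = 20 * 1024 * 1024  # 20 GB expressed in kilobytes


def analyze(log_lines):
    """Summarize per-gRPC peak RSS from gitaly JSON log lines.

    Assumes (hypothetically) that each line is a JSON object which may
    carry a "command.maxrss" field in kilobytes, as reported by
    getrusage(2) for the git command spawned to serve the gRPC.
    """
    maxrss = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines
        rss = entry.get("command.maxrss")
        if isinstance(rss, (int, float)):
            maxrss.append(rss)
    if len(maxrss) < 2:
        return None
    # quantiles(n=100) returns the 99 percentile cut points.
    qs = statistics.quantiles(maxrss, n=100)
    return {
        "p50_kb": qs[49],
        "p95_kb": qs[94],
        "p99_kb": qs[98],
        "outliers_over_20g": [r for r in maxrss if r > THRESHOLD_KB],
    }
```

In practice the week of production logs would be queried centrally (e.g. via the logging pipeline) rather than read from flat files, but the aggregation step is the same.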
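The oversubscription check from the goals (does the parent cgroup keep enough headroom when a single child cgroup bursts to its full limit?) is simple arithmetic. A sketch, using the illustrative 60% child-limit ratio from the example above (the function names and the 100 GB parent limit in the usage note are hypothetical, not the actual shard sizing):

```python
def remaining_after_burst(parent_limit_gb, child_ratio):
    """Headroom left in the parent cgroup when one child cgroup
    consumes its entire limit (child limit = child_ratio * parent limit)."""
    return parent_limit_gb - child_ratio * parent_limit_gb


def burst_is_safe(parent_limit_gb, child_ratio, typical_usage_gb):
    """True if the remaining cgroups' typical anonymous-memory usage
    (e.g. the host-level 95th percentile, minus the greedy project)
    still fits after one child takes its whole limit."""
    return remaining_after_burst(parent_limit_gb, child_ratio) >= typical_usage_gb
```

For example, with a 100 GB parent limit and a 0.6 per-cgroup ratio, 40 GB of headroom remains, so the configuration only insulates the other cgroups if their combined typical usage fits in 40 GB.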