Gather data on Gitaly CPU/Memory usage for cgroups

Goals

Approach (from &344 (comment 1077369645)):

  • Do a long-tail analysis of outliers in the memory usage of any single gRPC call. This will rely mainly on the rusage measurements exposed in the Gitaly logs over the last week or so. Ensure these outliers fit within the planned per-cgroup burst ceiling.
  • Do a 50th/95th/99th percentile analysis of anonymous memory usage per Gitaly node. Ideally we would exclude Gitaly itself and its Ruby helpers, but for a rough approximation it is easier to include them. Including them also helps compensate for the fact that the cgroups need some room for file-backed pages too. Ensure that this anonymous memory usage distribution can still be satisfied if any one cgroup consumes its entire limit. (Example: If each cgroup's limit is 60% of the parent cgroup's limit, then the remaining 40% should be enough to cover the workload's typical usage. Otherwise, the oversubscription ratio would not adequately insulate the other cgroups from a single greedy project.)
  • This calibration may differ for each Gitaly shard: default, hdd, marquee, praefect.
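
The two checks above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the analysis actually run: the function names (`percentile`, `outliers_fit`, `headroom_ok`) are hypothetical, the sample values are made up rather than measured, and it assumes the per-call peak memory and per-node anonymous memory samples have already been extracted into plain lists of byte counts.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so no float rounding is involved.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def outliers_fit(per_call_peak_bytes, per_cgroup_limit_bytes):
    """True if even the largest single gRPC call fits under the per-cgroup ceiling."""
    return max(per_call_peak_bytes) <= per_cgroup_limit_bytes


def headroom_ok(parent_limit_bytes, per_cgroup_pct, typical_usage_bytes):
    """True if the typical workload still fits when one cgroup uses its whole limit.

    per_cgroup_pct is the per-cgroup limit as an integer percentage of the
    parent cgroup's limit (60 in the example from the text above).
    """
    per_cgroup_limit = parent_limit_bytes * per_cgroup_pct // 100
    remaining = parent_limit_bytes - per_cgroup_limit
    return remaining >= typical_usage_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Made-up node-level anonymous memory samples, in bytes (NOT real data).
    node_anon = [g * GIB for g in (18, 20, 21, 22, 25, 27, 30, 31, 33, 38)]
    parent_limit = 100 * GIB  # assumed parent cgroup limit
    for pct in (50, 95, 99):
        usage = percentile(node_anon, pct)
        print(pct, usage // GIB, headroom_ok(parent_limit, 60, usage))
```

Which percentile counts as "typical usage" is a judgment call for the analysis. The sketch simply reports the check at each of the three, and it would be run once per shard with that shard's own samples and limits.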

Results

Edited by Matt Smiley