
Cgroups: add cpu_quota_us limit · 80ce55b0
Steve Xuereb authored Feb 27, 2023
What
---
- Add a new configuration under `cgroups` called `cpu_quota_us` to
  configure `cfs_quota_us` for the parent cgroup
  https://docs.kernel.org/scheduler/sched-bwc.html?highlight=cfs_quota_us
- Add a new configuration under `cgroups.repositories` called
  `cpu_quota_us` to configure `cfs_quota_us` for the repository cgroup
  https://docs.kernel.org/scheduler/sched-bwc.html?highlight=cfs_quota_us
- Add metrics
    - `gitaly_cgroup_cpu_cfs_periods_total`: Read from `cpu.stat` nr_periods https://docs.kernel.org/scheduler/sched-bwc.html#statistics
    - `gitaly_cgroup_cpu_cfs_throttled_periods_total`: Read from `cpu.stat` nr_throttled https://docs.kernel.org/scheduler/sched-bwc.html#statistics
    - `gitaly_cgroup_cpu_cfs_throttled_seconds_total`: Read from `cpu.stat` throttled_time https://docs.kernel.org/scheduler/sched-bwc.html#statistics
- Add more test coverage when only specific values are set.
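
As a rough illustration of where the three metrics above come from, here is a minimal Go sketch that parses the text of a cgroup v1 `cpu.stat` file. The `CPUStat` type and `ParseCPUStat` function are hypothetical names used only for this example and are not Gitaly's actual implementation. It assumes `throttled_time` is reported in nanoseconds, as the kernel documentation linked above describes.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// CPUStat holds the CFS bandwidth counters found in a cgroup's cpu.stat file.
// Hypothetical type for illustration only.
type CPUStat struct {
	Periods          uint64  // nr_periods
	ThrottledPeriods uint64  // nr_throttled
	ThrottledSeconds float64 // throttled_time, converted from nanoseconds
}

// ParseCPUStat parses the contents of a cgroup v1 cpu.stat file.
// Lines that are not "key value" pairs and unknown keys are ignored.
func ParseCPUStat(contents string) (CPUStat, error) {
	var stat CPUStat

	scanner := bufio.NewScanner(strings.NewReader(contents))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}

		value, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return CPUStat{}, fmt.Errorf("parsing %q: %w", fields[0], err)
		}

		switch fields[0] {
		case "nr_periods":
			stat.Periods = value
		case "nr_throttled":
			stat.ThrottledPeriods = value
		case "throttled_time":
			// Assumption: nanoseconds, converted here to seconds.
			stat.ThrottledSeconds = float64(value) / 1e9
		}
	}

	return stat, scanner.Err()
}
```

A high ratio of `ThrottledPeriods` to `Periods` suggests the cgroup is regularly hitting its quota.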

Why
---
At the moment we limit memory, and we limit CPU via
[`cpu.shares`](https://kernel.googlesource.com/pub/scm/linux/kernel/git/glommer/memcg/+/cpu_stat/Documentation/cgroups/cpu.txt),
which only throttles a cgroup when there is contention on the CPU.
This means that a single repository can still hog all of the
CPU on a Gitaly node. We've seen a case of this in
gitlab-com/gl-infra/production#8318, where a
single repository saturated the CPU and the scheduler couldn't balance
the CPU for other tasks/requests to be scheduled.

We hoped CPU shares would be enough, but we need an upper CPU quota for
gitaly cgroups so no single repository can fully saturate the CPU.

There are a few concerns, each addressed below:

Concern 1: cfs_period_us

`cfs_period_us` is the period against which `cfs_quota_us` (what we are
setting now) is measured. The default value seems to be
[hardcoded](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/sched/fair.c?h=v5.15.92#n5492)
by the Linux kernel, but it can be updated, so Gitaly explicitly
sets it to 100ms (the default value).
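
To make the relationship between the two values concrete, here is a small Go sketch of the arithmetic, assuming the 100ms period mentioned above. The helper names `QuotaForCores` and `CoresForQuota` are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"math"
)

// cfsPeriodUs is the 100ms period described above, in microseconds.
const cfsPeriodUs = 100_000

// QuotaForCores returns the cfs_quota_us value that allows a cgroup to use
// roughly the given number of CPU cores per period.
// Hypothetical helper for illustration only.
func QuotaForCores(cores float64) (int64, error) {
	if cores <= 0 {
		return 0, fmt.Errorf("cores must be positive, got %v", cores)
	}
	return int64(math.Round(cores * cfsPeriodUs)), nil
}

// CoresForQuota is the inverse: how many cores' worth of CPU time a given
// cfs_quota_us allows per period.
func CoresForQuota(quotaUs int64) float64 {
	return float64(quotaUs) / cfsPeriodUs
}
```

For example, a quota of 400000 with a 100000 period lets the cgroup consume up to four cores' worth of CPU time in each period, while a quota of 50000 caps it at half a core.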

Concern 2: not using `cfs_burst_us`

`cfs_burst_us` could allow for CPU bursts, even when they exceed
`cfs_quota_us`. We don't set it because it's only available on newer kernel
versions (5.15). Instead, users can avoid throttling by
oversubscribing `cfs_quota_us`.

Concern 3: Wasting available resources

When the user sets these, we'll be artificially limiting the CPU that the
cgroups can consume. This can leave performance on the table when a repository is
using all of its quota and no other process is using the CPU. This is the
only drawback, and one we are willing to accept since the limit adds more
reliability in the long run. We can reduce the effect by oversubscribing.
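
As a sketch of what oversubscribing could look like numerically, assume it means setting the repository quotas so that their sum exceeds the parent cgroup's quota. The function name and parameters below are hypothetical and only illustrate the arithmetic.

```go
package main

// OversubscriptionRatio reports how many times the combined repository quotas
// exceed the parent quota. A value above 1 means the repository cgroups are
// oversubscribed: each can burn its full quota on its own, but not all of
// them at the same time.
// Hypothetical helper for illustration only.
func OversubscriptionRatio(parentQuotaUs, repoQuotaUs int64, repoCgroups int) float64 {
	if parentQuotaUs <= 0 {
		return 0
	}
	return float64(repoQuotaUs) * float64(repoCgroups) / float64(parentQuotaUs)
}
```

With a parent quota of 800000, a repository quota of 400000, and 10 repository cgroups, the ratio is 5: any single repository can use up to half of the parent's quota, which limits wasted capacity while still stopping one repository from saturating the CPU.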

Concern 4: Observability

The kernel already exports
[stats](https://docs.kernel.org/scheduler/sched-bwc.html#statistics),
which Gitaly exposes as metrics, as does
[cadvisor](https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics).
Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17332


Changelog: added
Signed-off-by: Steve Azzopardi <sazzopardi@gitlab.com>