[prometheus] gprd - Prometheus resource audit (only gprd-gitlab-gke release=gitlab-monitoring)

Raúl Naveiras requested to merge rnaveiras-prometheus-audit-gprd into master

What and Why

Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24579

Needs: !3427 (merged) to be merged first, then rebase this MR.

Common changes and best practices

  • Add memory limits. Not having a memory limit in Kubernetes increases the chance of the pod being killed by the OOMKiller.
  • Memory request = memory limit. Best practice that again reduces the impact of the OOMKiller on these workloads.
  • Enabled the "memory-snapshot-on-shutdown" feature (requires Prometheus >= 2.30). Prometheus writes out a raw snapshot of its current in-memory state upon shutdown, which can then be re-read into memory more efficiently when the server restarts. This reduces start-up time by 50-80%. Faster restarts: https://github.com/prometheus/prometheus/pull/7229
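As a sketch, the practices above map onto the Prometheus Operator's `Prometheus` custom resource roughly as follows. The field names (`resources`, `enableFeatures`) come from the operator's API; the metadata and values here are illustrative, not this MR's actual diff:

```yaml
# Illustrative sketch only -- not the exact manifest changed in this MR.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: gitlab-monitoring-prometheus
spec:
  # request == limit gives the container predictable memory headroom and
  # reduces the chance of the OOMKiller targeting it under node pressure.
  resources:
    requests:
      memory: 400Gi
    limits:
      memory: 400Gi
  # Requires Prometheus >= 2.30.
  enableFeatures:
    - memory-snapshot-on-shutdown
```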

Summary of changes per environment and instance

| Env | Cluster | Prometheus instance | Memory requests | Memory limits | CPU requests | CPU limits | Dashboard |
|-----|---------|---------------------|-----------------|---------------|--------------|------------|-----------|
| gprd | gprd-gitlab-gke | gitlab-monitoring-promethe-prometheus | 400Gi | 600Gi -> 400Gi | 65000m | 75000m -> none | Link |

This MR should only affect the gprd (gprd-gitlab-gke) environment, and only release=gitlab-monitoring. The releases release=prometheus-gitlab-app-1 and release=prometheus-gitlab-rw should show no diff; those will be updated in a follow-up MR.

About memory-snapshot-on-shutdown

  • Faster restarts, quicker recovery. It also helps when the different probes fail for various reasons and Prometheus receives a SIGTERM.
  • Snapshots will take additional disk space.
  • Depending on how many series you have and the write speed of the disk, shutdown can take some time. Therefore we need to adjust the pod termination grace period, but that setting is not yet supported in the Prometheus Operator (https://github.com/prometheus-operator/prometheus-operator/issues/3433). At the moment it is hardcoded to 10m (600s), which I believe will be more than enough for this use case. As I roll out the setting, I will test this assumption.
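For context, the grace period mentioned above lives in the StatefulSet the operator generates, not in the `Prometheus` CR. A minimal sketch of the effective pod spec (the 600s value is the operator's hardcoded default referenced in issue 3433):

```yaml
# Effective pod spec in the operator-generated StatefulSet (sketch).
# Not configurable via the Prometheus CR today.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600  # 10m budget for the shutdown snapshot to flush
```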

Additional notes

Our peak memory usage is around 221Gi; this is a big instance. Prometheus uses a lot of memory during crash recovery to process the WAL. For now I am keeping the 400Gi memory limit. We might be able to reduce that number after testing crash recovery in Prometheus.

Rule of thumb: provision roughly double the memory used at peak.
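The peak number can be checked with a query along these lines. This is a sketch: the namespace, pod, and container label values are assumptions for illustration and may differ in our clusters:

```promql
# Peak working-set memory of the Prometheus container over the last 2 weeks.
max_over_time(
  container_memory_working_set_bytes{
    namespace="monitoring",
    pod=~"prometheus-gitlab-monitoring-.*",
    container="prometheus"
  }[2w]
)
```

Doubling the result of this query gives the rule-of-thumb limit; ~221Gi peak suggests the 400Gi limit is in the right ballpark.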
