
[prometheus] gprd-us-east1-d Prometheus resources

What and Why

Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24579

Common changes and best practices

  • Memory request = memory limit. Best practice; again, it reduces the impact of the OOMKiller on these workloads.
  • Enabled the "memory-snapshot-on-shutdown" feature (requires Prometheus >= 2.30). On shutdown, Prometheus writes a raw snapshot of its current in-memory state, which can then be re-read into memory more efficiently when the server restarts, reducing start-up time by roughly 50-80%. Faster restarts: https://github.com/prometheus/prometheus/pull/7229 (a configuration sketch follows this list).
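As a rough sketch, assuming the cluster uses the Prometheus Operator's `Prometheus` custom resource and an operator version that supports `spec.enableFeatures` (the name and namespace below are illustrative, not the actual gprd manifest), the feature could be enabled like this:

```yaml
# Illustrative Prometheus CR excerpt; the name and namespace are assumptions,
# not the actual gprd manifest.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: gitlab-monitoring
  namespace: monitoring
spec:
  # Passes --enable-feature=memory-snapshot-on-shutdown to Prometheus (>= 2.30).
  enableFeatures:
    - memory-snapshot-on-shutdown
```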

Summary of changes per environment and instance

| Env  | Cluster         | Prometheus instance                   | Memory requests | Memory limits  | CPU requests | CPU limits     | Dashboard |
|------|-----------------|---------------------------------------|-----------------|----------------|--------------|----------------|-----------|
| gprd | gprd-us-east1-d | gitlab-monitoring-promethe-prometheus | 400Gi           | 600Gi -> 400Gi | 65000m       | 75000m -> none | Link      |
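For reference, the target values above would map onto a `resources` block along these lines (a sketch only; where exactly this lives depends on how the gitlab-monitoring release is templated):

```yaml
# Sketch of the target resources for gitlab-monitoring-promethe-prometheus;
# the surrounding manifest structure is an assumption.
resources:
  requests:
    cpu: 65000m
    memory: 400Gi     # unchanged
  limits:
    memory: 400Gi     # was 600Gi; now equal to the request
    # CPU limit removed (was 75000m) so the pods are no longer CFS-throttled
```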

About memory-snapshot-on-shutdown

  • Faster restarts, quicker recovery. It also helps when the different probes fail for different reasons and Prometheus receives a SIGTERM.
  • Snapshots take additional disk space.
  • Depending on how many series there are and the write speed of the disk, shutdown can take some time. We would therefore want to adjust the pod termination grace period, but that setting is not yet supported in the Prometheus Operator (https://github.com/prometheus-operator/prometheus-operator/issues/3433). At the moment it is hardcoded to 10m (600s), which I believe will be more than enough for this use case. I will test this assumption while rolling out the setting (see the excerpt after this list).
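For context, this is roughly what the hardcoded grace period looks like in the StatefulSet the operator generates; only `terminationGracePeriodSeconds` is the point here, the rest of the excerpt is illustrative:

```yaml
# Excerpt of the operator-generated StatefulSet pod template;
# everything except terminationGracePeriodSeconds is illustrative.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600   # 10m, currently hardcoded by the operator
      containers:
        - name: prometheus
          # ...
```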

Additional notes

Our peak memory usage is around 268Gi, which coincides with CPU throttling and a peak of close to 55 million active time series. Prometheus uses a lot of memory during crash recovery to process the WAL, so for now I am keeping the 400Gi memory limit. We might be able to reduce that number after testing crash recovery in Prometheus.

Rule of thumb: provision roughly double the memory used at peak.
