
[prometheus] gprd-us-east1-d Prometheus resources

What and Why

Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24579

Common changes and best practices

  • Memory request = memory limit. Best practice; again, it reduces the impact of the OOMKiller on these workloads.
  • Enabled the "memory-snapshot-on-shutdown" feature (requires Prometheus >= 2.30). On shutdown, Prometheus writes a raw snapshot of its current in-memory state, which can then be re-read into memory more efficiently when the server restarts, reducing start-up time by roughly 50-80%. Faster restarts: https://github.com/prometheus/prometheus/pull/7229 (a configuration sketch follows this list).
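As a rough sketch, assuming the cluster uses the Prometheus Operator's `Prometheus` custom resource and an operator version that supports `spec.enableFeatures` (the name and namespace below are illustrative, not the actual gprd manifest), the feature could be enabled like this:

```yaml
# Illustrative Prometheus CR excerpt; the name and namespace are assumptions,
# not the actual gprd manifest.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: gitlab-monitoring
  namespace: monitoring
spec:
  # Passes --enable-feature=memory-snapshot-on-shutdown to Prometheus (>= 2.30).
  enableFeatures:
    - memory-snapshot-on-shutdown
```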

Summary of changes per environment and instance

| Env  | Cluster         | Prometheus instance                   | Memory requests | Memory limits  | CPU requests | CPU limits     | Dashboard |
|------|-----------------|---------------------------------------|-----------------|----------------|--------------|----------------|-----------|
| gprd | gprd-us-east1-d | gitlab-monitoring-promethe-prometheus | 400Gi           | 600Gi -> 400Gi | 65000m       | 75000m -> none | Link      |
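For reference, the target values above would map onto a `resources` block along these lines (a sketch only; where exactly this lives depends on how the gitlab-monitoring release is templated):

```yaml
# Sketch of the target resources for gitlab-monitoring-promethe-prometheus;
# the surrounding manifest structure is an assumption.
resources:
  requests:
    cpu: 65000m
    memory: 400Gi     # unchanged
  limits:
    memory: 400Gi     # was 600Gi; now equal to the request
    # CPU limit removed (was 75000m) so the pods are no longer CFS-throttled
```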

About memory-snapshot-on-shutdown

  • Faster restarts, quicker recovery. It also helps when the different probes fail for different reasons and Prometheus receives a SIGTERM.
  • Snapshots take additional disk space.
  • Depending on how many series there are and the write speed of the disk, shutdown can take some time. We would therefore want to adjust the pod termination grace period, but that setting is not yet supported in the Prometheus Operator (https://github.com/prometheus-operator/prometheus-operator/issues/3433). At the moment it is hardcoded to 10m (600s), which I believe will be more than enough for this use case. I will test this assumption while rolling out the setting (see the excerpt after this list).
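For context, this is roughly what the hardcoded grace period looks like in the StatefulSet the operator generates; only `terminationGracePeriodSeconds` is the point here, the rest of the excerpt is illustrative:

```yaml
# Excerpt of the operator-generated StatefulSet pod template;
# everything except terminationGracePeriodSeconds is illustrative.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600   # 10m, currently hardcoded by the operator
      containers:
        - name: prometheus
          # ...
```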

Additional notes

Our peak memory usage is around 268Gi, which coincides with CPU throttling and a peak of close to 55 million active time series. Prometheus uses a lot of memory during crash recovery to process the WAL, so for now I am keeping the 400Gi memory limit. We might be able to reduce that number after testing crash recovery in Prometheus.

Rule of thumb: provision roughly double the memory used at peak.
