[prometheus] gprd release=prometheus-gitlab-app-1 (!3467)

What and Why

Needs:

Add memory limits. Not having a memory limit in Kubernetes increases the chance of being killed by the OOMKiller.
Memory request = memory limit. This is a best practice and again reduces the impact of the OOMKiller on these workloads.
Enable the feature "memory-snapshot-on-shutdown" (requires Prometheus >= 2.30). Prometheus writes out a more raw snapshot of its current in-memory state upon shutdown, which can then be re-read into memory more efficiently when the server restarts. This reduces startup time by 50-80%, giving faster restarts. https://github.com/prometheus/prometheus/pull/7229

Env	Cluster env	Prometheus instance	Memory requests	Memory limits	CPU requests	CPU Limits	Dashboard
gprd	gprd-gitlab-gke	prometheus-prometheus-gitlab-app-1-pr-prometheus	10Gi	100Gi -> 10Gi	0.8 -> 1	75 -> none	Link
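
As a rough illustration, the needs and the table row above could translate into Helm values along these lines. This is only a sketch: the key layout assumes the kube-prometheus-stack chart (`prometheus.prometheusSpec`), which may not match how this repository structures its values files. The numbers are the target values from the table.

```yaml
# Hypothetical values for release=prometheus-gitlab-app-1 in gprd.
# The key layout is an assumption (kube-prometheus-stack style).
prometheus:
  prometheusSpec:
    enableFeatures:
      - memory-snapshot-on-shutdown  # requires Prometheus >= 2.30
    resources:
      requests:
        cpu: 1        # raised from 0.8
        memory: 10Gi
      limits:
        memory: 10Gi  # lowered from 100Gi so that limit == request
        # no CPU limit (the previous limit of 75 is removed)
```

Setting the memory limit equal to the request keeps the memory the scheduler reserves for the pod and the ceiling enforced at runtime in agreement.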

This MR should only affect the gprd (gprd-gitlab-gke) environment, and only release=prometheus-gitlab-app-1.

Faster restarts and quicker recovery. It also works when the probes fail, for whatever reason, and Prometheus is sent SIGTERM.
Snapshots will take additional space.
Depending on how many series you have and the write speed of the disk, shutdown can take a little time. We would therefore need to adjust the pod termination grace period, but that setting is not yet supported in the Prometheus Operator: https://github.com/prometheus-operator/prometheus-operator/issues/3433. At the moment it is hardcoded to 10m (600s), which I believe will be more than enough for this use case. I will test this assumption as I roll out the setting.

CPU usage is less than half a core, but giving Prometheus less than one CPU core is not a good idea.

Edited Nov 01, 2023 by Raúl Naveiras