[prometheus] gstg - Prometheus resource audit
## What and why
Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24579
## Common changes and best practices
- Add memory limits. Not having a memory limit in Kubernetes increases the chance of the pod being killed by the OOMKiller.
- Set memory request = memory limit. This is a best practice for these workloads: it gives the pod Guaranteed QoS, which again reduces the impact of the OOMKiller.
- Enable the `memory-snapshot-on-shutdown` feature (requires Prometheus >= 2.30). Prometheus writes a raw snapshot of its current in-memory state on shutdown, which can then be re-read into memory more efficiently when the server restarts, reducing startup time by 50-80%. Faster restarts: https://github.com/prometheus/prometheus/pull/7229
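As a rough sketch, these settings could be expressed on a Prometheus Operator `Prometheus` resource like the one below (the name and values are illustrative, not the exact gstg manifests):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: gitlab-monitoring  # illustrative name
spec:
  resources:
    requests:
      # Request = limit gives the pod Guaranteed QoS,
      # reducing its exposure to the OOMKiller.
      memory: 15Gi
      cpu: 1500m
    limits:
      memory: 15Gi
  # Requires Prometheus >= 2.30 and a recent Prometheus Operator.
  enableFeatures:
    - memory-snapshot-on-shutdown
```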
## Summary of changes per environment and instance
| Env | Cluster env | Prometheus instance | Memory requests | Memory limits | CPU requests | CPU limits | Dashboard |
|---|---|---|---|---|---|---|---|
| gstg | gstg-gitlab-gke | gitlab-monitoring-promethe-prometheus | 10Gi -> 15Gi | none -> 15Gi | 1500m | none | Link |
| gstg | gstg-us-east1-b | gitlab-monitoring-promethe-prometheus | 10Gi | none -> 10Gi | 1500m | none | Link |
| gstg | gstg-us-east1-c | gitlab-monitoring-promethe-prometheus | 10Gi | none -> 10Gi | 1500m | none | Link |
| gstg | gstg-us-east1-d | gitlab-monitoring-promethe-prometheus | 10Gi | none -> 10Gi | 1500m | none | Link |
| gstg | gstg-gitlab-gke | gitlab-rw-prometheus | 10Gi -> 15Gi | none -> 15Gi | 1500m | none | Link |
| gstg | gstg-us-east1-b | gitlab-rw-prometheus | 10Gi -> 7Gi | none -> 7Gi | 1500m | none | Link |
| gstg | gstg-us-east1-c | gitlab-rw-prometheus | 10Gi -> 7Gi | none -> 7Gi | 1500m | none | Link |
| gstg | gstg-us-east1-d | gitlab-rw-prometheus | 10Gi -> 7Gi | none -> 7Gi | 1500m | none | Link |
See memory usage for the last 7 days here.
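One way to check recent memory usage yourself, assuming the standard cAdvisor metrics are scraped (the label matchers below are illustrative, not the exact gstg series):

```promql
# Peak working-set memory per Prometheus pod over the last 7 days
max by (pod) (
  max_over_time(
    container_memory_working_set_bytes{container="prometheus", pod=~".*prometheus.*"}[7d]
  )
)
```

Working-set bytes is the figure the OOMKiller acts on, so sizing requests/limits against its 7-day peak (plus headroom) is a reasonable heuristic.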
## About memory-snapshot-on-shutdown
- Faster restarts mean quicker recovery. This also helps when probes fail for whatever reason and Prometheus receives a SIGTERM.
- Snapshots take additional disk space.
- Depending on how many series you have and the write speed of the disk, shutdown can take some time. We may therefore need to adjust the pod termination grace period, but that setting is not yet supported in the Prometheus Operator: https://github.com/prometheus-operator/prometheus-operator/issues/3433. At the moment it is hardcoded to 10m (600s), which I believe will be more than enough for this use case. I will test this assumption as I roll out the setting.
Edited by Raúl Naveiras