[prometheus] gprd - Prometheus resource audit (only gprd-gitlab-gke release=gitlab-monitoring)
What and Why
Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24579
Needs: merge !3427 (merged) first, then rebase this MR on top of it.
Common changes and best practices
- Add memory limits. Not having a memory limit in Kubernetes increases the chance of the pod being killed by the OOMKiller under node memory pressure.
- Memory request = memory limit. Best practice: the scheduler then reserves all the memory the pod can use, which again reduces the impact of the OOMKiller on these workloads.
- Enable the "memory-snapshot-on-shutdown" feature flag (requires Prometheus >= 2.30). On shutdown, Prometheus writes a raw snapshot of its current in-memory state, which can be re-read into memory much more efficiently when the server restarts, reducing start-up time by 50-80%. Faster restarts: https://github.com/prometheus/prometheus/pull/7229. See the sketch after this list.
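As a rough sketch (not the actual Helm chart diff in this MR), these three changes map onto the Prometheus Operator's `Prometheus` custom resource like this, using the values from the summary table below:

```yaml
# Sketch only: the real values live in the gitlab-monitoring Helm release
# for gprd-gitlab-gke; field names follow the Prometheus Operator CRD.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: gitlab-monitoring-promethe-prometheus
spec:
  resources:
    requests:
      # Request = limit for memory, so the scheduler reserves everything
      # the pod may use and the OOMKiller is less likely to pick it.
      memory: 400Gi
      cpu: 65000m
    limits:
      memory: 400Gi
      # No CPU limit: CPU is compressible, so dropping the limit avoids
      # throttling instead of risking a kill.
  # Requires Prometheus >= 2.30; restarts replay the snapshot instead of
  # the full WAL.
  enableFeatures:
    - memory-snapshot-on-shutdown
```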
Summary of changes per environment and instance
| Env | Cluster env | Prometheus instance | Memory requests | Memory limits | CPU requests | CPU limits | Dashboard |
|---|---|---|---|---|---|---|---|
| gprd | gprd-gitlab-gke | gitlab-monitoring-promethe-prometheus | 400Gi | 600Gi -> 400Gi | 65000m | 75000m -> none | Link |
This MR should only affect the gprd (gprd-gitlab-gke) environment, and only the release=gitlab-monitoring release. The releases release=prometheus-gitlab-app-1 and release=prometheus-gitlab-rw should show no diff; they will be updated in a follow-up MR.
About memory-snapshot-on-shutdown
- Faster restarts, quicker recovery. This also helps when any of the probes fail, for whatever reason, and Prometheus receives a SIGTERM.
- Snapshots will take additional space.
- Depending on how many series you have and the write speed of the disk, shutdown can take a while. Therefore, we may need to adjust the pod termination grace period, but that setting is not yet supported in the Prometheus Operator: https://github.com/prometheus-operator/prometheus-operator/issues/3433. At the moment it is hardcoded to 10m (600s), which I believe will be more than enough for this use case. I will test this assumption as I roll out the setting.
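For reference, this is roughly the fragment the operator renders into the StatefulSet pod template today; the 600s value is the operator's hardcoded default mentioned above, not something this MR sets:

```yaml
# Fragment of the StatefulSet generated by the Prometheus Operator; the
# grace period is hardcoded (issue 3433 above) and cannot currently be
# set through the Prometheus CR.
spec:
  template:
    spec:
      # Prometheus has up to 600s after SIGTERM to write the memory
      # snapshot before the kubelet sends SIGKILL.
      terminationGracePeriodSeconds: 600
```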
Additional notes
Our peak memory usage is around 221Gi; this is a big instance. Prometheus uses a lot of memory during crash recovery to replay the WAL. For now I am keeping the 400Gi memory limit; we might be able to reduce that number after testing crash recovery in Prometheus.
Rule of thumb: provision roughly double the memory used at peak (here 2 × 221Gi ≈ 442Gi, in the ballpark of the 400Gi limit).