gstg Prometheus in Kubernetes OOMing
This was initially triggered by this Dead Man's Snitch alert, which @jarv escalated. While tracking down the source of that alert, we found it was introduced by https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10786.

This led to the discovery that Prometheus is crashing with an OOM on the regional gstg cluster:
```
➜  ~ k get pods -n monitoring
NAME                                                 READY   STATUS             RESTARTS   AGE
prometheus-gitlab-monitoring-promethe-prometheus-0   3/4     CrashLoopBackOff   18         50m
prometheus-gitlab-monitoring-promethe-prometheus-1   3/4     CrashLoopBackOff   22         69m
```
```
➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.status.containerStatuses[]|select(.name == "prometheus").lastState'
{
  "terminated": {
    "containerID": "docker://8b42e31f805e8b3f4a6b2d52acec58d5ac7b02002cf754fa1c7514baf74fb798",
    "exitCode": 137,
    "finishedAt": "2021-07-02T09:53:37Z",
    "message": "[snip]",
    "reason": "OOMKilled",
    "startedAt": "2021-07-02T09:51:38Z"
  }
}
```
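As a sanity check, the same jq approach can be generalized to list every container in the namespace whose last termination reason was `OOMKilled` (a sketch, assuming `kubectl` access to the cluster and jq available; the filter is mine, not from the original investigation):

```shell
# List "pod/container" for every container whose last state is an OOM kill.
# containerStatuses[]? tolerates pods that have no statuses yet; indexing a
# missing .lastState.terminated yields null in jq, so the select is safe.
kubectl get pods -n monitoring -o json \
  | jq -r '.items[]
      | .metadata.name as $pod
      | .status.containerStatuses[]?
      | select(.lastState.terminated.reason == "OOMKilled")
      | "\($pod)/\(.name)"'
```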
As a mitigation, we'll want to bump the resources.
Current gstg:
- Memory requests: 10GiB
- Memory available per node: 12GiB
Current gprd:
- Memory requests: 60GiB
- Memory available per node: 54GiB
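The per-node figures above can be double-checked against what the kubelet actually reports as allocatable (a sketch, again assuming cluster access and jq; not part of the original output):

```shell
# Print each node's name and allocatable memory, tab-separated.
kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name)\t\(.status.allocatable.memory)"'
```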
```
# gstg
➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.spec.containers[]|select(.name == "prometheus").resources'
{
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}

# gprd
➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.spec.containers[]|select(.name == "prometheus").resources'
{
  "requests": {
    "cpu": "6",
    "memory": "60Gi"
  }
}
```
Since the current gstg requests (10GiB) already sit close to the per-node available memory (12GiB), bumping them means we'll also need to increase the underlying instance size.
We can bump gstg requests here.
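For reference, a sketch of what the bump could look like if this Prometheus is managed by the Prometheus Operator (the resource name below is inferred from the pod names and the `16Gi` value is purely illustrative, not a decided target):

```yaml
# Illustrative only: raise the Prometheus container memory request on gstg
# via the prometheus-operator Prometheus custom resource.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: gitlab-monitoring-promethe-prometheus   # assumed name
  namespace: monitoring
spec:
  resources:
    requests:
      cpu: "2"
      memory: 16Gi   # hypothetical new value; needs nodes with >16Gi allocatable
```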