gstg Prometheus in Kubernetes OOMing

This was initially triggered by a Dead Man's Snitch alert that @jarv escalated. While tracking down the source of that alert, we found it was introduced by https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10786.

This led to the discovery that prometheus is crashing with an OOM on the regional gstg cluster:

```
➜  ~ k get pods -n monitoring
NAME                                                    READY   STATUS             RESTARTS   AGE
prometheus-gitlab-monitoring-promethe-prometheus-0      3/4     CrashLoopBackOff   18         50m
prometheus-gitlab-monitoring-promethe-prometheus-1      3/4     CrashLoopBackOff   22         69m

➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.status.containerStatuses[]|select(.name == "prometheus").lastState'
{
  "terminated": {
    "containerID": "docker://8b42e31f805e8b3f4a6b2d52acec58d5ac7b02002cf754fa1c7514baf74fb798",
    "exitCode": 137,
    "finishedAt": "2021-07-02T09:53:37Z",
    "message": "[snip]",
    "reason": "OOMKilled",
    "startedAt": "2021-07-02T09:51:38Z"
  }
}
```
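A quick way to check whether other containers in the namespace are hitting the same failure mode is to run the equivalent `jq` filter over every pod at once. This is a sketch (the filter is ours, not taken from the incident); it assumes `kubectl` is pointed at the gstg regional cluster:

```shell
# List every container in the monitoring namespace whose last termination
# was an OOM kill, instead of inspecting pods one at a time.
kubectl get pods -n monitoring -o json \
  | jq -r '.items[]
      | .metadata.name as $pod
      | .status.containerStatuses[]?
      | select(.lastState.terminated.reason == "OOMKilled")
      | "\($pod)/\(.name) exit=\(.lastState.terminated.exitCode)"'
```

Exit code 137 (128 + SIGKILL) alone only tells us the container was SIGKILLed; `reason: OOMKilled` is what confirms the kernel's OOM killer, rather than a failed probe, terminated it.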

As a mitigation, we'll want to bump Prometheus's memory requests.

Current gstg:

- Memory requests: 10GiB
- Memory available per node: 12GiB

Current gprd:

- Memory requests: 60GiB
- Memory available per node: 54GiB

```
# gstg
➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.spec.containers[]|select(.name == "prometheus").resources'
{
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}

# gprd
➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.spec.containers[]|select(.name == "prometheus").resources'
{
  "requests": {
    "cpu": "6",
    "memory": "60Gi"
  }
}
```

Since the gstg requests (10GiB) are already close to the 12GiB available per node, bumping them on gstg means we'll also need to increase the underlying instance size.

We can bump gstg requests here.
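The stack appears to be managed by prometheus-operator (the pod name `prometheus-gitlab-monitoring-promethe-prometheus-0` follows its naming scheme), so the bump would likely go through Helm values rather than the pod spec directly. A hypothetical sketch of what the gstg change might look like (the key paths and the 14Gi figure are illustrative assumptions, not the actual change):

```yaml
# Illustrative kube-prometheus-stack-style values fragment (assumed layout):
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: "2"
        memory: 14Gi   # assumed target, up from 10Gi; needs nodes with >14GiB allocatable
```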
