gstg Prometheus in Kubernetes OOMing

This was initially triggered by a Dead Man's Snitch alert that @jarv escalated. While tracking down the source of that alert, we found it was introduced by https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10786.

This led to the discovery that prometheus is crashing with an OOM on the regional gstg cluster:

```
➜  ~ k get pods -n monitoring
NAME                                                    READY   STATUS             RESTARTS   AGE
prometheus-gitlab-monitoring-promethe-prometheus-0      3/4     CrashLoopBackOff   18         50m
prometheus-gitlab-monitoring-promethe-prometheus-1      3/4     CrashLoopBackOff   22         69m

➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.status.containerStatuses[]|select(.name == "prometheus").lastState'
{
  "terminated": {
    "containerID": "docker://8b42e31f805e8b3f4a6b2d52acec58d5ac7b02002cf754fa1c7514baf74fb798",
    "exitCode": 137,
    "finishedAt": "2021-07-02T09:53:37Z",
    "message": "[snip]",
    "reason": "OOMKilled",
    "startedAt": "2021-07-02T09:51:38Z"
  }
}
```
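A quick way to check whether other containers in the namespace are hitting the same failure mode is to run the equivalent `jq` filter over every pod at once. This is a sketch (the filter is ours, not taken from the incident); it assumes `kubectl` is pointed at the gstg regional cluster:

```shell
# List every container in the monitoring namespace whose last termination
# was an OOM kill, instead of inspecting pods one at a time.
kubectl get pods -n monitoring -o json \
  | jq -r '.items[]
      | .metadata.name as $pod
      | .status.containerStatuses[]?
      | select(.lastState.terminated.reason == "OOMKilled")
      | "\($pod)/\(.name) exit=\(.lastState.terminated.exitCode)"'
```

Exit code 137 (128 + SIGKILL) alone only tells us the container was SIGKILLed; `reason: OOMKilled` is what confirms the kernel's OOM killer, rather than a failed probe, terminated it.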

As a mitigation, we'll want to bump Prometheus's memory requests.

Current gstg:

- Memory requests: 10GiB
- Memory available per node: 12GiB

Current gprd:

- Memory requests: 60GiB
- Memory available per node: 54GiB

```
# gstg
➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.spec.containers[]|select(.name == "prometheus").resources'
{
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}

# gprd
➜  ~ k get pods -n monitoring prometheus-gitlab-monitoring-promethe-prometheus-0 -o json | jq '.spec.containers[]|select(.name == "prometheus").resources'
{
  "requests": {
    "cpu": "6",
    "memory": "60Gi"
  }
}
```

Since the gstg requests (10GiB) are already close to the 12GiB available per node, bumping them on gstg means we'll also need to increase the underlying instance size.

We can bump gstg requests here.
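The stack appears to be managed by prometheus-operator (the pod name `prometheus-gitlab-monitoring-promethe-prometheus-0` follows its naming scheme), so the bump would likely go through Helm values rather than the pod spec directly. A hypothetical sketch of what the gstg change might look like (the key paths and the 14Gi figure are illustrative assumptions, not the actual change):

```yaml
# Illustrative kube-prometheus-stack-style values fragment (assumed layout):
prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: "2"
        memory: 14Gi   # assumed target, up from 10Gi; needs nodes with >14GiB allocatable
```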
