2020-09-30: The Kube Persistent Volume Claim inode saturation exceeds the SLO and is close to its capacity limit
Summary
One of the Prometheus pods in GKE got caught in a crash loop, which led to increased utilization of its volume claim. Freeing some disk space and increasing the resources allocated to the pod alleviated the problem.
Timeline
All times UTC.
2020-09-29
- 19:18 - Volume claim utilization for the prometheus-1 pod starts increasing
2020-09-30
- 08:06 - Volume claim utilization for the prometheus-1 pod reaches the hard SLO threshold (90%)
- 08:10 - An alert is triggered
- 08:16 - ahmad declares an incident in Slack using the `/incident declare` command.
- 08:20 - The alert is resolved.
- 08:35 - The alert is triggered again.
- 09:17 - We notice that a GKE Prometheus process is caught in a restart loop (OOM-ing and then restarting, repeatedly)
- 09:21 - The alert is silenced as it has started flapping.
- 09:59 - We try to drop a certain set of unused Sidekiq metrics from Prometheus gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!226 (merged); a sketch of this kind of drop rule follows the timeline.
- 11:22 - We consider increasing the size of the volume claim to give Prometheus some headroom gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!227 (merged).
- 11:34 - The volume increase doesn't seem to affect the situation.
- 12:19 - We set resource requests for the Prometheus pod gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!228 (merged) to give it some headroom memory- and CPU-wise; the shape of the change is sketched after this timeline.
- 13:00 - We SSH into the thanos-sidecar container on the pod in question to do a disk cleanup (*.tmp directories and 0-byte WALs); see the cleanup sketch after this timeline.
- 13:03 - The Prometheus container starts successfully and volume utilization is down to 85%
- 13:19 - We roll back the volume size increase gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!230 (merged)
- 13:03 - Volume utilization is down to 70% after Prometheus did its regular WAL cleanup
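For reference, the metric-dropping step above boils down to a Prometheus `metric_relabel_configs` drop rule. The following is a minimal sketch only; the real change is in gitlab-helmfiles!226, and the job name and service-discovery settings here are assumptions.

```yaml
# Illustrative sketch only: the real change lives in gitlab-helmfiles!226.
# The job name and service discovery settings below are assumptions.
scrape_configs:
  - job_name: sidekiq                 # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # Drop the unused Sidekiq histogram series before they are ingested.
      - source_labels: [__name__]
        regex: http_request_duration_seconds_.*
        action: drop
```

Because `metric_relabel_configs` is applied before ingestion, the dropped series never reach the TSDB or its WAL, which is what relieves the disk pressure.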
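The resource-request change has roughly the following shape. The numbers here are purely illustrative assumptions; the values actually applied to the Prometheus pod are in gitlab-helmfiles!228.

```yaml
# Illustrative shape of a container resources stanza; the actual values
# applied to the prometheus-1 pod are in gitlab-helmfiles!228.
resources:
  requests:
    cpu: "1"        # hypothetical value
    memory: 8Gi     # hypothetical value
  limits:
    memory: 12Gi    # hypothetical value
```

Without any requests or limits the pod runs in the BestEffort QoS class: the scheduler reserves no capacity for it, and it is among the first candidates for eviction under node memory pressure.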
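The manual disk cleanup amounted to removing leftover compaction *.tmp directories and zero-byte WAL segments from the Prometheus data volume. A sketch follows, assuming the Thanos sidecar mounts the data volume at /prometheus; the namespace, pod, and container names are assumptions.

```sh
# Sketch only; namespace, pod and container names are assumptions, and the
# data directory is assumed to be mounted at /prometheus in the sidecar.

# Remove leftover *.tmp compaction directories
kubectl -n monitoring exec prometheus-1 -c thanos-sidecar -- \
  sh -c 'find /prometheus -maxdepth 1 -name "*.tmp" -exec rm -rf {} \;'

# Remove zero-byte WAL segments left behind by the crash loop
kubectl -n monitoring exec prometheus-1 -c thanos-sidecar -- \
  sh -c 'find /prometheus/wal -type f -size 0 -exec rm -f {} \;'
```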
Incident Review
Summary
- Service(s) affected: ~"Service::Prometheus"
- Team attribution: ~"team::Observability"
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- Mostly internal customers; effectively no one was affected.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- N/A
- How many customers were affected?
- N/A
- If a precise customer impact number is unknown, what is the estimated potential impact?
- Almost non-existent, as another Prometheus instance was scraping metrics just fine.
Incident Response Analysis
- How was the event detected?
- An alert was triggered on volume claim saturation (a generic sketch of such an alerting rule follows this list).
- How could detection time be improved?
- TBD
- How did we reach the point where we knew how to mitigate the impact?
- After involving the Observability team, we concluded that the Prometheus pods were under-provisioned in terms of CPU and memory, and that a disk cleanup was in order.
- How could time to mitigation be improved?
- Having runbooks for such situations certainly helps (gitlab-com/runbooks!2817 (merged))
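For reference, a PVC saturation alert of this kind can be expressed against the kubelet's volume metrics. This is a generic sketch only, assuming the alert keys off `kubelet_volume_stats_*`; the production rule is defined in the runbooks repository and is more involved.

```yaml
# Generic sketch of a PVC saturation alerting rule; the real rule is defined
# in the runbooks repository and may use different expressions and labels.
groups:
  - name: pvc-saturation
    rules:
      - alert: KubePersistentVolumeClaimSaturation
        expr: |
          kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.90
            or
          kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes > 0.90
        for: 15m
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} is above 90% disk or inode saturation"
```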
Post Incident Analysis
- How was the root cause diagnosed?
- A mixture of looking at the pod logs, observing the memory consumption of the pod, and seeing clear OOM-ing behavior.
- How could time to diagnosis be improved?
- N/A
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- No.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- No.
5 Whys
- Why was the volume claim utilization growing?
- Because Prometheus was caught in a crash loop and was accumulating WAL files in the process.
- Why was Prometheus in a crash loop?
- It didn't have enough memory to complete a clean startup and was constantly OOM-ing.
- Also, it was scraping a lot of useless metrics from Sidekiq, which caused an increase in disk consumption.
- Why didn't Prometheus have enough memory?
- Its K8s spec didn't have resource requests specified.
Lessons Learned
- Prometheus in GKE scrapes a lot of useless metrics from Sidekiq (`http_request_duration_seconds_.*`), and they can be dropped to save some disk space.
- Prometheus in GKE doesn't have enough memory to operate.
- Unactionable, but bumping the size of a volume claim doesn't work in GKE because the operator doesn't support it.
Corrective Actions
- gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!226 (merged)
- gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!228 (merged)
- gitlab-com/runbooks!2817 (merged)