2021-12-02: Prometheus - GKE gprd-us-east1-b has gone missing
Current Status
All Prometheus servers in gprd-us-east1-b are up and running.
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2021-12-02
- 15:46 - @cmcfarland declares incident in Slack.
- 15:55 - downgraded to S2 after confirming that there is no customer impact.
- 15:57 - downgraded to S3 after confirming that the alert is related to the Prometheus k8s cluster, which does not have an impact on users.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- ...
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline, or other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary.
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - SREs
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - None
- How many customers were affected?
  - None
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - None
What were the root causes?
The WAL was corrupted, so Prometheus could not flush it; memory pressure kept building up until we started hitting OOM kills.
How we found the root cause
- The Prometheus WAL got corrupted with torn records.
- Once the WAL was corrupted, we started seeing errors like:
  `cannot handle error: iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 2667074520`
- Diving into the Prometheus code, we traced the following call path (a simplified sketch follows this list):
  - `FlushWAL` is called.
  - `FlushWAL` calls `head.Init`, which starts reading the WAL.
  - Reading the m-mapped chunks fails; this is the error we see in the logs.
  - Prometheus calls `removeCorruptedMmappedChunks`.
  - `DeleteCorrupted` fails, as we can see from the `cannot handle error` message in the log.
  - The WAL keeps building up because Prometheus can't move on to the next segment.
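For illustration, here is a minimal, stubbed Go sketch of that control flow. It is not the actual Prometheus implementation; the function bodies are hypothetical and only the names follow the ones referenced above.

```go
package main

import (
	"errors"
	"fmt"
)

var errOutOfSequence = errors.New("iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 2667074520")

// flushWAL stands in for FlushWAL: the head must be initialised
// (replaying m-mapped chunks and the WAL) before segments can be
// flushed and truncated.
func flushWAL() error {
	if err := headInit(); err != nil {
		// head.Init never succeeds, so the WAL is never truncated and
		// segments (and memory) keep piling up.
		return fmt.Errorf("cannot handle error: %w", err)
	}
	return nil // WAL flushed; old segments can now be truncated
}

// headInit stands in for head.Init: replay m-mapped chunks and, on
// corruption, try to delete the corrupted chunks and carry on.
func headInit() error {
	if err := loadMmappedChunks(); err != nil {
		// Corruption detected: the cleanup path is attempted.
		if delErr := removeCorruptedMmappedChunks(); delErr != nil {
			// This is the failure mode we hit: deletion also fails,
			// so the replay error is surfaced and the WAL stays stuck.
			return err
		}
	}
	return nil
}

func loadMmappedChunks() error { return errOutOfSequence }

func removeCorruptedMmappedChunks() error { return errors.New("DeleteCorrupted failed") }

func main() {
	if err := flushWAL(); err != nil {
		fmt.Println(err) // mirrors the log line seen during the incident
	}
}
```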
How we recovered
We ended up deleting the WAL on disk, since fixing the WAL file was a "lost cause".
- Delete the WAL 👉 #5998 (comment 751864526)
- We had to wait a few hours before `head_chunks` were cleared up.
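Operationally, deleting the WAL means removing the `wal/` directory inside the Prometheus data directory while the server is down; here is a minimal sketch of that step, assuming the data directory is mounted at `/prometheus` (the actual cleanup was done directly against the pod's volume).

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Assumed mount point of the Prometheus data volume.
	dataDir := "/prometheus"

	// Removing the write-ahead log discards any samples that were only
	// in the WAL; Prometheus creates a fresh WAL on the next start.
	if err := os.RemoveAll(filepath.Join(dataDir, "wal")); err != nil {
		log.Fatalf("failed to delete WAL: %v", err)
	}
	log.Println("WAL deleted; restart Prometheus to recreate it")
}
```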
Incident Response Analysis
- How was the incident detected?
  - SRE paged because of missing metrics in us-east1-b.
- How could detection time be improved?
  - N/A
- How was the root cause diagnosed?
  - Digging into the Prometheus source code to understand why the WAL wasn't being flushed.
- How could time to diagnosis be improved?
  - Understanding why the WAL file was not being flushed.
- How did we reach the point where we knew how to mitigate the impact?
  - Reading the Prometheus source code.
- How could time to mitigation be improved?
  - ...
- What went well?
  - We have 2 pairs of Prometheus servers: we could fully recover one and keep investigating on the other, which gave the engineers breathing room.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - We had #5526 (closed) with similar symptoms, but the root cause of #5526 (closed) wasn't known; it might be the same.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
Lessons Learned
- How the WAL is read in Prometheus; we can read about this in detail in #5998 (comment 751818295).
- `chunks_head` is cleared up every 2 hours thanks to `max-block-duration`.
- Reading the Prometheus source code and correlating it with the log messages can be helpful to understand what is going on.
- Looking at https://prometheus-gke.gprd-us-east1-b.gitlab.net/flags shows all the flags passed to Prometheus (see the sketch after this list).
- How to take snapshots of a PVC.
- Prometheus starts a read-only TSDB for reading the WAL.
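The flags shown on the /flags page can also be pulled from the Prometheus HTTP API at `/api/v1/status/flags`. A minimal sketch, assuming the endpoint is reachable without additional authentication:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// The /flags web page is backed by the status/flags API endpoint.
	url := "https://prometheus-gke.gprd-us-east1-b.gitlab.net/api/v1/status/flags"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// Standard Prometheus API envelope: {"status": "...", "data": {...}}.
	var body struct {
		Status string            `json:"status"`
		Data   map[string]string `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		log.Fatalf("decode failed: %v", err)
	}

	// Print every flag Prometheus was started with,
	// e.g. storage.tsdb.max-block-duration.
	for name, value := range body.Data {
		fmt.Printf("%s = %s\n", name, value)
	}
}
```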
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).