2021-12-02: Prometheus - GKE gprd-us-east1-b has gone missing
Current Status
All Prometheus servers in gprd-us-east1-b are up and running.
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2021-12-02
- 15:46 - @cmcfarland declares incident in Slack.
- 15:55 - downgraded to S2 after confirming that there is no customer impact.
- 15:57 - downgraded to S3 after confirming that the alert is related to the Prometheus k8s cluster, which does not have an impact on users.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- ...
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline, or other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary.
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - SREs
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - None
- How many customers were affected?
  - None
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - None
What were the root causes?
The WAL was corrupted, so Prometheus could not flush it; memory pressure kept building up until we started hitting OOM kills.
How we found the root cause
- The Prometheus WAL got corrupted with torn records.
- Once the WAL was corrupted, we started seeing errors like:
  `cannot handle error: iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 2667074520`
- Diving into the Prometheus code, we traced the following call path (a simplified sketch follows this list):
  - `FlushWAL` is called.
  - `FlushWAL` calls `head.Init`, which starts reading the WAL.
  - Reading the m-mapped chunks fails; this is the error we see in the logs.
  - Prometheus calls `removeCorruptedMmappedChunks`.
  - `DeleteCorrupted` fails, as we can see from the `cannot handle error` message in the log.
  - The WAL keeps building up because Prometheus can't move on to the next segment.
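For illustration, here is a minimal, stubbed Go sketch of that control flow. It is not the actual Prometheus implementation; the function bodies are hypothetical and only the names follow the ones referenced above.

```go
package main

import (
	"errors"
	"fmt"
)

var errOutOfSequence = errors.New("iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 2667074520")

// flushWAL stands in for FlushWAL: the head must be initialised
// (replaying m-mapped chunks and the WAL) before segments can be
// flushed and truncated.
func flushWAL() error {
	if err := headInit(); err != nil {
		// head.Init never succeeds, so the WAL is never truncated and
		// segments (and memory) keep piling up.
		return fmt.Errorf("cannot handle error: %w", err)
	}
	return nil // WAL flushed; old segments can now be truncated
}

// headInit stands in for head.Init: replay m-mapped chunks and, on
// corruption, try to delete the corrupted chunks and carry on.
func headInit() error {
	if err := loadMmappedChunks(); err != nil {
		// Corruption detected: the cleanup path is attempted.
		if delErr := removeCorruptedMmappedChunks(); delErr != nil {
			// This is the failure mode we hit: deletion also fails,
			// so the replay error is surfaced and the WAL stays stuck.
			return err
		}
	}
	return nil
}

func loadMmappedChunks() error { return errOutOfSequence }

func removeCorruptedMmappedChunks() error { return errors.New("DeleteCorrupted failed") }

func main() {
	if err := flushWAL(); err != nil {
		fmt.Println(err) // mirrors the log line seen during the incident
	}
}
```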
How we recovered
We ended up deleting the WAL on disk, since fixing the WAL file was a "lost cause".
- Delete the WAL 👉 #5998 (comment 751864526)
- We had to wait a few hours before `head_chunks` were cleared up.
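Operationally, deleting the WAL means removing the `wal/` directory inside the Prometheus data directory while the server is down; here is a minimal sketch of that step, assuming the data directory is mounted at `/prometheus` (the actual cleanup was done directly against the pod's volume).

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Assumed mount point of the Prometheus data volume.
	dataDir := "/prometheus"

	// Removing the write-ahead log discards any samples that were only
	// in the WAL; Prometheus creates a fresh WAL on the next start.
	if err := os.RemoveAll(filepath.Join(dataDir, "wal")); err != nil {
		log.Fatalf("failed to delete WAL: %v", err)
	}
	log.Println("WAL deleted; restart Prometheus to recreate it")
}
```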
Incident Response Analysis
- How was the incident detected?
  - SRE paged because of missing metrics in us-east1-b.
- How could detection time be improved?
  - N/A
- How was the root cause diagnosed?
  - Digging into the Prometheus source code to understand why the WAL wasn't being flushed.
- How could time to diagnosis be improved?
  - Understanding why the WAL file was not being flushed.
- How did we reach the point where we knew how to mitigate the impact?
  - Reading the Prometheus source code.
- How could time to mitigation be improved?
  - ...
- What went well?
  - We have 2 pairs of Prometheus servers: we could fully recover one and keep investigating on the other, which gave the engineers breathing room.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - We had #5526 (closed) with similar symptoms, but the root cause of #5526 (closed) wasn't known; it might be the same.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
Lessons Learned
- How the WAL is read in Prometheus; we can read about this in detail in #5998 (comment 751818295).
- `chunks_head` is cleared up every 2 hours thanks to `max-block-duration`.
- Reading the Prometheus source code and correlating it with the log messages can be helpful to understand what is going on.
- Looking at https://prometheus-gke.gprd-us-east1-b.gitlab.net/flags shows all the flags passed to Prometheus (see the sketch after this list).
- How to take snapshots of a PVC.
- Prometheus starts a read-only TSDB for reading the WAL.
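The flags shown on the /flags page can also be pulled from the Prometheus HTTP API at `/api/v1/status/flags`. A minimal sketch, assuming the endpoint is reachable without additional authentication:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// The /flags web page is backed by the status/flags API endpoint.
	url := "https://prometheus-gke.gprd-us-east1-b.gitlab.net/api/v1/status/flags"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	// Standard Prometheus API envelope: {"status": "...", "data": {...}}.
	var body struct {
		Status string            `json:"status"`
		Data   map[string]string `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		log.Fatalf("decode failed: %v", err)
	}

	// Print every flag Prometheus was started with,
	// e.g. storage.tsdb.max-block-duration.
	for name, value := range body.Data {
		fmt.Printf("%s = %s\n", name, value)
	}
}
```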
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).