Prometheus recovery on WAL file corruption
Overview
In production#5998 (closed) we started seeing Prometheus servers being OOM killed. Upon further investigation, we found a large WAL file and `chunks_head` directory because the WAL failed to flush. The root cause was a corrupted WAL file. To get more context around the incident itself, it's suggested to go through the incident review.
We can look at the following stack to understand how Prometheus reads the WAL file (there are some assumptions here, so it might not be 100% correct):

- `FlushWAL` is called.
- `FlushWAL` calls `head.Init` - we start reading the WAL file.
- We fail to read the mmapped chunks. We see this error in the logs.
- Prometheus calls `removeCorruptedMmappedChunks`.
- `DeleteCorrupted` fails, as we can see in the `cannot handle error` in the log - the WAL keeps building up because it can't move on to the next segment.
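The stall described above can be sketched as a replay loop that cannot advance. This is a hypothetical simplification: `replay`, `readSegment`, and `deleteCorrupted` are illustrative stand-ins, not the real Prometheus TSDB API.

```go
package main

import (
	"errors"
	"fmt"
)

// errCorrupt stands in for the checksum/read error the WAL reader reports.
var errCorrupt = errors.New("corrupted segment")

// readSegment simulates reading WAL segment n, failing on the corrupted one.
func readSegment(n, corruptedAt int) error {
	if n == corruptedAt {
		return errCorrupt
	}
	return nil
}

// deleteCorrupted simulates the cleanup step. In the incident this cleanup
// failed ("cannot handle error"), so replay never got past the bad segment.
func deleteCorrupted(n int) error {
	return fmt.Errorf("cannot handle error: segment %d", n)
}

// replay walks the WAL segments in order; it returns the segment it stalled
// on and the error, or (numSegments, nil) if every segment was read.
func replay(numSegments, corruptedAt int) (int, error) {
	for n := 0; n < numSegments; n++ {
		if err := readSegment(n, corruptedAt); err != nil {
			if delErr := deleteCorrupted(n); delErr != nil {
				// Stall: new WAL segments keep accumulating behind us.
				return n, delErr
			}
		}
	}
	return numSegments, nil
}

func main() {
	n, err := replay(5, 2)
	fmt.Printf("stalled at segment %d: %v\n", n, err)
}
```

The point of the sketch: because the cleanup error is returned rather than handled, a single bad segment blocks replay of everything after it, which is why both `wal` and `chunks_head` grow until the server is OOM killed.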
Questions we need to answer
- Can Prometheus recover, rather than stall, on WAL corruption so we don't end up with large `wal` and `chunks_head` files?
- Is this something that we can upstream to Prometheus, or at least file a bug report to them?
  - For us to file a bug report we should have a reproducible test/environment to get our point across.
Reproducing the problem
We took a disk snapshot called incident-production-5998 and created a disk from it with the same name. We can attach this disk to a GCP instance (instructions to follow) and try to reproduce the problem.
We can call `FlushWAL` directly in a small Go program:

```go
package main

import "github.com/prometheus/prometheus/tsdb"

func main() {
	// Open the TSDB data directory in read-only mode (nil logger).
	db, err := tsdb.OpenDBReadOnly("data", nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Replay the WAL and flush the head block to disk, as Prometheus
	// does on shutdown; this should hit the same corruption error.
	if err := db.FlushWAL("data"); err != nil {
		panic(err)
	}
}
```
Related GitHub issues of similar effect
- https://github.com/prometheus/prometheus/issues/6408
- https://github.com/prometheus/prometheus/issues/6655
- https://github.com/prometheus/prometheus/issues/8140
Why focus on recovery rather than prevention?
One could argue that time is better spent preventing the WAL file from being corrupted rather than focusing on recovery. However, WAL corruption can happen at any time for multiple reasons (disk issues, server crash, out-of-order metrics, a TSDB bug), so we need to make sure that when it happens, not if, we recover quickly and only lose a small number of metrics.
Production incidents this caused
- production#5998 (closed)
- production#6148 (closed)
- production#6242 (closed)
- production#6272 (closed)
- production#6354 (closed)
- production#6349 (closed)
Learnings
- Make Prometheus Use Less Memory and Restart Faster - Ganesh Vernekar, Grafana Labs
- tsdb internals documentation
- PromCon Online 2020 - TSDB WTF, Ian Billett, Improbable