Prometheus recovery on WAL file corruption
Overview
In production#5998 (closed) we started seeing Prometheus servers being OOM killed. Upon further investigation, we found a large WAL file and `chunks_head` directory because the WAL failed to flush. The root cause was a corrupted WAL file. To get more context around the incident itself, it's suggested to go through the incident review.
We can look at the following stack to understand how Prometheus reads the WAL file (there are some assumptions here, so it might not be 100% correct):

- `FlushWAL` is called.
- `FlushWAL` calls `head.Init` - we start reading the WAL file.
- We fail to read the mmapped chunks. We see this error in the logs.
- Prometheus calls `removeCorruptedMmappedChunks`.
- `DeleteCorrupted` fails, as we can see in the `cannot handle error` in the log - the WAL keeps building up because it can't move on to the next segment.
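The stall described above can be sketched as a replay loop that cannot advance. This is a hypothetical simplification: `replay`, `readSegment`, and `deleteCorrupted` are illustrative stand-ins, not the real Prometheus TSDB API.

```go
package main

import (
	"errors"
	"fmt"
)

// errCorrupt stands in for the checksum/read error the WAL reader reports.
var errCorrupt = errors.New("corrupted segment")

// readSegment simulates reading WAL segment n, failing on the corrupted one.
func readSegment(n, corruptedAt int) error {
	if n == corruptedAt {
		return errCorrupt
	}
	return nil
}

// deleteCorrupted simulates the cleanup step. In the incident this cleanup
// failed ("cannot handle error"), so replay never got past the bad segment.
func deleteCorrupted(n int) error {
	return fmt.Errorf("cannot handle error: segment %d", n)
}

// replay walks the WAL segments in order; it returns the segment it stalled
// on and the error, or (numSegments, nil) if every segment was read.
func replay(numSegments, corruptedAt int) (int, error) {
	for n := 0; n < numSegments; n++ {
		if err := readSegment(n, corruptedAt); err != nil {
			if delErr := deleteCorrupted(n); delErr != nil {
				// Stall: new WAL segments keep accumulating behind us.
				return n, delErr
			}
		}
	}
	return numSegments, nil
}

func main() {
	n, err := replay(5, 2)
	fmt.Printf("stalled at segment %d: %v\n", n, err)
}
```

The point of the sketch: because the cleanup error is returned rather than handled, a single bad segment blocks replay of everything after it, which is why both `wal` and `chunks_head` grow until the server is OOM killed.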
Questions we need to answer
- Can Prometheus recover, rather than stall, on WAL corruption so we don't end up with large `wal` and `chunks_head` files?
- Is this something that we can upstream to Prometheus, or at least file a bug report to them?
  - For us to file a bug report we should have a reproducible test/environment to get our point across.
Reproducing the problem
We took a disk snapshot called incident-production-5998 and created a disk from it with the same name. We can attach this disk to a GCP instance (instructions to follow) and try to reproduce the problem.
We can call `FlushWAL` directly in a small Go program:

```go
package main

import "github.com/prometheus/prometheus/tsdb"

func main() {
	// Open the TSDB data directory in read-only mode (nil logger).
	db, err := tsdb.OpenDBReadOnly("data", nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Replay the WAL and flush the head block to disk, as Prometheus
	// does on shutdown; this should hit the same corruption error.
	if err := db.FlushWAL("data"); err != nil {
		panic(err)
	}
}
```
Related GitHub issues of similar effect
- https://github.com/prometheus/prometheus/issues/6408
- https://github.com/prometheus/prometheus/issues/6655
- https://github.com/prometheus/prometheus/issues/8140
Why focus on recovery rather than prevention?
One could argue that time is better spent preventing the WAL file from being corrupted rather than focusing on recovery. However, WAL corruption can happen at any time for multiple reasons (disk issues, server crash, out-of-order metrics, a TSDB bug), so we need to make sure that when it happens, not if, we recover quickly and only lose a small number of metrics.
Production incidents this caused
- production#5998 (closed)
- production#6148 (closed)
- production#6242 (closed)
- production#6272 (closed)
- production#6354 (closed)
- production#6349 (closed)
Learnings
- Make Prometheus Use Less Memory and Restart Faster - Ganesh Vernekar, Grafana Labs
- tsdb internals documentation
- PromCon Online 2020 - TSDB WTF, Ian Billett, Improbable