Prometheus pod OOM-killed while restarting with a large WAL
Summary
This issue was observed on an internal platform: after a rolling update, the Prometheus pod repeatedly failed to come back up because it was OOM-killed while replaying its WAL:
Containers: │
│ prometheus: │
│ Container ID: containerd://540c3fbbf76596a729da061a15777e0dfd10359d6d8ad1e79c957ecd98cbcba8 │
│ Image: docker.io/rancher/prom-prometheus:v2.55.1 │
│ Image ID: docker.io/rancher/prom-prometheus@sha256:e864ec7cd5eb2d603ab549b80bd10e6d8c6a02949dc25e8ae14b0d2f8e63139a │
│ Port: 9090/TCP │
│ Host Port: 0/TCP │
│ Args: │
│ --web.console.templates=/etc/prometheus/consoles │
│ --web.console.libraries=/etc/prometheus/console_libraries │
│ --config.file=/etc/prometheus/config_out/prometheus.env.yaml │
│ --web.enable-lifecycle │
│ --web.external-url=http://rancher-monitoring-prometheus.cattle-monitoring-system:9090 │
│ --web.route-prefix=/ │
│ --storage.tsdb.retention.time=10d │
│ --storage.tsdb.path=/prometheus │
│ --storage.tsdb.wal-compression │
│ --web.config.file=/etc/prometheus/web_config/web-config.yaml │
│ State: Waiting │
│ Reason: CrashLoopBackOff │
│ Last State: Terminated │
│ Reason: OOMKilled │
│ Message: T14:10:19.903Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=631 maxSegment=644 │
│ ts=2025-10-24T14:10:19.904Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=632 maxSegment=644 │
│ ts=2025-10-24T14:10:19.905Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=633 maxSegment=644 │
│ ts=2025-10-24T14:10:19.905Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=634 maxSegment=644 │
│ ts=2025-10-24T14:10:19.906Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=635 maxSegment=644 │
│ ts=2025-10-24T14:10:19.907Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=636 maxSegment=644 │
│ ts=2025-10-24T14:10:19.908Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=637 maxSegment=644 │
│ ts=2025-10-24T14:10:19.908Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=638 maxSegment=644 │
│ ts=2025-10-24T14:10:19.909Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=639 maxSegment=644 │
│ ts=2025-10-24T14:10:19.910Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=640 maxSegment=644 │
│ ts=2025-10-24T14:10:19.911Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=641 maxSegment=644 │
│ ts=2025-10-24T14:10:19.911Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=642 maxSegment=644 │
│ ts=2025-10-24T14:10:19.912Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=643 maxSegment=644 │
│ ts=2025-10-24T14:10:19.913Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=644 maxSegment=644 │
│ ts=2025-10-24T14:10:19.913Z caller=head.go:831 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=9.714982359s wal_replay_duration=2m3.668697992s wbl_replay_duration=158ns chunk_snapshot_load_duration=0s mmap_chunk_replay_duration=4.112881093s total_replay_duration=2m17.496645357s │
│ │
│ Exit Code: 137 │
│ Started: Fri, 24 Oct 2025 14:08:02 +0000 │
│ Finished: Fri, 24 Oct 2025 14:10:25 +0000 │
│ Ready: False │
│ Restart Count: 405
There are many existing reports of this memory consumption spike during WAL replay:
- https://github.com/prometheus/prometheus/issues/6934
- https://github.com/prometheus/prometheus/issues/7955
We should probably adjust the default memory limit so that Prometheus has enough headroom to survive WAL replay after a restart.
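
As a starting point, here is a minimal sketch of such an override, assuming rancher-monitoring follows the upstream kube-prometheus-stack values layout (`prometheus.prometheusSpec.resources`); the values path and the sizes below are assumptions and would need to be validated against the actual chart defaults and the observed replay peak:

```yaml
# Hypothetical Helm values override for the rancher-monitoring chart.
# Assumes the kube-prometheus-stack layout: prometheus.prometheusSpec.resources.
# The sizes are placeholders, to be derived from the memory peak observed
# during WAL replay rather than taken as recommended defaults.
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi   # headroom for the WAL replay spike seen on restart
```

If the current default limit is only slightly below the replay peak, raising just the limit (and leaving the request as is) may already be enough to break the CrashLoopBackOff.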