Prometheus pods OOM-killed while restarting with large WAL

Summary

This issue was observed on an internal platform. After a rolling update, the Prometheus pod failed to start again because it was OOM-killed while replaying its WAL:

Containers:
  prometheus:
    Container ID:  containerd://540c3fbbf76596a729da061a15777e0dfd10359d6d8ad1e79c957ecd98cbcba8
    Image:         docker.io/rancher/prom-prometheus:v2.55.1
    Image ID:      docker.io/rancher/prom-prometheus@sha256:e864ec7cd5eb2d603ab549b80bd10e6d8c6a02949dc25e8ae14b0d2f8e63139a
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --web.enable-lifecycle
      --web.external-url=http://rancher-monitoring-prometheus.cattle-monitoring-system:9090
      --web.route-prefix=/
      --storage.tsdb.retention.time=10d
      --storage.tsdb.path=/prometheus
      --storage.tsdb.wal-compression
      --web.config.file=/etc/prometheus/web_config/web-config.yaml
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    OOMKilled
      Message:   T14:10:19.903Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=631 maxSegment=644
ts=2025-10-24T14:10:19.904Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=632 maxSegment=644
ts=2025-10-24T14:10:19.905Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=633 maxSegment=644
ts=2025-10-24T14:10:19.905Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=634 maxSegment=644
ts=2025-10-24T14:10:19.906Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=635 maxSegment=644
ts=2025-10-24T14:10:19.907Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=636 maxSegment=644
ts=2025-10-24T14:10:19.908Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=637 maxSegment=644
ts=2025-10-24T14:10:19.908Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=638 maxSegment=644
ts=2025-10-24T14:10:19.909Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=639 maxSegment=644
ts=2025-10-24T14:10:19.910Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=640 maxSegment=644
ts=2025-10-24T14:10:19.911Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=641 maxSegment=644
ts=2025-10-24T14:10:19.911Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=642 maxSegment=644
ts=2025-10-24T14:10:19.912Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=643 maxSegment=644
ts=2025-10-24T14:10:19.913Z caller=head.go:794 level=info component=tsdb msg="WAL segment loaded" segment=644 maxSegment=644
ts=2025-10-24T14:10:19.913Z caller=head.go:831 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=9.714982359s wal_replay_duration=2m3.668697992s wbl_replay_duration=158ns chunk_snapshot_load_duration=0s mmap_chunk_replay_duration=4.112881093s total_replay_duration=2m17.496645357s

      Exit Code:    137
      Started:      Fri, 24 Oct 2025 14:08:02 +0000
      Finished:     Fri, 24 Oct 2025 14:10:25 +0000
    Ready:          False
    Restart Count:  405
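
The timestamps in this pod description can be cross-checked with a few lines of Python. This is only a rough sketch: the helper name parse_go_duration is hypothetical, and the input strings are copied from the output above (Started, Finished, the timestamp of the "WAL replay completed" line, and total_replay_duration).

```python
import re
from datetime import datetime, timezone

# Seconds per Go duration unit. In the regex, "ms"/"us"/"ns" are listed
# before "m"/"s" so that e.g. "5ms" is not read as 5 minutes.
_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0,
          "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|µs|ns|h|m|s)")


def parse_go_duration(text: str) -> float:
    """Convert a Go-style duration such as '2m17.49s' to seconds (hypothetical helper)."""
    return sum(float(value) * _UNITS[unit] for value, unit in _PART.findall(text))


# Strings copied from the pod description above.
describe_fmt = "%a, %d %b %Y %H:%M:%S %z"
started = datetime.strptime("Fri, 24 Oct 2025 14:08:02 +0000", describe_fmt)
finished = datetime.strptime("Fri, 24 Oct 2025 14:10:25 +0000", describe_fmt)
replay_done = datetime.strptime(
    "2025-10-24T14:10:19.913Z", "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=timezone.utc)
total_replay = parse_go_duration("2m17.496645357s")

# Time between the "WAL replay completed" log line and container termination.
# "Finished" only has one-second resolution, so this is approximate.
gap_after_replay = (finished - replay_done).total_seconds()

# Time between container start and the beginning of replay, as a sanity check
# that the log timestamps and the reported replay duration agree.
startup_before_replay = (replay_done - started).total_seconds() - total_replay

print(f"killed about {gap_after_replay:.1f}s after WAL replay completed")
print(f"replay began about {startup_before_replay:.1f}s after container start")
```

Under these assumptions, the container appears to have been terminated roughly five seconds after replay finished, which would suggest that memory peaks at the very end of replay or immediately after it. That reading is an inference from this single restart, not something confirmed across the other 404.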

There are many upstream references to this memory consumption spike during WAL replay, for example:

  • https://github.com/prometheus/prometheus/issues/6934
  • https://github.com/prometheus/prometheus/issues/7955

We should probably raise the default memory limit so that Prometheus has enough headroom to get through WAL replay.
