feat(prometheus): enable startupProbe

Steve Xuereb requested to merge feat/enable-startup-probes into master

Background

In gitlab-com/gl-infra/production#5466 (closed) we saw Prometheus being killed repeatedly because it did not pass the readiness check. On startup, Prometheus reads a WAL file that can sometimes grow large. In our case we were not giving Prometheus enough time to read the WAL file before it was restarted again.

Solution

Use a startupProbe, a probe that runs before the readinessProbe and has a higher timeout and threshold, so that Prometheus has up to an hour to read a large WAL file. You can read more about startupProbe at https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
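
For illustration, a startupProbe with a one-hour budget could look roughly like the fragment below. This is a sketch, not the values from this MR: the port, period, threshold, and timeout numbers are assumptions, chosen so that `failureThreshold` multiplied by `periodSeconds` comes to 3600 seconds. `/-/ready` is Prometheus's readiness endpoint.

```yaml
# Hypothetical container spec fragment; all values are illustrative
# and are not taken from this merge request.
startupProbe:
  httpGet:
    path: /-/ready        # Prometheus readiness endpoint
    port: 9090            # assumed: default Prometheus port
  periodSeconds: 60       # assumed: check once a minute
  failureThreshold: 60    # 60 checks x 60s = up to 3600s (1 hour) to start
  timeoutSeconds: 10      # assumed
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  periodSeconds: 5        # assumed
  failureThreshold: 3     # assumed
```

Kubernetes holds off the other probes until the startupProbe has succeeded once, so the tighter readinessProbe settings only apply after Prometheus has finished starting up.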

This was previously tested in other environments in !511 (merged) and !513 (merged).

Reference: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14359

Edited by Steve Xuereb
