Use startupProbe for Prometheus pods
Summary
Start using the startupProbe for our Prometheus pods, so that livenessProbes and readinessProbes don't time out when a large WAL replay is happening. We tried increasing the failure threshold for liveness/readiness, but this is not sufficient and results in thrashing of the containers.
startupProbe is not available to us in the current prometheus-operator; we are currently blocked by https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13973
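
To illustrate why a startupProbe helps here: Kubernetes holds off liveness and readiness checks until the startup probe has succeeded once, and only restarts the container if the startup probe fails `failureThreshold` times in a row. That gives the container roughly `failureThreshold * periodSeconds` to finish the WAL replay. The sketch below just does that arithmetic and builds a probe spec as a plain dict. The numbers, the `/-/ready` path and the `web` port name are illustrative assumptions, not the values we settled on (those are in the linked note).

```python
def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a container has to pass its startupProbe before being restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


def build_startup_probe(path, port, failure_threshold, period_seconds):
    """Build a startupProbe spec as a plain dict (same field names as the Kubernetes Probe object)."""
    return {
        "httpGet": {"path": path, "port": port},
        "failureThreshold": failure_threshold,
        "periodSeconds": period_seconds,
    }


# Hypothetical values, for illustration only.
probe = build_startup_probe("/-/ready", "web", failure_threshold=60, period_seconds=15)
budget = startup_budget_seconds(probe["failureThreshold"], probe["periodSeconds"])
print(budget)  # 900 seconds, i.e. about 15 minutes for the WAL replay
```

With a separate startup budget, the liveness/readiness thresholds can stay tight for the steady state instead of being inflated to cover the replay.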
Related Incident(s)
Originating issue(s): production#5466 (closed)
Desired Outcome/Acceptance criteria
- Start using startupProbe and readinessProbe as defined in 👉 https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14359#note_701160878
- Enable it on gstg 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!511 (merged)
- Enable it on ops, pre, org-ci 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!513 (merged)
- Enable it on gprd 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!514 (merged)
- Update monitoring to check for pods that fail to start up or stay in crashloop backoff for longer than 12 hours, and create an issue automatically 👉 gitlab-com/runbooks!4072 (merged)
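
For the monitoring item, the condition we care about is "a pod has not been ready for more than 12 hours". The actual check lives in the linked runbooks MR and is not reproduced here; the snippet below is only a rough sketch of the condition in Python, with a hypothetical pod record shape (`name`, `ready`, `not_ready_since`) made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the acceptance criteria above.
STUCK_AFTER = timedelta(hours=12)


def stuck_pods(pods, now, stuck_after=STUCK_AFTER):
    """Return the names of pods that have been not-ready for longer than stuck_after.

    Each pod is a dict with hypothetical keys:
      name (str), ready (bool), not_ready_since (timezone-aware datetime or None).
    """
    return [
        pod["name"]
        for pod in pods
        if not pod["ready"]
        and pod["not_ready_since"] is not None
        and now - pod["not_ready_since"] > stuck_after
    ]


# Hypothetical usage: pass the current time in explicitly so the check is easy to test.
# names = stuck_pods(pods, now=datetime.now(timezone.utc))
```

Anything returned by a check like this would be a candidate for the automatically created issue.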
Associated Services
Corrective Action Issue Checklist
- link the incident(s) this corrective action arose out of
- give context for what problem this corrective action is trying to prevent from re-occurring
- assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- assign a priority (this will default to 'priority::4')
Edited by Steve Xuereb