Use startupProbe for Prometheus pods

Summary

Start using a startupProbe for our Prometheus pods so that the livenessProbe and readinessProbe don't time out during a large WAL replay. We tried increasing the failure threshold for the liveness and readiness probes, but this is not sufficient and results in thrashing of the containers.

startupProbe is not available to us in the current prometheus-operator version; we are currently blocked by https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13973
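For context, here is a rough sketch of what the probes on the Prometheus container could look like once a startupProbe is in place. Everything in it is an assumption for illustration: the port name `web`, the periods, and the thresholds are placeholders, and the values we actually roll out are the ones in the issue note linked under the acceptance criteria. The pod spec itself is generated by prometheus-operator, so this fragment shows the intended end state rather than something we would apply by hand.

```yaml
# Illustrative only: placeholder values, not the deployed configuration.
containers:
  - name: prometheus
    startupProbe:
      httpGet:
        path: /-/ready        # Prometheus reports not ready while the WAL is replaying
        port: web             # assumed port name
      periodSeconds: 15
      failureThreshold: 60    # 60 x 15s = up to 15 minutes allowed for startup
    livenessProbe:
      httpGet:
        path: /-/healthy
        port: web
      periodSeconds: 5
      failureThreshold: 6     # can stay strict; only evaluated after startup succeeds
    readinessProbe:
      httpGet:
        path: /-/ready
        port: web
      periodSeconds: 5
      failureThreshold: 3
```

The kubelet does not run the liveness and readiness probes until the startupProbe has succeeded, so the startup budget (failureThreshold x periodSeconds) can be sized for the worst-case WAL replay without loosening the steady-state probes.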

Related Incident(s)

Originating issue(s): production#5466 (closed)

Desired Outcome/Acceptance criteria

- Start using the startupProbe and readinessProbe settings defined in 👉 https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14359#note_701160878
  - Enable it on gstg 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!511 (merged)
  - Enable it on ops, pre, org-ci 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!513 (merged)
  - Enable it on gprd 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!514 (merged)
- Update monitoring to check for pods that fail to start up or stay in CrashLoopBackOff for longer than 12 hours, and create an issue automatically 👉 gitlab-com/runbooks!4072 (merged)
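As a rough idea of the monitoring item above, a Prometheus alerting rule along these lines could flag a Prometheus container stuck in CrashLoopBackOff. This is a sketch, not the rule from the runbooks merge request: the group name, alert name, severity label, and `container` selector are hypothetical, and it assumes kube-state-metrics is being scraped. Automatic issue creation is handled by the alert routing and is not shown here.

```yaml
# Illustrative only: names and labels are hypothetical.
groups:
  - name: prometheus-startup.rules
    rules:
      - alert: PrometheusPodNotStarting
        # kube-state-metrics exposes 1 for the waiting reason a container is currently in.
        expr: |
          max by (namespace, pod) (
            kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff", container="prometheus"}
          ) == 1
        for: 12h
        labels:
          severity: s4        # assumed label value
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in CrashLoopBackOff for more than 12 hours"
```

Because a crash-looping container briefly leaves the waiting state on each restart attempt, the `for: 12h` clause may reset; a production rule would likely need to smooth the expression over a window to account for that.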

Associated Services

Service::Prometheus

Corrective Action Issue Checklist

link the incident(s) this corrective action arose out of
give context for what problem this corrective action is trying to prevent from re-occurring
assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
assign a priority (this will default to 'priority::4')

Edited Nov 09, 2021 by Steve Xuereb