Use startupProbe for Prometheus pods
## Summary

<!-- Give context for what problem this issue is trying to prevent from happening again. Provide a brief assessment of the risk (chance and impact) of the problem that this corrective action fixes, to assist with triage and prioritization. -->

Start using the `startupProbe` for our Prometheus pods, so that the `livenessProbe` and `readinessProbe` don't time out while a large WAL replay is in progress.

We tried [increasing the failure threshold for liveness/readiness](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/merge_requests/455), but this is not sufficient and results in thrashing of the containers.

`startupProbe` is not available to us in the current prometheus-operator; we are currently blocked by https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13973

## Related Incident(s)

<!-- Note the originating incident(s) and link known related incidents/other issues -->

Originating issue(s): https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5466

## Desired Outcome/Acceptance criteria

<!-- How will you know that this issue is complete? If you have any initial thoughts on implementation details e.g. what to do or not do, gotchas, edge cases etc, please share them while they are fresh in your mind. -->

- [x] Start using `startupProbe` and readiness as defined in :point_right: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14359#note_701160878
- [x] Enable it on `gstg` :point_right: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/merge_requests/511
- [x] Enable it on `ops`, `pre`, `org-ci` :point_right: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/merge_requests/513
- [x] Enable it on `gprd` :point_right: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/merge_requests/514
- [x] Update monitoring to check for pods that fail to start up or stay in crashloop backoff for longer than 12 hours, and create an issue automatically :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/4072

## Associated Services

<!-- Apply the appropriate services associated with this corrective action if applicable. ~Service::SERVICE_NAME -->

* ~"Service::Prometheus"

## Corrective Action Issue Checklist

* [x] link the incident(s) this corrective action arose out of
* [x] give context for what problem this corrective action is trying to prevent from re-occurring
* [x] assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
* [x] assign a priority (this will default to 'priority::4')
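
For reference, a minimal sketch of what a `startupProbe` on the Prometheus container could look like. This is not the configuration from the linked note or merge requests: the port, the periods, and the thresholds below are illustrative assumptions only. Kubernetes holds off liveness and readiness checks until the startup probe has succeeded, so the startup budget (`failureThreshold` × `periodSeconds`) should exceed the longest expected WAL replay.

```yaml
# Illustrative container-level probe settings; all values are assumptions,
# not the values deployed via gitlab-helmfiles.
startupProbe:
  httpGet:
    path: /-/ready        # Prometheus readiness endpoint
    port: 9090            # assumed: default Prometheus listen port
  periodSeconds: 15
  failureThreshold: 60    # 60 * 15s = 15 minutes allowed for startup / WAL replay
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  periodSeconds: 5
  failureThreshold: 3     # only evaluated after the startupProbe has succeeded
livenessProbe:
  httpGet:
    path: /-/healthy      # Prometheus health endpoint
    port: 9090
  periodSeconds: 5
  failureThreshold: 6     # only evaluated after the startupProbe has succeeded
```

With this shape the liveness and readiness thresholds can stay tight for steady-state operation, while the slow-start case is absorbed by the startup probe alone.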