feat(prometheus): enable startupProbe in pre, ops, org-ci
Background
Sometimes Prometheus has a large WAL to replay, which can take a long time. Because the liveness and readiness probes have short default thresholds, the Prometheus pod ends up being killed repeatedly before it has had time to finish replaying the WAL.
Solution
Introduce a startupProbe, which runs before every other probe, and wait for it to succeed before moving on to the readiness probe. The goal here is to give Prometheus enough time to replay a large WAL file.
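As a rough illustration of the idea, a startupProbe on the Prometheus container could look something like the sketch below. The port, interval, and threshold are assumptions chosen for the example, not the values applied by this change; the only thing relied on is that Prometheus exposes a `/-/ready` endpoint that does not report ready until startup has finished.

```yaml
# Illustrative sketch only; the values below are assumptions, not the ones set in this change.
startupProbe:
  httpGet:
    path: /-/ready        # Prometheus readiness endpoint
    port: 9090            # assumed: default Prometheus listen port
  periodSeconds: 15       # assumed: probe every 15 seconds
  failureThreshold: 60    # assumed: allow up to 60 failed checks before the container is restarted
```

With these example numbers the container would get up to 60 x 15s = 15 minutes to finish starting, and the liveness and readiness probes would only begin once the startupProbe has succeeded.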
To get more information about startupProbe and readinessProbe, run the following commands:
kubectl explain pod.spec.containers.startupProbe
kubectl explain pod.spec.containers.readinessProbe
Reference: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14359