feat(webservice): better readinessProbe defaults
What does this MR do?
What
Increase the frequency of the readiness probes so that webservice pods are removed from, and added back to, the service/endpoint more quickly.
- Increase the frequency of `readinessProbe`.
- Reduce the total number of failures required to consider a pod unhealthy.
Why
In https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497 we saw pods taking a long time to be removed from a service: if the `webservice` container fails, we end up waiting 30 seconds (3 failures * 10 second interval). With the new settings we will fail in 4 seconds (2 failures * 2 second interval). We've seen an improvement with this in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497#note_905993144, and the failure no longer triggers the GitLab.com SLO.

In gitlab-com/gl-infra/k8s-workloads/gitlab-com!1689 (comment 905961189) we see there was only a slight uptick in request rate on the `/-/readiness` endpoint and little to no increase in resource usage. Looking at the `HealthController`, we only use the default checks, which do a pipe read, so it seems safe to increase the frequency here.
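The detection-time arithmetic above (failures * interval) can be sketched as follows. This is only a rough illustration: the function name is hypothetical, the parameter names simply mirror the Kubernetes probe fields `failureThreshold` and `periodSeconds`, and the estimate ignores probe timeouts and the time it takes for the endpoint change to propagate.

```python
def seconds_until_unready(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time for a failing container to be marked unready.

    Assumes every probe fails immediately and ignores probe timeouts
    and endpoint propagation delay.
    """
    return failure_threshold * period_seconds


# Previous defaults described in this MR: 3 failures at a 10 second interval.
old_delay = seconds_until_unready(failure_threshold=3, period_seconds=10)

# New defaults proposed in this MR: 2 failures at a 2 second interval.
new_delay = seconds_until_unready(failure_threshold=2, period_seconds=2)

print(f"old: {old_delay}s, new: {new_delay}s")
```

The trade-off is that a shorter interval means more requests to `/-/readiness`, which is why the request-rate and resource observations above matter.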
Related issues
https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497
Checklist
See Definition of done.
For anything in this list which will not be completed, please provide a reason in the MR discussion.
Required
- Merge Request Title and Description are up to date, accurate, and descriptive
- MR targeting the appropriate branch
- MR has a green pipeline on GitLab.com
Expected (please provide an explanation if not completing)
- Test plan indicating conditions for success has been posted and passes
- Documentation created/updated
- Tests added
- Integration tests added to GitLab QA
- Equivalent MR/issue for omnibus-gitlab opened