feat(webservice): better readinessProbe defaults (!2518)

What does this MR do?

Increase the frequency of the readiness probes so that webservice pods are added to and removed from the service/endpoint more quickly.

In https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497 we saw pods taking a long time to be removed from a service: if the webservice container fails, we end up waiting 30 seconds (3 failures * 10 second interval) before the pod is marked unready. With the new settings the probe fails in 4 seconds (2 failures * 2 second interval). We've seen an improvement with this in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497#note_905993144, and it no longer triggers the GitLab.com SLO. In gitlab-com/gl-infra/k8s-workloads/gitlab-com!1689 (comment 905961189) we see there was only a slight uptick in request rate on the /-/readiness endpoint and little to no increase in resource usage. Looking at the HealthController, we only use the default checks, which do a pipe read, so it seems safe to increase the frequency here.
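As a rough illustration of the timing described above, the new settings could be expressed as a plain Kubernetes container probe like the sketch below. This is not the chart's actual values layout: only `periodSeconds`, `failureThreshold`, and the `/-/readiness` path come from this description, while the `httpGet` form and the port number are assumptions made for the example.

```yaml
# Sketch of a container readinessProbe using standard Kubernetes fields.
# Worst case before the pod is marked unready is roughly
# failureThreshold * periodSeconds = 2 * 2 = 4 seconds
# (previously 3 * 10 = 30 seconds).
readinessProbe:
  httpGet:
    path: /-/readiness   # endpoint named in this MR
    port: 8080           # assumed port, for illustration only
  periodSeconds: 2       # probe every 2 seconds (was 10)
  failureThreshold: 2    # mark unready after 2 consecutive failures (was 3)
```

The trade-off is more frequent requests to the readiness endpoint, which the linked comment suggests had little measurable cost.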


Edited Apr 22, 2022 by Jason Plum