feat(webservice): better readinessProbe defaults

Steve Xuereb requested to merge feat/increase-healthcheck into master

What does this MR do?

What

Increase the frequency of the readiness probes so that webservice pods are removed from, and added back to, the service endpoints more quickly.

  • Increase the frequency of readinessProbe.
  • Reduce the total number of failures required to consider a pod unhealthy.

Why

In https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497 we saw pods taking a long time to be removed from a service: if the webservice container fails, we end up waiting 30 seconds (3 failures * 10 second interval) before the pod is considered unready. With the new settings the pod fails readiness in 4 seconds (2 failures * 2 second interval). We saw an improvement with these settings in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497#note_905993144, and the failure no longer triggers the GitLab.com SLO.

In gitlab-com/gl-infra/k8s-workloads/gitlab-com!1689 (comment 905961189) we see only a slight uptick in request rate on the /-/readiness endpoint and little to no increase in resource usage. Looking at the HealthController, we only use the default checks, which do a pipe read, so it seems safe to increase the frequency here.
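For illustration, the timing described above could be expressed as a values override along these lines. This is a sketch only: the `gitlab.webservice.deployment.readinessProbe` key path is an assumption about where the chart exposes these settings, and the actual defaults changed by this MR may differ. `periodSeconds` and `failureThreshold` are the standard Kubernetes probe fields.

```yaml
# Hypothetical values override; the key path is assumed, not taken from this MR's diff.
gitlab:
  webservice:
    deployment:
      readinessProbe:
        periodSeconds: 2     # probe every 2 seconds (Kubernetes default is 10)
        failureThreshold: 2  # mark unready after 2 consecutive failures (Kubernetes default is 3)
```

With these values a failing container is marked unready after roughly 2 * 2 = 4 seconds, compared with 3 * 10 = 30 seconds under the Kubernetes defaults.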

Related issues

https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15497

Checklist

See Definition of done.

For anything in this list which will not be completed, please provide a reason in the MR discussion.

Required

  • Merge Request Title and Description are up to date, accurate, and descriptive
  • MR targeting the appropriate branch
  • MR has a green pipeline on GitLab.com

Expected (please provide an explanation if not completing)

  • Test plan indicating conditions for success has been posted and passes
  • Documentation created/updated
  • Tests added
  • Integration tests added to GitLab QA
  • Equivalent MR/issue for omnibus-gitlab opened
Edited by Jason Plum
