Add more robust Puma readiness check
What does this MR do?
This merge request enables the Puma control app, which listens on a separate
port (9293 by default and configurable via PUMA_CONTROL_PORT
) in the
Puma master process. This endpoint provides data about the health of the
worker processes as well as the thread status.
The current readiness endpoint, /-/readiness
, may be slow
to respond if Puma is busy, so using a separate endpoint that
has insight into the status of the threads and processes may
be more robust and avoid flapping readiness failures.
In addition, this merge request introduces a puma-health-check.rb
readiness script that:
- Checks the status of the control app.
- If all passes, then makes a request to
/-/readiness
. Slow requests (> 1 s) are allowed for up to 60 seconds to tolerate busy workers.
With the migration to Cilium in GKE Dataplane V2, failing a readiness check may cause existing TCP connections to be severed. We need to be sure that the pod is really in a bad state for this to happen.
Related issues
Relates to gitlab-com/gl-infra/production#20469
How to validate locally
I applied this diff to the Helm Chart:
diff --git a/charts/gitlab/charts/webservice/templates/deployment.yaml b/charts/gitlab/charts/webservice/templates/deployment.yaml
index 1ca48e527..c458827ff 100644
--- a/charts/gitlab/charts/webservice/templates/deployment.yaml
+++ b/charts/gitlab/charts/webservice/templates/deployment.yaml
@@ -329,14 +329,8 @@ spec:
successThreshold: {{ .deployment.livenessProbe.successThreshold }}
failureThreshold: {{ .deployment.livenessProbe.failureThreshold }}
readinessProbe:
- httpGet:
- path: {{ $.Values.global.appConfig.relativeUrlRoot }}/-/readiness
- {{- if and (not $.Values.http.enabled) $.Values.tls.enabled }}
- scheme: HTTPS
- port: {{ $.Values.service.tls.internalPort }}
- {{- else }}
- port: {{ $.Values.service.internalPort }}
- {{- end }}
+ exec:
+ command: ["/scripts/puma-health-check.rb"]
initialDelaySeconds: {{ .deployment.readinessProbe.initialDelaySeconds }}
periodSeconds: {{ .deployment.readinessProbe.periodSeconds }}
timeoutSeconds: {{ .deployment.readinessProbe.timeoutSeconds }}
diff --git a/charts/gitlab/charts/webservice/values.yaml b/charts/gitlab/charts/webservice/values.yaml
index 34c55e0e9..3efa65a70 100644
--- a/charts/gitlab/charts/webservice/values.yaml
+++ b/charts/gitlab/charts/webservice/values.yaml
@@ -197,7 +197,7 @@ deployment:
readinessProbe:
initialDelaySeconds: 0
periodSeconds: 5
- timeoutSeconds: 2
+ timeoutSeconds: 3
successThreshold: 1
failureThreshold: 2
Then I ran
Checklist
See Definition of done.
For anything in this list which will not be completed, please provide a reason in the MR discussion
Required
-
Merge Request Title, and Description are up to date, accurate, and descriptive -
MR targeting the appropriate branch -
MR has a green pipeline on GitLab.com -
When ready for review, MR is labeled "~workflow::ready for review" per the Distribution MR workflow
Expected (please provide an explanation if not completing)
-
Test plan indicating conditions for success has been posted and passes -
Documentation created/updated -
Integration tests added to GitLab QA -
The impact any change in container size has should be evaluated -
New dependencies are managed with GitLab forked renovatebot