Skip to content

Add more robust Puma readiness check

What does this MR do?

This merge request enables the Puma control app, which listens on a separate port (9293 by default and configurable via PUMA_CONTROL_PORT) in the Puma master process. This endpoint provides data about the health of the worker processes as well as the thread status.

The current readiness endpoint, /-/readiness, may be slow to respond if Puma is busy, so using a separate endpoint that has insight into the status of the threads and processes may be more robust and avoid flapping readiness failures.

In addition, this merge request introduces a puma-health-check.rb readiness script that:

  1. Checks the status of the control app.
  2. If all passes, then makes a request to /-/readiness. Slow requests (> 1 s) are allowed for up to 60 seconds to tolerate busy workers.

With the migration to Cilium in GKE Dataplane V2, failing a readiness check may cause existing TCP connections to be severed. We need to be sure that the pod is really in a bad state for this to happen.

Related issues

Relates to gitlab-com/gl-infra/production#20469

How to validate locally

I applied this diff to the Helm Chart:

diff --git a/charts/gitlab/charts/webservice/templates/deployment.yaml b/charts/gitlab/charts/webservice/templates/deployment.yaml
index 1ca48e527..c458827ff 100644
--- a/charts/gitlab/charts/webservice/templates/deployment.yaml
+++ b/charts/gitlab/charts/webservice/templates/deployment.yaml
@@ -329,14 +329,8 @@ spec:
             successThreshold: {{ .deployment.livenessProbe.successThreshold }}
             failureThreshold: {{ .deployment.livenessProbe.failureThreshold }}
           readinessProbe:
-            httpGet:
-              path: {{ $.Values.global.appConfig.relativeUrlRoot }}/-/readiness
-              {{- if and (not $.Values.http.enabled) $.Values.tls.enabled }}
-              scheme: HTTPS
-              port: {{ $.Values.service.tls.internalPort }}
-              {{- else }}
-              port: {{ $.Values.service.internalPort }}
-              {{- end }}
+            exec:
+              command: ["/scripts/puma-health-check.rb"]
             initialDelaySeconds: {{ .deployment.readinessProbe.initialDelaySeconds }}
             periodSeconds: {{ .deployment.readinessProbe.periodSeconds }}
             timeoutSeconds: {{ .deployment.readinessProbe.timeoutSeconds }}
diff --git a/charts/gitlab/charts/webservice/values.yaml b/charts/gitlab/charts/webservice/values.yaml
index 34c55e0e9..3efa65a70 100644
--- a/charts/gitlab/charts/webservice/values.yaml
+++ b/charts/gitlab/charts/webservice/values.yaml
@@ -197,7 +197,7 @@ deployment:
   readinessProbe:
     initialDelaySeconds: 0
     periodSeconds: 5
-    timeoutSeconds: 2
+    timeoutSeconds: 3
     successThreshold: 1
     failureThreshold: 2

Then I ran

Checklist

See Definition of done.

For anything in this list which will not be completed, please provide a reason in the MR discussion

Required

  • Merge Request Title, and Description are up to date, accurate, and descriptive
  • MR targeting the appropriate branch
  • MR has a green pipeline on GitLab.com
  • When ready for review, MR is labeled "~workflow::ready for review" per the Distribution MR workflow

Expected (please provide an explanation if not completing)

  • Test plan indicating conditions for success has been posted and passes
  • Documentation created/updated
  • Integration tests added to GitLab QA
  • The impact any change in container size has should be evaluated
  • New dependencies are managed with GitLab forked renovatebot
Edited by Stan Hu

Merge request reports

Loading