Webservice readiness probe returns 503 during shutdown

Summary

An customer has reported seeing HTTP 503 from readiness / liveness checks

The Gitlab Webservice deployment pods have been failing readiness checks, and are being killed for posting as unhealthy, at least as frequently as once per hour for the past few weeks (including pre-upgrade to Gitlab 15.1.2, when we were on Gitlab 15.1.0). CPU and Memory utilization during the time of failures does not show as being in excess (or even halfway) to the limits we've set, and little to no CPU throttling is occurring (see first attachment depicting usage for one such pod for details).

All we have to go off of are the logs from the webservice and workhorse containers that I will post below, with no smoking gun indicating as to what is triggering the {"error":null,"level":"fatal","msg":"shutting down","time":"2022-07-25T22:15:09Z"} message that we're getting that seemingly results in the failed readiness checks, which are also accompanied by events such as the following from Kubernetes:

38m Warning Unhealthy pod/gitlab-webservice-default-58d8546776-zt8sc Readiness probe failed: HTTP probe failed with statuscode: 503

38m Warning Unhealthy pod/gitlab-webservice-default-58d8546776-zt8sc Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: container is in CONTAINER_EXITED state

We have narrowed the cause of the pod's termination event stemming from their HPA, which is not customized, and is behaving as currently defined.

Steps to reproduce

  1. Install an instance
  2. Configure stern to watch for the Pod logs, possible refined to the webservice container
  3. Initiate a scale-down event, or manually remove a Pod
  4. See logs

Configuration used

For replication, default out of box are sufficient.

Current behavior

Readiness probe throws 503 from Webservice in the middle of shutdown

Expected behavior

Not 503?

Versions

  • Chart: v6.1.0+
  • Platform: Unknown
  • Kubernetes: v1.20.13
  • Helm: Unknown

Relevant logs

## curl on repeat
{"component": "gitlab","subcomponent":"production_json","level":"info","method":"HEAD","path":"/","format":"*/*","controller":"RootController","action":"index","status":302,"location":"https://gitlab.separate-containers.party/users/sign_in","time":"2022-07-27T13:46:21.777Z","params":[],"correlation_id":"01G8ZYH3P1KF2BDX1GDS2JX22T","meta.client_id":"ip/REDACTED","meta.caller_id":"RootController#index","meta.remote_ip":"REDACTED","meta.feature_category":"projects","remote_ip":"REDACTED","ua":"curl/7.84.0","queue_duration_s":0.006108,"request_urgency":"low","target_duration_s":5,"db_count":0,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":0,"db_main_count":0,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.0,"db_main_duration_s":0.0,"db_main_replica_duration_s":0.0,"cpu_s":0.014019,"mem_objects":8172,"mem_bytes":266392,"mem_mallocs":940,"mem_total_bytes":593272,"pid":40,"worker_id":"puma_0","rate_limiting_gates":[],"db_duration_s":0.0,"view_duration_s":0.0,"duration_s":0.00365}

## termination signal has been received
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:42:44.962Z","pid":1,"message":"- Worker 1 (PID: 42) booted in 0.69s, phase: 0"}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:22.101Z","pid":1,"message":"=== puma shutdown: 2022-07-27 13:46:22 +0000 ==="}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:22.102Z","pid":1,"message":"- Goodbye!"}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:22.101Z","pid":1,"message":"=== puma shutdown: 2022-07-27 13:46:22 +0000 ==="}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:22.102Z","pid":1,"message":"- Goodbye!"}

## rediness probe
{"component": "gitlab","subcomponent":"production_json","level":"info","method":"GET","path":"/-/readiness","format":"*/*","controller":"HealthController","action":"readiness","status":503,"time":"2022-07-27T13:46:24.386Z","params":[],"db_count":0,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":0,"db_main_count":0,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.0,"db_main_duration_s":0.0,"db_main_replica_duration_s":0.0,"cpu_s":0.006068,"mem_objects":1312,"mem_bytes":45544,"mem_mallocs":236,"mem_total_bytes":98024,"pid":40,"worker_id":"puma_0","rate_limiting_gates":[],"correlation_id":"b9d15387-9663-48b5-bf32-5e7446802abe","db_duration_s":0.0,"view_duration_s":0.00018,"duration_s":0.00078}
{"component": "gitlab","subcomponent":"production_json","level":"info","method":"GET","path":"/-/readiness","format":"*/*","controller":"HealthController","action":"readiness","status":503,"time":"2022-07-27T13:46:29.385Z","params":[],"db_count":0,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":0,"db_main_count":0,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.0,"db_main_duration_s":0.0,"db_main_replica_duration_s":0.0,"cpu_s":0.005491,"mem_objects":1311,"mem_bytes":45544,"mem_mallocs":236,"mem_total_bytes":97984,"pid":40,"worker_id":"puma_0","rate_limiting_gates":[],"correlation_id":"fc8411c8-d1a2-4d8b-ba7a-265e266f8096","db_duration_s":0.0,"view_duration_s":0.00019,"duration_s":0.00079}

## final component shutdown & PID 1 exit
{"component": "gitlab","subcomponent":"web_exporter","level":"info","time":"2022-07-27T13:46:32Z","message":"[2022-07-27T13:46:32.103+0000] INFO  going to shutdown ..."}
{"component": "gitlab","subcomponent":"web_exporter","level":"info","time":"2022-07-27T13:46:32Z","message":"[2022-07-27T13:46:32.104+0000] INFO  WEBrick::HTTPServer#start done."}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:33.105Z","pid":1,"message":"- Gracefully shutting down workers..."}