Webservice readiness probe returns 503 during shutdown
Summary
An customer has reported seeing HTTP 503 from readiness / liveness checks
The Gitlab Webservice deployment pods have been failing readiness checks, and are being killed for posting as unhealthy, at least as frequently as once per hour for the past few weeks (including pre-upgrade to Gitlab 15.1.2, when we were on Gitlab 15.1.0). CPU and Memory utilization during the time of failures does not show as being in excess (or even halfway) to the limits we've set, and little to no CPU throttling is occurring (see first attachment depicting usage for one such pod for details).
All we have to go off of are the logs from the webservice and workhorse containers that I will post below, with no smoking gun indicating as to what is triggering the
{"error":null,"level":"fatal","msg":"shutting down","time":"2022-07-25T22:15:09Z"}message that we're getting that seemingly results in the failed readiness checks, which are also accompanied by events such as the following from Kubernetes:
38m Warning Unhealthy pod/gitlab-webservice-default-58d8546776-zt8sc Readiness probe failed: HTTP probe failed with statuscode: 503
38m Warning Unhealthy pod/gitlab-webservice-default-58d8546776-zt8sc Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: container is in CONTAINER_EXITED state
We have narrowed the cause of the pod's termination event stemming from their HPA, which is not customized, and is behaving as currently defined.
Steps to reproduce
- Install an instance
- Configure
sternto watch for the Pod logs, possible refined to thewebservicecontainer - Initiate a scale-down event, or manually remove a Pod
- See logs
Configuration used
For replication, default out of box are sufficient.
Current behavior
Readiness probe throws 503 from Webservice in the middle of shutdown
Expected behavior
Not 503?
Versions
- Chart: v6.1.0+
- Platform: Unknown
- Kubernetes: v1.20.13
- Helm: Unknown
Relevant logs
## curl on repeat
{"component": "gitlab","subcomponent":"production_json","level":"info","method":"HEAD","path":"/","format":"*/*","controller":"RootController","action":"index","status":302,"location":"https://gitlab.separate-containers.party/users/sign_in","time":"2022-07-27T13:46:21.777Z","params":[],"correlation_id":"01G8ZYH3P1KF2BDX1GDS2JX22T","meta.client_id":"ip/REDACTED","meta.caller_id":"RootController#index","meta.remote_ip":"REDACTED","meta.feature_category":"projects","remote_ip":"REDACTED","ua":"curl/7.84.0","queue_duration_s":0.006108,"request_urgency":"low","target_duration_s":5,"db_count":0,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":0,"db_main_count":0,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.0,"db_main_duration_s":0.0,"db_main_replica_duration_s":0.0,"cpu_s":0.014019,"mem_objects":8172,"mem_bytes":266392,"mem_mallocs":940,"mem_total_bytes":593272,"pid":40,"worker_id":"puma_0","rate_limiting_gates":[],"db_duration_s":0.0,"view_duration_s":0.0,"duration_s":0.00365}
## termination signal has been received
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:42:44.962Z","pid":1,"message":"- Worker 1 (PID: 42) booted in 0.69s, phase: 0"}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:22.101Z","pid":1,"message":"=== puma shutdown: 2022-07-27 13:46:22 +0000 ==="}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:22.102Z","pid":1,"message":"- Goodbye!"}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:22.101Z","pid":1,"message":"=== puma shutdown: 2022-07-27 13:46:22 +0000 ==="}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:22.102Z","pid":1,"message":"- Goodbye!"}
## rediness probe
{"component": "gitlab","subcomponent":"production_json","level":"info","method":"GET","path":"/-/readiness","format":"*/*","controller":"HealthController","action":"readiness","status":503,"time":"2022-07-27T13:46:24.386Z","params":[],"db_count":0,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":0,"db_main_count":0,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.0,"db_main_duration_s":0.0,"db_main_replica_duration_s":0.0,"cpu_s":0.006068,"mem_objects":1312,"mem_bytes":45544,"mem_mallocs":236,"mem_total_bytes":98024,"pid":40,"worker_id":"puma_0","rate_limiting_gates":[],"correlation_id":"b9d15387-9663-48b5-bf32-5e7446802abe","db_duration_s":0.0,"view_duration_s":0.00018,"duration_s":0.00078}
{"component": "gitlab","subcomponent":"production_json","level":"info","method":"GET","path":"/-/readiness","format":"*/*","controller":"HealthController","action":"readiness","status":503,"time":"2022-07-27T13:46:29.385Z","params":[],"db_count":0,"db_write_count":0,"db_cached_count":0,"db_replica_count":0,"db_primary_count":0,"db_main_count":0,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":0,"db_main_cached_count":0,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.0,"db_main_duration_s":0.0,"db_main_replica_duration_s":0.0,"cpu_s":0.005491,"mem_objects":1311,"mem_bytes":45544,"mem_mallocs":236,"mem_total_bytes":97984,"pid":40,"worker_id":"puma_0","rate_limiting_gates":[],"correlation_id":"fc8411c8-d1a2-4d8b-ba7a-265e266f8096","db_duration_s":0.0,"view_duration_s":0.00019,"duration_s":0.00079}
## final component shutdown & PID 1 exit
{"component": "gitlab","subcomponent":"web_exporter","level":"info","time":"2022-07-27T13:46:32Z","message":"[2022-07-27T13:46:32.103+0000] INFO going to shutdown ..."}
{"component": "gitlab","subcomponent":"web_exporter","level":"info","time":"2022-07-27T13:46:32Z","message":"[2022-07-27T13:46:32.104+0000] INFO WEBrick::HTTPServer#start done."}
{"component": "gitlab","subcomponent":"puma.stdout","level":"unknown","timestamp":"2022-07-27T13:46:33.105Z","pid":1,"message":"- Gracefully shutting down workers..."}