Container Registry goes into Ready state when misconfigured
Summary
During a troubleshooting session, I discovered that the Container Registry was misconfigured, but the service was still allowed to enter a Ready state. This should not have been allowed to happen. At some point, something triggers the Container Registry to check the bucket configuration, and from that point on the service returns an HTTP 503. Because Kubernetes initially considers the service ready, the helm install completes successfully when in reality it should fail. If one uses the --atomic flag during a helm upgrade, this failure is missed, and only after Kubernetes takes the service down do we see an issue and can start investigating. This is disruptive: the misconfiguration should be caught during the upgrade or configuration change, not a few minutes afterwards.
Steps to reproduce
- Perform a minimalist installation of the Container Registry using the GitLab helm chart.
- Purposely misconfigure the storage bucket to be used
- Observe the upgrade proceed successfully
- After roughly 30 seconds the service will return an HTTP 503, to which Kubernetes will eventually react
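A minimal reproduction along these lines might look as follows. The bucket name is deliberately wrong, and the release name, secret name, and chart values are illustrative (based on the GitLab chart's registry storage settings), not the exact ones from this session:

```shell
# Point the registry at a GCS bucket that does not exist. This file uses
# the registry's storage driver configuration layout; the bucket name is
# deliberately wrong.
cat > registry-storage.yaml <<'EOF'
storage:
  gcs:
    bucket: this-bucket-does-not-exist
EOF

# Load it as the chart's registry storage secret and upgrade. With the
# current behavior, --atomic still reports success; the 503s only start
# once the registry first touches the bucket:
#
#   kubectl create secret generic registry-storage \
#     --from-file=config=registry-storage.yaml
#   helm upgrade gitlab gitlab/gitlab \
#     --set registry.storage.secret=registry-storage \
#     --set registry.storage.key=config \
#     --atomic
```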
What is the current bug behavior?
The Container Registry service goes down.
What is the expected correct behavior?
The misconfiguration should be detected during the startup process of the service, which would signal a problem with the deployment. If one uses the --atomic flag during helm upgrades, helm would then detect that the service is not ready and attempt a rollback.
Relevant logs and/or screenshots
This is made a little difficult because the health checks hit the :5001/debug/health endpoint, and those requests are not written to the application logs.
The below is an attempt at viewing the behavior as soon as possible. As seen below, we pass the first readiness probe and the Pod starts to receive traffic, serving a 200 to the health checks. Roughly 20 seconds later, we start to serve a 503. kubectl port-forward was used for the testing below:
% curl -i localhost:5001/debug/health
HTTP/1.1 200 OK
Content-Length: 2
Content-Type: application/json; charset=utf-8
Date: Fri, 21 Feb 2020 21:49:19 GMT
{}
% curl -i localhost:5001/debug/health
HTTP/1.1 503 Service Unavailable
Content-Length: 58
Content-Type: application/json; charset=utf-8
Date: Fri, 21 Feb 2020 21:49:41 GMT
{"storagedriver_gcs":"gcs: storage: bucket doesn't exist"}%
Warning Unhealthy 3m3s (x3 over 3m23s) kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r Liveness probe failed: HTTP probe failed with statuscode: 503
Normal Killing 3m3s kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r Container registry failed liveness probe, will be restarted
Warning Unhealthy 2m33s (x12 over 3m28s) kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r Readiness probe failed: HTTP probe failed with statuscode: 503
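Probe-failure events like the ones above can be pulled directly from the cluster. A hypothetical helper (the pod name in the usage line is illustrative):

```shell
# List Unhealthy-probe events for a given pod, with timestamps, using a
# field selector on the event's involved object and reason.
registry_probe_events() {
  kubectl get events \
    --field-selector "involvedObject.name=$1,reason=Unhealthy" \
    -o custom-columns=TIME:.lastTimestamp,MESSAGE:.message
}

# Usage (pod name is illustrative):
# registry_probe_events gitlab-registry-abc123
```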
This can occasionally be seen by external systems hitting the root endpoint of the Container Registry.
From a technical standpoint, none of these requests should have made it through to the Container Registry, as the service should have been marked unhealthy and stuck in a CrashLoop.
Output of checks
This bug has the potential to impact GitLab.com
