Container Registry goes into Ready state when misconfigured
Summary
During a troubleshooting session, I discovered that the Container Registry was misconfigured, but the service was still allowed to enter a Ready state. This should not have been allowed to happen. At some point, something triggers the Container Registry to check the bucket configuration, and from that point on the service returns an HTTP 503. Because Kubernetes initially considers the service ready, the helm install completes successfully when in reality it should fail. If one uses the --atomic flag during a helm upgrade, this failure is missed, and only after Kubernetes takes the service down do we see an issue and can start investigating. This is disruptive: the misconfiguration should be caught during the upgrade or configuration change, not a few minutes afterwards.
Steps to reproduce
- Perform a minimalist installation of the Container Registry using the GitLab helm chart.
- Purposely misconfigure the storage bucket to be used
- Observe the upgrade proceed successfully
- After roughly 30 seconds the service will return an HTTP 503, to which Kubernetes will eventually react
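A minimal reproduction along these lines might look as follows. The bucket name is deliberately wrong, and the release name, secret name, and chart values are illustrative (based on the GitLab chart's registry storage settings), not the exact ones from this session:

```shell
# Point the registry at a GCS bucket that does not exist. This file uses
# the registry's storage driver configuration layout; the bucket name is
# deliberately wrong.
cat > registry-storage.yaml <<'EOF'
storage:
  gcs:
    bucket: this-bucket-does-not-exist
EOF

# Load it as the chart's registry storage secret and upgrade. With the
# current behavior, --atomic still reports success; the 503s only start
# once the registry first touches the bucket:
#
#   kubectl create secret generic registry-storage \
#     --from-file=config=registry-storage.yaml
#   helm upgrade gitlab gitlab/gitlab \
#     --set registry.storage.secret=registry-storage \
#     --set registry.storage.key=config \
#     --atomic
```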
What is the current bug behavior?
The Container Registry service goes down.
What is the expected correct behavior?
The misconfiguration should be detected during the startup process of the service, which would signal a problem with the deployment. If one uses the --atomic flag during helm upgrades, helm would then detect that the service is not ready and attempt a rollback.
Relevant logs and/or screenshots
This is made a little difficult because the health checks hit the :5001/debug/health endpoint, and those requests are not written to the application logs.
The below is an attempt at viewing the behavior as soon as possible. As seen below, we pass the first readiness probe and the Pod starts to receive traffic, serving a 200 to the health checks. Roughly 20 seconds later, we start to serve a 503. kubectl port-forward was used for the testing below:
% curl -i localhost:5001/debug/health
HTTP/1.1 200 OK
Content-Length: 2
Content-Type: application/json; charset=utf-8
Date: Fri, 21 Feb 2020 21:49:19 GMT
{}
% curl -i localhost:5001/debug/health
HTTP/1.1 503 Service Unavailable
Content-Length: 58
Content-Type: application/json; charset=utf-8
Date: Fri, 21 Feb 2020 21:49:41 GMT
{"storagedriver_gcs":"gcs: storage: bucket doesn't exist"}%
Warning Unhealthy 3m3s (x3 over 3m23s) kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r Liveness probe failed: HTTP probe failed with statuscode: 503
Normal Killing 3m3s kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r Container registry failed liveness probe, will be restarted
Warning Unhealthy 2m33s (x12 over 3m28s) kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r Readiness probe failed: HTTP probe failed with statuscode: 503
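Probe-failure events like the ones above can be pulled directly from the cluster. A hypothetical helper (the pod name in the usage line is illustrative):

```shell
# List Unhealthy-probe events for a given pod, with timestamps, using a
# field selector on the event's involved object and reason.
registry_probe_events() {
  kubectl get events \
    --field-selector "involvedObject.name=$1,reason=Unhealthy" \
    -o custom-columns=TIME:.lastTimestamp,MESSAGE:.message
}

# Usage (pod name is illustrative):
# registry_probe_events gitlab-registry-abc123
```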
This can occasionally be seen by external systems hitting the root endpoint of the Container Registry.
From a technical standpoint, none of these requests should have made it through to the Container Registry, as the service should have been marked unhealthy and stuck in a CrashLoop.
Output of checks
This bug has the potential to impact GitLab.com
