Container Registry goes into Ready state when misconfigured
### Summary

During a troubleshooting session, I discovered that the Container Registry was misconfigured, yet the service was still allowed to enter a Ready state. This should not happen. At some point, something triggers the Container Registry to check the bucket configuration, and from then on the service returns an HTTP 503. Kubernetes therefore believes the service is ready, so the `helm install` completes successfully when in reality it should fail. If the `--atomic` flag is used during a `helm upgrade`, this failure is missed, and only after Kubernetes takes the service down do we see a problem and can start investigating. This is disruptive; the failure should be caught during the upgrade or configuration change rather than a few minutes afterwards.

### Steps to reproduce

* Perform a minimal installation of the Container Registry using the GitLab Helm chart.
* Deliberately misconfigure the storage bucket to be used.
* Observe the upgrade proceed successfully.
* After roughly 30 seconds the service will return an HTTP 503, to which Kubernetes will eventually react.

### What is the current *bug* behavior?

The Container Registry service goes down.

### What is the expected *correct* behavior?

The failed configuration should be detected during the startup process of the service. This provides the ability to signal a problem with the deployment. If the `--atomic` flag is used during a `helm upgrade`, Helm will detect that the service is not ready and attempt a rollback.

### Relevant logs and/or screenshots

Capturing this is made a little difficult because the health checks hit the `:5001/debug/health` endpoint, and those requests are not sent to the application logs. The output below is an attempt at viewing logs as soon as possible. As seen below, we pass the first readiness probe and the Pod starts receiving traffic, serving a 200 to the health checks. At some point, roughly 20 seconds later, we start serving a 503.
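The misconfiguration in the reproduction steps can be sketched as a registry storage stanza pointing at a bucket that does not exist. This is an illustrative fragment only: the bucket name and keyfile path are invented, and the exact secret layout the chart expects should be checked against the chart documentation.

```yaml
# Hypothetical contents of the storage config referenced by the chart's
# registry storage secret. The bucket below intentionally does not exist.
gcs:
  bucket: no-such-bucket-example   # deliberately wrong for reproduction
  keyfile: /etc/docker/registry/storage/keyfile
```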
Using `kubectl port-forward` for the testing below:

```
% curl -i localhost:5001/debug/health
HTTP/1.1 200 OK
Content-Length: 2
Content-Type: application/json; charset=utf-8
Date: Fri, 21 Feb 2020 21:49:19 GMT

{}
% curl -i localhost:5001/debug/health
HTTP/1.1 503 Service Unavailable
Content-Length: 58
Content-Type: application/json; charset=utf-8
Date: Fri, 21 Feb 2020 21:49:41 GMT

{"storagedriver_gcs":"gcs: storage: bucket doesn't exist"}%
```

```
Warning  Unhealthy  3m3s (x3 over 3m23s)    kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r  Liveness probe failed: HTTP probe failed with statuscode: 503
Normal   Killing    3m3s                    kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r  Container registry failed liveness probe, will be restarted
Warning  Unhealthy  2m33s (x12 over 3m28s)  kubelet, gke-pre-gitlab-gke-node-pool-0-715a725e-401r  Readiness probe failed: HTTP probe failed with statuscode: 503
```

This can occasionally be seen by external systems hitting the root endpoint of the Container Registry:

![image](/uploads/821f00f25ae97edbc76c64df462ef80f/image.png)

From a technical standpoint, none of these requests should have reached the Container Registry, as the service should have been marked unhealthy and stuck in a CrashLoop.

### Output of checks

This bug has the potential to impact GitLab.com
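The expected behavior, validating the storage configuration at startup rather than on a later health check, could be sketched roughly as follows. This is a hypothetical illustration, not the registry's actual code: `StorageDriver`, `missingBucket`, and `validateStorageOnStartup` are invented names, and the real service would probe its configured driver rather than a stub.

```go
package main

import (
	"errors"
	"fmt"
)

// StorageDriver is a hypothetical stand-in for the registry's storage
// driver interface; only the part needed for a startup check is shown.
type StorageDriver interface {
	// Stat probes the backing bucket and returns an error if it is
	// missing or unreachable.
	Stat(path string) error
}

// missingBucket simulates a driver pointed at a nonexistent GCS bucket,
// mirroring the error seen in the health-check output above.
type missingBucket struct{}

func (missingBucket) Stat(path string) error {
	return errors.New("gcs: storage: bucket doesn't exist")
}

// validateStorageOnStartup fails fast instead of letting the service
// report Ready and only surface the error on a later health check.
func validateStorageOnStartup(d StorageDriver) error {
	if err := d.Stat("/"); err != nil {
		return fmt.Errorf("storage validation failed: %w", err)
	}
	return nil
}

func main() {
	if err := validateStorageOnStartup(missingBucket{}); err != nil {
		// In the real service this is where the process would exit
		// non-zero, leaving the Pod NotReady so that Helm's
		// --atomic flag could detect the failure and roll back.
		fmt.Println(err)
	}
}
```

With a check like this, the Pod never passes its first readiness probe, so the misconfiguration surfaces during the upgrade rather than roughly 30 seconds after it completes.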