The Container Registry needs improvements to error handling

Problem Statement

High rates of HTTP 500 responses from the Container Registry are putting undue stress on both the Support and Infrastructure teams. The most common 500 we see carries the error "invalid checksum digest format". Nearly every time this happens, the cause is some corruption of the image in cloud storage. The Infrastructure team has tried to capture each occurrence:

Over time, we've adjusted our alerts to reduce alert fatigue:

gitlab-com/runbooks!1486 (merged) - change how we calculate the alert
gitlab-com/runbooks!1206 (merged) - alerts when exceeding 20% error ratio
gitlab-com/runbooks!1193 (merged) - alerts when exceeding 10% error ratio
gitlab-com/runbooks!1148 (merged) - alerts when exceeding 5% error ratio over the course of 5 minutes
gitlab-com/runbooks!1146 (merged) - alerts when exceeding 50% error ratio
gitlab-com/runbooks!665 (merged) - introduction of the 5xx alert, fires when we see any 5xx over the course of 1 minute
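The thresholds in the list above are all expressed as a ratio of 5xx responses to total requests over a time window. As a rough illustration only, the check can be sketched as follows. The real alert rules live in the runbooks merge requests listed above; the names WindowCounts, error_ratio, and should_alert are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts observed in one evaluation window (hypothetical type)."""
    total: int       # all requests seen in the window
    errors_5xx: int  # responses with a status in the 500-599 range


def error_ratio(counts: WindowCounts) -> float:
    """Return the fraction of requests in the window that ended in a 5xx."""
    if counts.total == 0:
        # No traffic means there is no ratio to exceed.
        return 0.0
    return counts.errors_5xx / counts.total


def should_alert(counts: WindowCounts, threshold: float) -> bool:
    """True when the 5xx ratio is strictly above the threshold (0.05 means 5%)."""
    return error_ratio(counts) > threshold
```

With a 5% threshold, a window of 1000 requests containing 60 errors would trigger the alert, while the same window with 40 errors would not. A ratio-based rule like this is less noisy than firing on any single 5xx, which is the direction the threshold changes above moved in.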

In some cases, the issue is bad enough that we end up blocking end users' requests:

We've also investigated this to the best of our abilities each time we've been paged:

Despite our efforts, we continue to exceed our desired SLO on a regular basis:

Source

Reach

As seen in the chart below, a few other issues also lead to 500s being shown to our users:

Source

The top two issues are the following:

invalid checksum digest format
unexpected end of JSON input

Note that this impacts only a few customers at a time, but which customers are impacted varies from day to day.
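To make the two error messages above more concrete, here is a minimal sketch of the kind of checks a scan over stored content might run. It assumes the messages correspond to a malformed digest string and to an empty or truncated JSON document respectively, and it assumes sha256 digests written as "sha256:" followed by 64 lowercase hex characters. The function names are hypothetical, and this is not the Container Registry's own validation logic.

```python
import json
import re

# Assumption: only sha256 digests, written as "sha256:" + 64 lowercase hex chars.
SHA256_DIGEST = re.compile(r"sha256:[a-f0-9]{64}")


def is_valid_digest(value: str) -> bool:
    """Return True when the whole string matches the assumed digest format."""
    return SHA256_DIGEST.fullmatch(value) is not None


def is_complete_json(payload: bytes) -> bool:
    """Return True when the payload parses as a complete JSON document."""
    try:
        json.loads(payload)
    except ValueError:
        # Covers empty, truncated, or otherwise malformed input,
        # as well as bytes that cannot be decoded as text.
        return False
    return True
```

An empty or cut-off object in storage would fail one of these checks before a request ever reaches it, which is the kind of early detection that could turn a 500 into a clearer, actionable error.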

Impact

This negatively impacts the on-call SRE and Support Engineers, who must perform the following for every page from this alert:

Dig into the logs to identify the image or end user at the root of the errors
Reach out to Support and ask them to inform the project owner of the situation
Remove the corrupted item from object storage (performed by the on-call SRE)

All of this takes a lot of time and effort spread across two teams.
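The first of the steps above, digging through logs to find the affected image, is the most mechanical one. A minimal sketch of it follows, assuming the logs are available as JSON lines. The field names status, error, and repository are hypothetical and would need to be replaced with whatever the real log schema uses.

```python
import json
from collections import Counter


def affected_repositories(log_lines, error_text):
    """Count 500 responses containing error_text, grouped by repository.

    Assumes each line is a JSON object with hypothetical fields
    "status", "error", and "repository". Unparseable lines are skipped.
    """
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # not JSON, ignore
        if not isinstance(entry, dict):
            continue
        if entry.get("status") != 500:
            continue
        if error_text not in str(entry.get("error", "")):
            continue
        counts[entry.get("repository", "<unknown>")] += 1
    # Most frequently failing repositories first.
    return counts.most_common()
```

Calling affected_repositories(lines, "invalid checksum digest format") would return a list of (repository, count) pairs, giving the on-call engineer a starting point for the follow-up with Support.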

Confidence

We have high confidence that this is a repeatable issue that, currently, does not look like it is going away. We also store X amount of data in our object storage, and with increased usage of the registry driven by the adoption of Kubernetes, AutoDevops, and the general growth of GitLab.com, this problem will only get worse.

Effort

This will continue to involve the Infrastructure team until we resolve the underlying issue of storing corrupted images, because this team is the frontline for alerts from the Container Registry service. The Support team will also continue to be involved, as we reach out to customers to help resolve each occurrence. It would be wise to involve the Package team to investigate in depth how we can prevent corrupt images, and to make the necessary changes to the Container Registry so that this issue stops recurring.

Edited Sep 27, 2019 by Marin Jankovski