The Container Registry needs improvements to error handling
High rates of HTTP 500s from the Container Registry are causing undue stress on both the Support and Infrastructure teams. The most common 500 we see is the error `invalid checksum digest format`. Nearly every time this happens, it's due to corruption of the image in cloud storage. The Infrastructure team has tried to capture each occurrence:
- gitlab-com/gl-infra/production#1164 (closed)
- gitlab-com/gl-infra/production#1009 (closed)
- gitlab-com/gl-infra/production#993 (closed)
- gitlab-com/gl-infra/production#723 (closed)
- gitlab-com/gl-infra/production#1197 (closed)
Over time, we've adjusted our alerts to reduce alert fatigue:
- gitlab-com/runbooks!1486 (merged) - change how we calculate the alert
- gitlab-com/runbooks!1206 (merged) - alerts when exceeding 20% error ratio
- gitlab-com/runbooks!1193 (merged) - alerts when exceeding 10% error ratio
- gitlab-com/runbooks!1148 (merged) - alerts when exceeding 5% error ratio over the course of 5 minutes
- gitlab-com/runbooks!1146 (merged) - alerts when exceeding 50% error ratio
- gitlab-com/runbooks!665 (merged) - introduction of the 5xx alert, fires when we see any 5xx over the course of 1 minute
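All of the threshold alerts above reduce to the same calculation: the ratio of 5xx responses to total responses over a time window. A minimal sketch of that calculation, assuming simple request counts (the function names are illustrative, not the actual Prometheus rules):

```python
def error_ratio(count_5xx: int, count_total: int) -> float:
    """Fraction of requests in the window that returned a 5xx."""
    return count_5xx / count_total if count_total else 0.0

def should_alert(count_5xx: int, count_total: int, threshold: float = 0.05) -> bool:
    """Fire when the windowed error ratio exceeds the threshold,
    e.g. 0.05 for the 5%-over-5-minutes rule."""
    return error_ratio(count_5xx, count_total) > threshold
```

Each merged runbook change above corresponds to tuning the `threshold` (50%, 10%, 20%) or the window over which the counts are taken.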
In some cases, the issue is so bad that we end up blocking end users' requests:
We've also investigated this to the best of our abilities each time we were paged:
- gitlab-com/gl-infra/infrastructure#7605 (closed)
- gitlab-com/gl-infra/infrastructure#7091 (closed)
- gitlab-com/gl-infra/infrastructure#7048 (closed)
Despite our efforts, we continue to exceed our desired SLO on a regular basis:
As seen in the chart below, a few other issues also lead to 500s being shown to our users. The top two error cases are:
- `invalid checksum digest format`
- `unexpected end of JSON input`
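For context, a registry digest has the form `<algorithm>:<hex>`, e.g. `sha256:` followed by 64 hex characters, and the `invalid checksum digest format` error is what surfaces when a stored or requested digest fails that check. A minimal sketch of such a validation, assuming the sha256 form (the helper name is hypothetical, not the registry's actual code):

```python
import re

# Hypothetical validator mirroring the registry's digest format check:
# a digest is "<algorithm>:<hex>", here assumed to be sha256 + 64 hex chars.
DIGEST_RE = re.compile(r"^sha256:[a-f0-9]{64}$")

def is_valid_digest(digest: str) -> bool:
    """Return True when the digest string is well-formed."""
    return DIGEST_RE.match(digest) is not None

print(is_valid_digest("sha256:" + "a" * 64))  # → True (well-formed)
print(is_valid_digest("sha256:deadbeef"))     # → False (truncated/corrupted)
```

A corrupted object in storage can yield a truncated or mangled digest string, which is why this format error correlates so strongly with image corruption.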
It should be noted that this impacts only a few customers at a time, but which customers are impacted varies from day to day.
This negatively impacts the on-call SRE and Support Engineers, who must perform the following for every page of this alert:
- Dig into the logs to find the corrupted image or the end user at the root of the issue
- Reach out to Support to contact and inform the owner of the affected project of the situation
- Remove the corrupted item from object storage
All of this takes a lot of time and effort spread across two teams.
We have high confidence that this is a repeatable issue, and it currently does not look like it's going away. We also store X amount of data in our object storage, and with increased usage of the registry driven by the adoption of Kubernetes, Auto DevOps, and the general growth of GitLab.com, this problem will only continue to get worse.
The Infrastructure team will continue to be paged until we resolve the underlying issue of corrupted images being stored, as this team is the front line for alerts on the Container Registry service. The Support team will continue to be involved as we reach out to customers to help resolve the issue when it appears. It would be wise to involve the Package team to deeply investigate how we can prevent corrupt images and make the necessary changes to the Container Registry to keep this issue from continuously recurring.