How do we increase uptime of the API that authenticates the registry?
Problem to solve
If you use the GitLab Container Registry for your production runtime environments, you need the highest possible reliability so that your production applications scale and run reliably.
The reliability of a distributed system is at most the reliability of its least reliable component. So to improve the reliability of the registry, we have to improve the reliability of each of its dependencies.
The registry is a separate service from the GitLab Rails app. However, the registry auth service still depends on the Rails app. This can create a situation where the registry is technically available but private images are unreachable, because requests for them cannot be authenticated.
The registry consistently has higher availability than the GitLab API. For example, in the last 30 days registry availability was 99.99% and GitLab API availability was 99.77% (source). Looking back to the beginning of the year, the registry SLA trend is also far more consistent and stable (source).
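To make the effect of this dependency concrete, here is a rough sketch of the arithmetic using the 30-day figures above. It assumes that a private image pull needs both the registry and the API to be up and that the two fail independently, which is a simplification. The function names are made up for illustration.

```python
def serial_availability(*components: float) -> float:
    """Availability of a request that needs every component to be up.

    Assumes independent failures, so the availabilities multiply.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of unavailability implied by an availability over `days` days."""
    return (1.0 - availability) * days * 24 * 60


# 30-day figures quoted above.
registry = 0.9999
gitlab_api = 0.9977

# Assumption: a private image pull needs both the registry and the auth API.
private_pull = serial_availability(registry, gitlab_api)

print(f"private pull availability: {private_pull:.4%}")
print(f"registry downtime budget:  {downtime_minutes(registry):.1f} min / 30 days")
print(f"API downtime budget:       {downtime_minutes(gitlab_api):.1f} min / 30 days")
```

Under these assumptions the combined figure is about 99.76%, slightly below the API alone, so private image pulls are effectively bounded by the API and not by the registry.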
Looking at the history of recent incidents, the following are relevant to this discussion:
Description | Start Date | End Date | Issue |
---|---|---|---|
Brief downtime and degraded performance from db failover | October 29, 2020 21:38 UTC | October 30, 2020 00:00 UTC | gitlab-com/gl-infra/production#2937 (closed) |
GitLab.com is experiencing degraded availability because of high database load | October 24, 2020 09:36 UTC | October 24, 2020 11:32 UTC | gitlab-com/gl-infra/production#2885 (closed) |
Both incidents appear to have been caused by a database issue rather than by the application (API) itself.
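For reference, the duration of each incident can be computed from the start and end times in the table. This is only a sketch: the dictionary keys are shorthand labels, and because both incidents were reported as degraded performance or availability, these durations are not necessarily full downtime.

```python
from datetime import datetime, timezone

# Start and end times copied from the incident table (all UTC).
incidents = {
    "production#2937": (
        datetime(2020, 10, 29, 21, 38, tzinfo=timezone.utc),
        datetime(2020, 10, 30, 0, 0, tzinfo=timezone.utc),
    ),
    "production#2885": (
        datetime(2020, 10, 24, 9, 36, tzinfo=timezone.utc),
        datetime(2020, 10, 24, 11, 32, tzinfo=timezone.utc),
    ),
}

# Duration of each incident in whole minutes.
durations = {
    issue: int((end - start).total_seconds() // 60)
    for issue, (start, end) in incidents.items()
}

for issue, minutes in durations.items():
    print(f"{issue}: {minutes} min")
```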
Proposal
Determine a path forward that will allow us to increase uptime of the API that authenticates the registry.