2019-03-13 GitLab.com Registry Errors
Summary
On March 13th, 2019, multiple Google Cloud services had incidents that ended up impacting our Registry service, which our customers (as well as GitLab internally) use to push and pull Docker images. While our service was receiving 503s from the Google Cloud Storage (GCS) API during the early hours of the incident, the situation was made worse by image layers being uploaded with empty manifests. Google's incidents were resolved within ~3 hours of the start. From that point on, errors continued to occur on our side, but these were for existing images that had been pushed during the incident and ended up with empty manifests.
- Service(s) affected: Registry
- Team attribution: distribution
- Minutes downtime or degradation: ~3 hours of downtime
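For context on the failure mode: an image whose manifest was stored empty returns a body that fails JSON parsing on pull, which is what surfaces as `unexpected end of JSON input`. A minimal sketch of how to spot such an image against the standard Docker Registry HTTP API v2 (the registry URL, image name, and token below are illustrative, not our production values):

```python
import json
import requests

REGISTRY = "https://registry.example.com"  # illustrative; not the production endpoint
IMAGE = "mygroup/myproject"                # hypothetical image path
TAG = "latest"
TOKEN = "..."                              # assumes a bearer token obtained out of band

# Fetch the manifest via the Docker Registry HTTP API v2.
resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.docker.distribution.manifest.v2+json",
    },
)
resp.raise_for_status()

# An empty body here is the failure mode from this incident: the layer data
# exists in storage, but the manifest was stored empty, so clients fail to parse it.
if not resp.content:
    print(f"{IMAGE}:{TAG} has an empty manifest")
else:
    manifest = json.loads(resp.content)  # raises on truncated/empty JSON
    print(f"{IMAGE}:{TAG} manifest OK, {len(manifest.get('layers', []))} layers")
```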
Timeline
2019-03-13
- 02:27 UTC - First alerts for the Registry
- 03:28 UTC - More alerts for the Registry
- 03:38 UTC - Started an incident on status.io
- 03:46 UTC - Correlated the issue with the GCS incident
- 03:46 UTC - Updated status.io
- 04:41 UTC - Updated status.io, informing users that we were closely following the GCS incident
- 06:04 UTC - Updated status.io, saying the GCS issue was gradually recovering and that we were monitoring
- 06:43 UTC - GCS reported that all issues were resolved on their side, but we were still seeing errors
- 07:00 UTC - Tried resetting a registry host to see whether it could recover gracefully from a dependency failure, to no avail
- 08:32 UTC - Further investigation with a teammate
- 09:30 UTC - We found the root cause of the `unexpected end of JSON input` error
- 09:51 UTC - We rebuilt a sample image that was having the issue and confirmed this fixed the error
- 09:51 UTC - We looked into an image that was causing `invalid checksum digest format`. While we couldn't pin down the exact source of the error, we thought the same fix might work for this one as well. (We did validate the digest values; this was an image from last week.) Note: this error did happen occasionally last week as well, just at a much elevated rate during today's incident.
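For reference, validating the digest values mentioned above amounts to recomputing the SHA-256 of the raw manifest bytes and comparing it against the digest the registry reports; a minimal sketch (registry URL and image name are illustrative):

```python
import hashlib
import requests

REGISTRY = "https://registry.example.com"  # illustrative endpoint
IMAGE = "mygroup/myproject"                # hypothetical image
TAG = "latest"

# Pull the manifest and recompute its digest. Docker content digests have the
# form "sha256:<hex of the raw manifest bytes>", so a mismatch, or a value
# that doesn't fit the sha256:<64 hex chars> shape, is what surfaces as
# "invalid checksum digest format".
resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
)
resp.raise_for_status()

reported = resp.headers.get("Docker-Content-Digest", "")
computed = "sha256:" + hashlib.sha256(resp.content).hexdigest()

print(f"reported: {reported}")
print(f"computed: {computed}")
print("match" if reported == computed else "MISMATCH")
```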
Next Steps
- Update status.io informing users of the steps required to resolve the issue
- Investigate the remaining `unexpected end of JSON input` errors, pull the affected images, identify their owners, and ask Support to reach out to them with a message asking them to rebuild and push the images (a sketch of that sweep follows this list)
- Investigate remaining errors
- Identify corrective action items
- Close this issue and create an RCA in the infra queue
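A rough sketch of how that sweep could work against the Registry HTTP API v2: walk the catalog, fetch each manifest, and flag images whose manifests fail to parse. The endpoint paths are the standard v2 API; the registry URL, auth setup, and page size are assumptions, and a production sweep on GitLab.com would additionally need to handle auth scoping and full catalog pagination:

```python
import json
import requests

REGISTRY = "https://registry.example.com"  # illustrative; not the production endpoint
session = requests.Session()               # assumes auth is already configured on the session

def broken_manifests(page_size=100):
    """Yield (image, tag) pairs whose manifests fail to parse as JSON."""
    # GET /v2/_catalog lists repositories (paginated via the `n` parameter
    # and Link headers; only the first page is handled in this sketch).
    repos = session.get(f"{REGISTRY}/v2/_catalog", params={"n": page_size}).json()
    for image in repos.get("repositories", []):
        tags = session.get(f"{REGISTRY}/v2/{image}/tags/list").json().get("tags") or []
        for tag in tags:
            resp = session.get(
                f"{REGISTRY}/v2/{image}/manifests/{tag}",
                headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
            )
            if not resp.ok:
                continue  # skip images we can't read rather than flagging them
            try:
                json.loads(resp.content)
            except ValueError:  # empty or truncated manifest
                yield image, tag

for image, tag in broken_manifests():
    print(f"broken: {image}:{tag}")
```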
Possibly related to https://status.cloud.google.com/incident/storage/19002