2019-03-13 GitLab.com Registry Errors
Summary
On March 13th, 2019, multiple Google Cloud services had incidents that ended up impacting our Registry service, which our customers (as well as GitLab internally) use to push and pull Docker images. While our service was receiving 503s from the Google Cloud Storage (GCS) API during the early hours of the incident, the situation was made worse by image layers being uploaded with empty manifests. Google's incidents were resolved within ~3 hours of the start. From that point on, errors continued to occur on our side, but these were for existing images that had been pushed during the incident and ended up with empty manifests.
- Service(s) affected: Registry
- Team attribution: distribution
- Minutes downtime or degradation: ~3 hours of downtime
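For context on the failure mode: an image whose manifest was stored empty returns a body that fails JSON parsing on pull, which is what surfaces as `unexpected end of JSON input`. A minimal sketch of how to spot such an image against the standard Docker Registry HTTP API v2 (the registry URL, image name, and token below are illustrative, not our production values):

```python
import json
import requests

REGISTRY = "https://registry.example.com"  # illustrative; not the production endpoint
IMAGE = "mygroup/myproject"                # hypothetical image path
TAG = "latest"
TOKEN = "..."                              # assumes a bearer token obtained out of band

# Fetch the manifest via the Docker Registry HTTP API v2.
resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.docker.distribution.manifest.v2+json",
    },
)
resp.raise_for_status()

# An empty body here is the failure mode from this incident: the layer data
# exists in storage, but the manifest was stored empty, so clients fail to parse it.
if not resp.content:
    print(f"{IMAGE}:{TAG} has an empty manifest")
else:
    manifest = json.loads(resp.content)  # raises on truncated/empty JSON
    print(f"{IMAGE}:{TAG} manifest OK, {len(manifest.get('layers', []))} layers")
```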
Timeline
2019-03-13
- 02:27 UTC - First alerts for the Registry
- 03:28 UTC - More alerts for the Registry
- 03:38 UTC - Started an incident on status.io
- 03:46 UTC - Correlated the issue with the GCS incident
- 03:46 UTC - Updated status.io
- 04:41 UTC - Updated status.io, informing users that we were closely following the GCS incident
- 06:04 UTC - Updated status.io, saying the GCS issue was gradually recovering and that we were monitoring
- 06:43 UTC - GCS reported that all issues were resolved on their side, but we were still seeing errors
- 07:00 UTC - Tried resetting a registry host to see whether it could recover gracefully from a dependency failure, to no avail
- 08:32 UTC - Further investigation with a teammate
- 09:30 UTC - We found the root cause of the `unexpected end of JSON input` error
- 09:51 UTC - We rebuilt a sample image that was having the issue and confirmed this fixed the error
- 09:51 UTC - We looked into an image that was causing `invalid checksum digest format`. While we couldn't pin down the exact source of the error, we thought the same fix might work for this one as well. (We did validate the digest values; this was an image from last week.) Note: this error did happen occasionally last week as well, just at a much elevated rate during today's incident.
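For reference, validating the digest values mentioned above amounts to recomputing the SHA-256 of the raw manifest bytes and comparing it against the digest the registry reports; a minimal sketch (registry URL and image name are illustrative):

```python
import hashlib
import requests

REGISTRY = "https://registry.example.com"  # illustrative endpoint
IMAGE = "mygroup/myproject"                # hypothetical image
TAG = "latest"

# Pull the manifest and recompute its digest. Docker content digests have the
# form "sha256:<hex of the raw manifest bytes>", so a mismatch, or a value
# that doesn't fit the sha256:<64 hex chars> shape, is what surfaces as
# "invalid checksum digest format".
resp = requests.get(
    f"{REGISTRY}/v2/{IMAGE}/manifests/{TAG}",
    headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
)
resp.raise_for_status()

reported = resp.headers.get("Docker-Content-Digest", "")
computed = "sha256:" + hashlib.sha256(resp.content).hexdigest()

print(f"reported: {reported}")
print(f"computed: {computed}")
print("match" if reported == computed else "MISMATCH")
```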
Next Steps
- Update status.io informing users of the steps required to resolve the issue
- Investigate the remaining `unexpected end of JSON input` errors, pull the affected images, identify their owners, and ask Support to reach out to them with a message asking them to rebuild and push the images (a sketch of that sweep follows this list)
- Investigate remaining errors
- Identify corrective action items
- Close this issue and create an RCA in the infra queue
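A rough sketch of how that sweep could work against the Registry HTTP API v2: walk the catalog, fetch each manifest, and flag images whose manifests fail to parse. The endpoint paths are the standard v2 API; the registry URL, auth setup, and page size are assumptions, and a production sweep on GitLab.com would additionally need to handle auth scoping and full catalog pagination:

```python
import json
import requests

REGISTRY = "https://registry.example.com"  # illustrative; not the production endpoint
session = requests.Session()               # assumes auth is already configured on the session

def broken_manifests(page_size=100):
    """Yield (image, tag) pairs whose manifests fail to parse as JSON."""
    # GET /v2/_catalog lists repositories (paginated via the `n` parameter
    # and Link headers; only the first page is handled in this sketch).
    repos = session.get(f"{REGISTRY}/v2/_catalog", params={"n": page_size}).json()
    for image in repos.get("repositories", []):
        tags = session.get(f"{REGISTRY}/v2/{image}/tags/list").json().get("tags") or []
        for tag in tags:
            resp = session.get(
                f"{REGISTRY}/v2/{image}/manifests/{tag}",
                headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
            )
            if not resp.ok:
                continue  # skip images we can't read rather than flagging them
            try:
                json.loads(resp.content)
            except ValueError:  # empty or truncated manifest
                yield image, tag

for image, tag in broken_manifests():
    print(f"broken: {image}:{tag}")
```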
Possibly related to https://status.cloud.google.com/incident/storage/19002