2022-02-18: Creation timestamp of container images shown as null in the UI/API
Incident DRI
Current Status
The root cause was identified (gitlab-org/gitlab#353244 (closed)) and a fix (gitlab-org/gitlab!81056 (merged)) is on the way.
Summary for CMOC notice / Exec summary:
- Customer Impact: Users cannot see the creation timestamp of container images in the GitLab UI/API. There is a workaround described here.
- Service Impact: Container Registry
- Impact Duration: 2022-01-26 - end time UTC (duration in minutes)
- Root cause: The "follow redirect" functionality in the Rails container registry client was incompatible with redirections to Google Cloud CDN (gitlab-org/gitlab#353244 (closed)).
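For context, here is a minimal sketch of the redirect flow involved. This is illustrative only, assuming a Faraday-based client; the middleware, paths, and variable names are assumptions, not the actual Rails registry client code.

```ruby
# Illustrative sketch only; middleware and names are assumptions, not the
# actual Rails registry client. It shows the "follow redirect" step that
# broke once blob downloads started redirecting to Cloud CDN.
require 'faraday'
require 'faraday/follow_redirects'

token  = ENV['REGISTRY_TOKEN']
path   = 'my-group/my-project'   # illustrative repository path
digest = 'sha256:deadbeef'       # illustrative config blob digest

conn = Faraday.new(url: 'https://registry.gitlab.com') do |f|
  f.request :authorization, 'Bearer', token
  f.response :follow_redirects # follows the registry's redirect to GCS or, after the rollout, Cloud CDN
  f.adapter Faraday.default_adapter
end

# The registry answers the blob GET with a redirect to object storage.
# With GCS targets this worked; with Cloud CDN targets the redirected request
# failed (gitlab-org/gitlab#353244), so the blob could not be downloaded.
response = conn.get("/v2/#{path}/blobs/#{digest}")
```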
Timeline
- The problem likely started on 2022-01-26, when we began the gradual rollout of the Cloud CDN feature for GitLab.com. Nothing was reported until 2022-02-18;
- Incident declared on 2022-02-18, in response to a customer report (gitlab-org/gitlab#352999 (closed));
- Root cause and corrective actions identified on 2022-02-18;
- Fix deployed to GitLab.com on date ;
- Resolved on end time UTC .
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2022-02-18
- 11:25 - @jdrpereira declares incident in Slack.
Takeaways
This subtle issue went unnoticed during the preparation and initial release of the Cloud CDN feature for the Container Registry on GitLab.com (so it affected GitLab.com only). The problem only revealed itself when invoking the Rails API/UI from outside GCP; otherwise, the registry keeps redirecting clients to GCS instead of CDN. The only place where Rails performs a blob download is to detect the creation timestamp of a container image. When it fails to obtain that blob, the creation timestamp is displayed as "just now" in the Rails UI and as null in the API. So multiple things went wrong here:
- We did not pay enough attention to the Rails UI/API when testing the CDN feature in staging, so the missing creation timestamps went unnoticed;
- Rails fails silently when it is unable to download a blob from the registry, which did not raise any alarm (e.g. a warn/error log message or an exception in Sentry). In conjunction with the above, this caused us to not realize we had a problem (see the sketch after this list);
- Our QA tests run within GCP, which means that they don't trigger specific behaviors only observed when requests originate outside GCP. This is a blind spot.
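To make the failure mode concrete, here is a hedged sketch of how a swallowed blob-download error turns into a missing timestamp. The method and field names are assumptions, not the real GitLab code.

```ruby
# Hedged sketch, not the actual GitLab code: the image config blob is the only
# blob Rails downloads, and a silently rescued failure leaves created_at empty.
require 'json'

def image_created_at(client, path, config_digest)
  raw = client.blob(path, config_digest) # internally follows the redirect that now fails
  JSON.parse(raw)['created']             # creation timestamp from the image config
rescue StandardError
  nil # swallowed: no warn/error log, no Sentry event
end

# When the blob fetch fails:
#   created_at == nil
#   UI  -> the relative-time helper falls back and shows "just now"
#   API -> the serializer emits "created_at": null
```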
Corrective Actions
- gitlab-org/gitlab#353244 (closed): To fix the actual bug;
- gitlab-org/gitlab#353223: To log all failed requests in the container registry client (see the sketch below);
- gitlab-org/gitlab#353230 (closed): To patch the QA blind spot regarding requests originating outside GCP.
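As a rough illustration of the logging corrective action, the intent is that no failed registry request passes silently. The names below are assumptions, not the final implementation, and it presumes a Rails logging context.

```ruby
# Sketch of the intent behind gitlab-org/gitlab#353223 (assumed names, not the
# final implementation): log every non-success response from the registry client.
def fetch_blob(conn, path, digest)
  response = conn.get("/v2/#{path}/blobs/#{digest}")

  unless response.success?
    Rails.logger.warn(
      "container registry blob download failed: " \
      "status=#{response.status} path=#{path} digest=#{digest}"
    )
  end

  response
end
```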
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, which might include the summary, timeline, or other details, as laid out in our handbook page. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
- What were the root causes?
  - ...
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
- What went well?
  - ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)