2021-07-20: 400 errors while uploading CI artifacts
Timeline
View recent production deployment and configuration events / gcp events (internal only)
All times UTC.
2021-07-20

- 23:10 - @ggillies approves and merges MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1037 (merged), causing a CI pipeline to roll the change out
- 23:15 - Initial errors detected in logs (identified post-incident). These errors increased over the next 20 minutes, associated with the Kubernetes pod rollout.
- 23:27 - A gradual increase in 400s to the /api/v4/jobs/xxxxx/artifacts endpoint
- 23:42 - Pipeline for gitlab-com/gl-infra/k8s-workloads/gitlab-com!1037 (merged) finishes rolling out the change to production
- 23:58 - @stanhu declares incident in Slack. Sees a 400 error in https://gitlab.com/gitlab-org/gitaly/-/jobs/1439734466 and confirms there is an increased error rate.
2021-07-21

- 00:20 - CMOC paged
- 00:21 - @ggillies opens an MR to revert the suspected bad change and merges/applies it: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1038 (merged)
- 00:26 - Pipeline started to roll back the recent configuration change
- 01:06 - Pipeline for revert MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1038 (merged) finishes rolling out the change to production; impact is resolved. Last error message in logs.
Corrective Actions
- Update Guidelines for deploying changes for the gitlab-com repo: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1042 (merged)
- Remove allow failure in our QA test (see the sketch after this list): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1046 (merged)
- Learn why Rails failed with our directory permissions: gitlab-org/gitlab#336609 (closed)
- Determine/Implement some method of monitoring and alerting for this situation: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13838
- Ask Rails to log this failure scenario: gitlab-org/gitlab!66595 (merged)
- For consideration: Improvement to mounting options for our Helm chart related to the temporary directory: gitlab-org/charts/gitlab#2816 (closed)
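To make the QA corrective action above concrete: in GitLab CI, a job with `allow_failure: true` can fail without blocking the pipeline, which is how a broken verification step can let a bad rollout proceed. A minimal sketch of the fixed job follows; the job name and script are hypothetical (the real job lives in the k8s-workloads pipeline configuration):

```yaml
# Hypothetical QA job in .gitlab-ci.yml. It previously carried
# `allow_failure: true`, so a failing QA run did not block the rollout.
# Removing the key restores the default (allow_failure: false), so a
# QA failure now stops the pipeline before the change reaches production.
qa:smoke:
  stage: test
  script:
    - ./bin/qa-smoke-test  # hypothetical test entrypoint
```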
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
The Delivery team is working on a mitigation for issue delivery#1864 (closed), needed due to a bug in a GitLab feature being worked on here: gitlab-org&6396 (closed). The chosen solution was to configure Kubernetes to mount a new volume over the path that Rails and Workhorse use for temporary files. The engineer who proposed this solution did not know that this mount would have any negative impact. While the solution was geared towards a service that has not yet been deployed into production (the Web fleet migration into Kubernetes, see &272 (closed)), a limitation of our Helm chart forced the configuration to be applied to all webservice deployments. This included the API service, which handles CI artifact uploads.
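To make the shape of the change concrete, here is a minimal sketch of mounting a fresh volume over a shared temporary directory in a Kubernetes Deployment. This is not the actual diff from !1037; the names, image, and mount path are illustrative assumptions.

```yaml
# Illustrative sketch only (not the !1037 diff): an emptyDir volume
# mounted over the directory that Rails and Workhorse share for
# temporary files. Names, image, and path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webservice
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webservice
  template:
    metadata:
      labels:
        app: webservice
    spec:
      containers:
        - name: webservice
          image: registry.example.com/gitlab/webservice:latest # placeholder
          volumeMounts:
            - name: shared-tmp
              mountPath: /tmp # assumed temp path; shadows the image's /tmp
      volumes:
        - name: shared-tmp
          emptyDir: {} # fresh, initially empty volume
```

One relevant property of this pattern: a freshly mounted emptyDir is created owned by root, so an application running as a non-root user can find itself without the ownership or permissions it expects on its temporary directory, which is consistent with the permission question investigated in gitlab-org/gitlab#336609.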
When this change rolled out, no alerts fired; HTTP 400s are not a typical alerting target. The problem was instead first noticed by engineers, both external and internal to GitLab, and after a short investigation an incident was declared.
The investigation took some time, but the configuration change was identified as the cause and reverted.
- Service(s) affected: GitLab Rails
- Team attribution: Delivery
- Time to detection: 43 minutes
- Minutes downtime or degradation: 89 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Any CI job that uploads artifacts
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - An HTTP 400 response during the artifact upload, which resulted in a failed CI job
- How many customers were affected?
  - All customers that had a job uploading artifacts during the incident timeline.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - 45,866 artifact upload attempts failed. Our CI retries uploads 3 times by default, so 45,866 / 3 equates to roughly 15,300 failed CI jobs.
What were the root causes?
- A configuration change that led Rails to mishandle these requests. The change was introduced via merge request gitlab-com/gl-infra/k8s-workloads/gitlab-com!1037 (merged)
Incident Response Analysis
- How was the incident detected?
  - Humans discovering that their CI jobs had failed.
- How could detection time be improved?
  - Add monitoring for API endpoints experiencing excessive HTTP 400s, and/or additional error reporting on Rails' inability to use its temporary directory. A sketch of such an alert follows this list.
  - Add error reporting in the Rails application during the failed request.
- How was the root cause diagnosed?
  - By correlating the incident start time with recent configuration changes introduced into our infrastructure.
- How could time to diagnosis be improved?
  - We need to improve our alerting capabilities, which would have helped us determine this more quickly. We currently alert on 500s but not on 400s. The Workhorse logs also did not provide any additional information about the request response, making it harder to determine the issue. That has already been corrected.
- How did we reach the point where we knew how to mitigate the impact?
  - Uncertainty about the change led to skepticism, but ultimately our policy of ruling out recent changes led us down this path.
- How could time to mitigation be improved?
  - We need to improve our alerting capabilities, which would have helped us determine the cause more quickly.
- What went well?
  - ...
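As mentioned in the detection-time answer above, one possible shape for the missing alert is a Prometheus rule on the 4xx ratio for the artifacts endpoint. This is a hedged sketch: the `http_requests_total` metric with `status`/`path` labels is a common convention, not necessarily what GitLab.com's monitoring exposes, and the threshold is arbitrary.

```yaml
# Sketch of a Prometheus alerting rule for elevated HTTP 400s on
# artifact uploads. Metric and label names are assumptions; adapt to
# whatever the real Workhorse/webservice metrics expose.
groups:
  - name: artifact-upload-errors
    rules:
      - alert: ArtifactUpload400Spike
        expr: |
          sum(rate(http_requests_total{path=~"/api/v4/jobs/.*/artifacts", status="400"}[5m]))
            /
          sum(rate(http_requests_total{path=~"/api/v4/jobs/.*/artifacts"}[5m]))
            > 0.05
        for: 10m
        labels:
          severity: s3 # illustrative severity label
        annotations:
          summary: "Elevated 400 rate on CI artifact uploads"
          description: "More than 5% of artifact upload requests returned HTTP 400 over the last 10 minutes."
```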
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes, though not on GitLab.com: self-managed customers had reported a similar situation caused by a misconfigured chart. See gitlab-org/charts/gitlab#1651 (closed)
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes: the configuration change in gitlab-com/gl-infra/k8s-workloads/gitlab-com!1037 (merged)
Lessons Learned
- We learned that Rails is, with reason, particular about its temporary directory usage. This was unknown during the change request, which therefore allowed the change to proceed without question.
- We lack appropriate alerting on this failure, so humans had to initiate the incident. No metrics alerted anyone to the situation, and during the incident time frame our API dashboards showed nothing unusual either: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&from=1626818835299&to=1626833235675 We instead had to rely on logs to show us something was wrong.
- We had a problem related to communications during this incident; the discussion around this is being handled outside of the context of this incident, see: gitlab-com/support/support-team-meta#3653 (closed)
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)