Incident review for CI runners errors 8204 (2023-01-05)
#8204 (closed)
Incident Review

- The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.

Customer Impact

- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Customers of GitLab.com using CI runners with jobs that started in the affected window.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Some job traces were not viewable and job statuses were never updated.
- How many customers were affected?
  - Difficult to say precisely; a rough estimate is hundreds to thousands of users.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - During the call it was mentioned that around 200,000 jobs were likely affected.

What were the root causes?

We discovered that changing a column's default value in a post-deployment migration is not safe, because Rails caches column default values and does not send a value to the database when it believes the default is being used. This is an edge case that our migration tooling did not take into account.

Running the post-deployment migration bump_default_partition_id_value_for_ci_tables resulted in a substantial elevation in requests to the patroni-ci database shard and in 500 errors on the API: Rails's cached, stale default values for the partition_id column led to foreign key violations.
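The failure mode above can be sketched in plain Ruby. This is a simulation of the caching behavior, not actual Rails or GitLab code, and the 100/101 default values are invented for illustration:

```ruby
# Simulation of an ORM that caches column defaults at process boot, the
# way Rails caches column information. Illustrative only; the 100/101
# defaults are made up.
class FakeModel
  # Default for partition_id as it was when the process started.
  CACHED_DEFAULTS = { 'partition_id' => 100 }.freeze

  # Build the column list for an INSERT: attributes whose value equals
  # the cached default are omitted, leaving the database to fill them in.
  def self.insert_columns(attributes)
    attributes.reject { |col, value| CACHED_DEFAULTS[col] == value }.keys
  end
end

# A post-deployment migration has since changed the database default to
# 101, but the running process still caches 100. An insert that intends
# partition_id = 100 omits the column entirely...
columns = FakeModel.insert_columns('partition_id' => 100, 'name' => 'job')
puts columns.inspect  # => ["name"]
# ...so the database applies its new default (101) instead, and the row
# references a partition the caller did not intend, which is how foreign
# key violations can arise.
```

Restarting the processes (or calling `reset_column_information` in Rails) refreshes the cache, which is why changing defaults is only safe when coordinated with deployments rather than done in a post-deployment migration.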

Incident Response Analysis

- How was the incident detected?
  - The incident was detected by a page triggered by an alert on service error rates.
- How could detection time be improved?
  - This is about as good as it gets; the alternative, babysitting the error logs of every relevant service during all changes and deployment phases, does not scale for humans.
- How was the root cause diagnosed?
  - A developer involved with the migration hypothesized that the root cause was an incorrect partition_id for some or all of the build_ci_* tables. This was verified by checking the column in a sample of tables using SQL statements in a psql session on the ci database shard.
- How could time to diagnosis be improved?
  - I (the on-call engineer) sat idle for about five minutes before it was asked whether anyone was checking the value in the table; I apparently missed the ask amid the commotion on the incident conference bridge call.
- How did we reach the point where we knew how to mitigate the impact?
  - Once the root cause was concretely established, the mitigation was fairly simple: marking the migration as completed.
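In Rails, a migration counts as run once its version is recorded in the schema_migrations table, so "marking it as completed" means recording that version without re-executing the migration. A toy plain-Ruby sketch of the idea (not GitLab's actual tooling; version strings are invented):

```ruby
# Toy illustration of "marking a migration as completed": a migration is
# pending until its version is recorded, so recording the version by
# hand makes the migrator skip it. Version strings are made up.
class ToyMigrator
  def initialize
    @completed = []
  end

  # Record a version without actually running the migration.
  def mark_completed(version)
    @completed << version
  end

  # Pending migrations are all known versions minus the recorded ones.
  def pending(all_versions)
    all_versions - @completed
  end
end

migrator = ToyMigrator.new
migrator.mark_completed('20230105_bump_default_partition_id_value_for_ci_tables')
puts migrator.pending(
  ['20230104_some_other_migration',
   '20230105_bump_default_partition_id_value_for_ci_tables']
).inspect  # => ["20230104_some_other_migration"]
```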
- How could time to mitigation be improved?
  - Perhaps proposals for actions the on-call engineer should take during incident triage, troubleshooting, diagnosis, mitigation, and remediation could be called out more clearly and directly, especially if they are not acknowledged on the call when first made.

Post Incident Analysis

- Did we have other events in the past with the same root cause?
  - Unlikely.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Unlikely.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - A production code change.

What went well?

- This incident was declared in Slack in response to a page about elevated error rates in the ci-runners service.
- Release management was quick to suspect that a post-deployment migration was causing the issues and reached out to the appropriate groups.
- The developers quickly joined the conference bridge and began discussing avenues of investigation to confirm suspected causes and options for remediation.