Incident Review: Degraded performance on gitlab.com
Incident Review
The DRI for the incident review is the issue assignee.

- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (e.g. external customers, internal customers)
  - Users on GitLab.com.
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
  - Users reported slow requests and were unable to load their CI/CD pipelines; overall degraded performance on GitLab.com.
- How many customers were affected?
  - Everyone on GitLab.com.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
Duration
- It took us about an hour (~1 hour) to figure out where the problem was coming from.
- Total duration: ~3 hours, from 14:15 UTC to 17:57 UTC.
What were the root causes?
We introduced a bug fix that unlocked a large number of pipelines, cascading into unlocking more job artifacts afterwards. This bug fix included a slow query that updated hundreds of rows in the ci_pipelines and ci_job_artifacts tables and increased dead tuples on those tables. This caused database saturation, leading to a large increase in 500s site-wide and degraded performance on GitLab.com.
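Dead-tuple buildup of this kind is visible in PostgreSQL's statistics views. Below is a minimal sketch, assuming direct read access to the database, that reports dead-tuple counts for the two affected tables; the connection string is hypothetical.

```python
# Sketch: inspect dead-tuple buildup on the affected CI tables.
# pg_stat_user_tables is a standard PostgreSQL statistics view;
# the connection string below is hypothetical.
import psycopg2

DEAD_TUPLE_QUERY = """
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE relname IN ('ci_pipelines', 'ci_job_artifacts');
"""

with psycopg2.connect("host=ci-db-replica dbname=gitlabhq_production") as conn:
    with conn.cursor() as cur:
        cur.execute(DEAD_TUPLE_QUERY)
        for relname, live, dead, dead_pct in cur.fetchall():
            print(f"{relname}: {dead} dead tuples ({dead_pct}% of total)")
```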
Incident Response Analysis
- How was the incident detected?
  - An alert on replication lag in the CI database (see the replication-lag sketch below): https://gitlab.slack.com/archives/C101F3796/p1680101410420179
  - The incident was declared: https://gitlab.slack.com/archives/C101F3796/p1680102335670589
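The production alert itself came from GitLab's monitoring stack; as a simplified stand-in, the sketch below measures replay lag directly on a PostgreSQL replica using the standard pg_last_xact_replay_timestamp() function. The connection string is hypothetical.

```python
# Sketch: measure streaming-replication lag directly on a PostgreSQL replica.
# pg_last_xact_replay_timestamp() is a standard PostgreSQL function;
# the connection string is hypothetical.
import psycopg2

LAG_QUERY = "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"

with psycopg2.connect("host=ci-db-replica dbname=gitlabhq_production") as conn:
    with conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        (lag,) = cur.fetchone()
        print(f"replication lag: {lag}")  # a timedelta; alert when it grows
```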
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - An increase in pg_stat_activity entries from the Sidekiq worker was detected (see the sketch below). The worker was then traced back to the MR that introduced it.
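To illustrate that kind of diagnosis: the sketch below, a minimal example assuming direct database access, groups active sessions in pg_stat_activity to surface the dominant query source. How a given Sidekiq worker shows up (application_name, query text) depends on how connections are tagged; the connection string is hypothetical.

```python
# Sketch: surface which applications/queries dominate active sessions.
# pg_stat_activity is a standard PostgreSQL view; the connection string
# is hypothetical.
import psycopg2

ACTIVITY_QUERY = """
SELECT application_name,
       left(query, 80) AS query_head,
       count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY sessions DESC
LIMIT 10;
"""

with psycopg2.connect("host=ci-db dbname=gitlabhq_production") as conn:
    with conn.cursor() as cur:
        cur.execute(ACTIVITY_QUERY)
        for app, query_head, sessions in cur.fetchall():
            print(f"{sessions:>4}  {app or '<none>'}  {query_head}")
```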
- How could time to diagnosis be improved?
- How did we reach the point where we knew how to mitigate the impact?
  - Once the MR that introduced the worker was identified, a rollback was started.
- How could time to mitigation be improved?
  - Feature flagging the new worker, so that we could revert to the previous worker without having to roll back (see the sketch below).
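As a sketch of that mitigation pattern (in Python for consistency with the other sketches; GitLab's actual implementation would use its Ruby Feature.enabled? API), the worker below checks a runtime flag and falls back to the previous code path, so the new behavior can be switched off without a rollback. All names are hypothetical.

```python
# Sketch: gate a new worker code path behind a runtime feature flag so it
# can be disabled without rolling back a deployment. All names here are
# hypothetical; flag_enabled() stands in for a real flag backend.

def flag_enabled(name: str) -> bool:
    """Stand-in for a feature-flag lookup (e.g. Redis- or config-backed)."""
    return False  # default off: the new path must be opted into

def legacy_unlock(pipeline_ids: list[int]) -> None:
    """Previous, known-safe unlock behavior (hypothetical)."""

def new_bulk_unlock(pipeline_ids: list[int]) -> None:
    """New unlock behavior that introduced the slow query (hypothetical)."""

def unlock_pipelines(pipeline_ids: list[int]) -> None:
    if flag_enabled("bulk_unlock_pipelines"):
        new_bulk_unlock(pipeline_ids)  # new path, off by default
    else:
        legacy_unlock(pipeline_ids)    # disable the flag to revert instantly
```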
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - None that we are aware of.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No. We accumulated a lot of artifacts over the years, and "Fix unlocking of job artifacts when pipelines s..." (gitlab-org/gitlab!114426, merged) was supposed to fix this going forward. We did not anticipate how large the number of accumulated artifacts would be, which is genuinely hard to determine.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, a bug fix for gitlab-org/gitlab#387087 (closed) and gitlab-org/gitlab#266958 (closed).
What went well?
- SRE did a great job identifying the worker causing the degradation.
- We communicated to users once the fix was deployed and things returned to normal.
- The database replica on new N2 hardware handled read-only queries on its own very well for the period it was up.
- The main/ci database separation limited the impact to CI queries.