Incident Review: Degraded performance on gitlab.com
Incident Review
The DRI for the incident review is the issue assignee.

- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (e.g. external customers, internal customers)
  - Users on GitLab.com.
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
  - Users reported slow requests and were unable to load their CI/CD pipelines; overall degraded performance on GitLab.com.
- How many customers were affected?
  - Everyone on GitLab.com.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
Duration
- It took us about an hour (~1 hour) to figure out where the problem was coming from.
- Total duration: ~3 hours, from 14:15 UTC to 17:57 UTC.
What were the root causes?
We introduced a bug fix that unlocked a large number of pipelines, cascading into unlocking more job artifacts afterwards. This bug fix included a slow query that updated hundreds of rows in the ci_pipelines and ci_job_artifacts tables and increased dead tuples on those tables. This caused database saturation, leading to a large increase in 500s site-wide and degraded performance on GitLab.com.
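Dead-tuple buildup of this kind is visible in PostgreSQL's statistics views. Below is a minimal sketch, assuming direct read access to the database, that reports dead-tuple counts for the two affected tables; the connection string is hypothetical.

```python
# Sketch: inspect dead-tuple buildup on the affected CI tables.
# pg_stat_user_tables is a standard PostgreSQL statistics view;
# the connection string below is hypothetical.
import psycopg2

DEAD_TUPLE_QUERY = """
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE relname IN ('ci_pipelines', 'ci_job_artifacts');
"""

with psycopg2.connect("host=ci-db-replica dbname=gitlabhq_production") as conn:
    with conn.cursor() as cur:
        cur.execute(DEAD_TUPLE_QUERY)
        for relname, live, dead, dead_pct in cur.fetchall():
            print(f"{relname}: {dead} dead tuples ({dead_pct}% of total)")
```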
Incident Response Analysis
- How was the incident detected?
  - An alert on replication lag in the CI database (see the replication-lag sketch below): https://gitlab.slack.com/archives/C101F3796/p1680101410420179
  - The incident was declared: https://gitlab.slack.com/archives/C101F3796/p1680102335670589
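The production alert itself came from GitLab's monitoring stack; as a simplified stand-in, the sketch below measures replay lag directly on a PostgreSQL replica using the standard pg_last_xact_replay_timestamp() function. The connection string is hypothetical.

```python
# Sketch: measure streaming-replication lag directly on a PostgreSQL replica.
# pg_last_xact_replay_timestamp() is a standard PostgreSQL function;
# the connection string is hypothetical.
import psycopg2

LAG_QUERY = "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"

with psycopg2.connect("host=ci-db-replica dbname=gitlabhq_production") as conn:
    with conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        (lag,) = cur.fetchone()
        print(f"replication lag: {lag}")  # a timedelta; alert when it grows
```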
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - An increase in pg_stat_activity entries from the Sidekiq worker was detected (see the sketch below). The worker was then traced back to the MR that introduced it.
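To illustrate that kind of diagnosis: the sketch below, a minimal example assuming direct database access, groups active sessions in pg_stat_activity to surface the dominant query source. How a given Sidekiq worker shows up (application_name, query text) depends on how connections are tagged; the connection string is hypothetical.

```python
# Sketch: surface which applications/queries dominate active sessions.
# pg_stat_activity is a standard PostgreSQL view; the connection string
# is hypothetical.
import psycopg2

ACTIVITY_QUERY = """
SELECT application_name,
       left(query, 80) AS query_head,
       count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2
ORDER BY sessions DESC
LIMIT 10;
"""

with psycopg2.connect("host=ci-db dbname=gitlabhq_production") as conn:
    with conn.cursor() as cur:
        cur.execute(ACTIVITY_QUERY)
        for app, query_head, sessions in cur.fetchall():
            print(f"{sessions:>4}  {app or '<none>'}  {query_head}")
```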
- How could time to diagnosis be improved?
- How did we reach the point where we knew how to mitigate the impact?
  - Once the MR that introduced the worker was identified, a rollback was started.
- How could time to mitigation be improved?
  - Feature flagging the new worker, so that we could revert to the previous worker without having to roll back (see the sketch below).
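As a sketch of that mitigation pattern (in Python for consistency with the other sketches; GitLab's actual implementation would use its Ruby Feature.enabled? API), the worker below checks a runtime flag and falls back to the previous code path, so the new behavior can be switched off without a rollback. All names are hypothetical.

```python
# Sketch: gate a new worker code path behind a runtime feature flag so it
# can be disabled without rolling back a deployment. All names here are
# hypothetical; flag_enabled() stands in for a real flag backend.

def flag_enabled(name: str) -> bool:
    """Stand-in for a feature-flag lookup (e.g. Redis- or config-backed)."""
    return False  # default off: the new path must be opted into

def legacy_unlock(pipeline_ids: list[int]) -> None:
    """Previous, known-safe unlock behavior (hypothetical)."""

def new_bulk_unlock(pipeline_ids: list[int]) -> None:
    """New unlock behavior that introduced the slow query (hypothetical)."""

def unlock_pipelines(pipeline_ids: list[int]) -> None:
    if flag_enabled("bulk_unlock_pipelines"):
        new_bulk_unlock(pipeline_ids)  # new path, off by default
    else:
        legacy_unlock(pipeline_ids)    # disable the flag to revert instantly
```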
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - None that we are aware of.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No. We accumulated a lot of artifacts over the years, and "Fix unlocking of job artifacts when pipelines s..." (gitlab-org/gitlab!114426, merged) was supposed to fix this going forward. We did not anticipate how large the number of accumulated artifacts would be, which is genuinely hard to determine.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, a bug fix for gitlab-org/gitlab#387087 (closed) and gitlab-org/gitlab#266958 (closed).
What went well?
- SRE did a great job identifying the worker causing the degradation.
- We communicated to users once the fix was deployed and things returned to normal.
- The database replica on new N2 hardware handled read-only queries on its own very well for the period it was up.
- The main/ci database separation limited the impact to CI queries.