RCA: Excessive buffer read in replicas for ci_job_artifacts
Summary
Service(s) affected: SaaS general performance, MR page errors, timeouts for some actions
Team attribution: group::testing
Minutes downtime or degradation: 219
Problem
Bulk updates run against the ci_job_artifacts table caused a frequently used index to accumulate many dead items that had not yet been vacuumed away, which resulted in a performance regression.
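For context, dead-tuple accumulation of this kind is visible in PostgreSQL's statistics views. A minimal check, assuming direct psql access to the database (not the exact query used during the incident), might look like:

```sql
-- Illustrative only: how many dead tuples have accumulated on ci_job_artifacts
-- and when (auto)vacuum last ran. Thresholds and the actual observability
-- tooling used during the incident are not part of this RCA.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'ci_job_artifacts';
```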
Fix
- Because we had a feature flag on the MR that triggered the incident, we were able to stop the initial bleeding.
- After investigation, we were able to vacuum the dead items successfully even though we weren't completely certain of the root cause. Running `VACUUM ci_job_artifacts` on the primary database cleaned up the dead index items (see the sketch below). The replicas picked up that change, their index scans immediately became efficient again, and they returned to a normal health state.
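For reference, the cleanup amounted to a manual vacuum of the table on the primary. The commands below are taken from this RCA (the plain `VACUUM` from the Fix description, the fuller variant from the timeline) and are shown only to make the remediation concrete:

```sql
-- Plain vacuum as described above: removes dead index items on the primary;
-- replicas then replay the cleaned-up pages and their index scans recover.
VACUUM ci_job_artifacts;

-- Fuller variant actually run during remediation (per the timeline below):
VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts;
```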
Impact & Metrics
| Question | Answer |
|---|---|
| What was the impact? | General performance degradation and timeouts for some actions |
| Who was impacted? | All users of GitLab SaaS |
| How did this impact customers? | Slow site performance; unable to merge changes through the normal merge button |
(Figure: Patroni Apdex index during the incident)
Detection & Response
| Question | Answer |
|---|---|
| When was the incident detected? | 2021-11-22 13:52 |
| How was the incident detected? | Team members began experiencing slow response times and Bad Gateway errors. |
| Did alerting work as expected? | Yes |
| How long did it take from the start of the incident to its detection? | Unclear, but at most roughly 30 minutes. |
| How long did it take from detection to remediation? | 219 minutes |
| What steps were taken to remediate? | Blocked a heavy endpoint via Cloudflare. Set the `ci_destroy_unlocked_job_artifacts` feature flag to `false`. Ran `VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts;` (see the monitoring sketch below). |
| Were there any issues with the response? | No |
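The manual `VACUUM` in the table above is a long-running operation on a table this size. A minimal way to watch its progress from a separate session, assuming PostgreSQL 9.6+ and direct database access (illustrative only, not a command recorded during the incident):

```sql
-- Illustrative: watch the progress of a running VACUUM on ci_job_artifacts
-- from another session. The exact column set varies between Postgres versions.
SELECT p.pid,
       c.relname,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total,
       p.index_vacuum_count
FROM pg_stat_progress_vacuum AS p
JOIN pg_class AS c ON c.oid = p.relid
WHERE c.relname = 'ci_job_artifacts';
```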
MR Checklist
Consider these questions if a code change introduced the issue.
| Question | Answer |
|---|---|
| Was the MR acceptance checklist marked as reviewed in the MR? | yes |
| Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | No, there are other corrective actions that should be taken instead. The MR that triggered the incident mainly just stepped on an already existing landmine that would have been found sooner or later. |
Timeline
All times UTC.
2021-11-22
- 13:52 - @rehab declares incident in Slack.
- 14:11 - @igorwwwwwwwwwwwwwwwwwwww blocks one of the heavy endpoints via Cloudflare.
- 14:12 - @rehab reverts FF `jupyter_clean_diffs` to `false`.
- 14:47 - @igorwwwwwwwwwwwwwwwwwwww enabled the `refresh_widget` route for the gitlab-com/runbooks project.
- 16:09 - FF `ci_destroy_unlocked_job_artifacts` set to `false`.
- 16:35 - @abrandl started to run `VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts;`.
- 17:14 - VACUUM finished on `index_ci_job_artifacts_on_job_id_and_file_type`, which has most of the read traffic across indexes on this table.
- 17:31 - Operation should be returning to normal for all customers.
- 17:45 - VACUUM finished completely.
- 20:00 - Autodeploy confirmed working again and incident marked as resolved. Work to re-enable disabled FFs is continuing.
2021-11-23
- 16:13 - Canary stage is undrained.
- 23:30 - Sidekiq job `ExpireBuildArtifactsWorker` is scheduled to not run: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1368 (merged)
- 23:36 - Increased collection for `auto_explain` instrumentation to 10% and >1s (see the configuration sketch after this timeline).
2021-11-24
- 00:46 - Production deploy completed.
- 01:17 - Post-deploy migrations in production complete.
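The `auto_explain` change at 23:36 refers to PostgreSQL's `auto_explain` module. A minimal sketch of what "10% and >1s" plausibly maps to; the exact parameter values and how they were applied (likely via configuration management rather than ad-hoc SQL) are assumptions:

```sql
-- Assumed mapping of "10% and >1s" onto auto_explain parameters. Assumes the
-- auto_explain module is already loaded via (session|shared)_preload_libraries.
ALTER SYSTEM SET auto_explain.sample_rate = 0.1;        -- log ~10% of eligible queries
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';  -- only queries slower than 1 second
SELECT pg_reload_conf();                                -- apply without a restart
```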
Root Cause Analysis
What went well
- Because we had a feature flag on the MR that triggered the incident, we were able to stop the initial bleeding.
- After investigation, we were able to vacuum the dead items successfully even though we weren't completely certain of the root cause.
- We had recently added the debugging symbols back to Postgres, which allowed us to efficiently diagnose that the query planner itself (rather than the query executor) was running the expensive index scans (see the sketch below).
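For readers reproducing this kind of diagnosis without profiling tools, PostgreSQL 13 and newer can attribute buffer reads to the planning phase when `EXPLAIN` is given the `BUFFERS` option. This is an illustrative sketch with hypothetical parameter values, not the procedure used during the incident (which relied on debugging symbols and profiling):

```sql
-- Illustrative: with BUFFERS, PostgreSQL 13+ reports buffers touched during
-- the *planning* phase in a separate "Planning:" section of the output, which
-- is where excessive reads of dead index entries would show up.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM ci_job_artifacts
WHERE job_id = 12345   -- hypothetical parameters
  AND file_type = 1
LIMIT 1;
```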
What can be improved
Tuning autovacuum, partitioning large tables, and pacing bulk data changes are the main proposed preventative measures; a sketch of per-table autovacuum tuning follows below.
A future release of Postgres may give standby databases the same hint-based index-scan efficiency that the primary database already has, but that is a far-off benefit.
Several other ideas were discussed and ultimately discarded because they were infeasible, unreliable, or had undesirable side effects.
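As an illustration of the autovacuum tuning measure, thresholds can be lowered per table via storage parameters. The values below are placeholders, not the ones chosen in the corrective-action issue:

```sql
-- Hypothetical values: trigger autovacuum on ci_job_artifacts after far fewer
-- dead tuples than the cluster-wide defaults would require.
ALTER TABLE ci_job_artifacts SET (
  autovacuum_vacuum_scale_factor = 0.005,  -- default is typically 0.2 (20% of the table)
  autovacuum_vacuum_threshold    = 10000   -- minimum dead tuples before a vacuum triggers
);
```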
Corrective actions
| Issue | Description | Timeline | Owners | Completed |
|---|---|---|---|---|
| https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14723 | Lower the autovacuum thresholds for the ci_job_artifacts table | Short term | infrastructure | |
| https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13891 | Improve our observability with pgwatch2 | Short term | infrastructure | |
| &6211 | Partition big CI tables | Mid/Long term | group::pipeline execution or group::testing | |
| #346980 (moved) | Review autovacuum strategy for all types of tables holistically | Mid/Long term | infrastructure | |
