RCA: Excessive buffer read in replicas for ci_job_artifacts

Summary

Service(s) affected: SaaS general performance, MR page errors, timeouts for some actions
Team attribution: group::testing
Minutes downtime or degradation: 219

Problem

A performance regression was caused by a frequently used index on the ci_job_artifacts table that had accumulated a large number of dead items which had not yet been vacuumed away. The dead items were produced by bulk updates on ci_job_artifacts. Index scans on the primary can mark dead index entries so that later scans skip them, but replicas cannot set these hints, so their index scans kept reading the dead items on every lookup until VACUUM removed them, which is why the excessive buffer reads showed up on the replicas.
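
As a hypothetical illustration of the kind of check involved (not the exact query used during the incident), dead-item accumulation and vacuum history for the table can be read from pg_stat_user_tables:

```sql
-- Sketch: inspect dead-tuple buildup and (auto)vacuum history for
-- ci_job_artifacts. Illustrative only; not the query used during
-- the incident.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup, 1), 2) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'ci_job_artifacts';
```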

Fix

  • Because we had a feature flag on the MR that triggered the incident, we were able to stop the initial bleeding.
  • After investigation, we were able to vacuum the dead items successfully even though we weren't completely certain of the root cause. Running VACUUM on ci_job_artifacts on the primary database cleaned up the dead index items; once the replicas replayed that change, their index scans immediately became efficient again and the replicas returned to a normal health state (see the sketch below).
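
For reference, a minimal sketch of the cleanup command (as quoted later in this review) plus an illustrative way to watch index buffer reads settle afterwards; the monitoring query is an assumption, not something recorded in the incident:

```sql
-- The cleanup run on the primary; replicas benefit once the cleaned-up
-- index pages are replicated.
VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts;

-- Illustrative follow-up: cumulative block reads for the table's
-- indexes. On the replicas, the growth rate of idx_blks_read should
-- drop sharply once the dead items are gone.
SELECT indexrelname, idx_blks_read, idx_blks_hit
FROM   pg_statio_user_indexes
WHERE  relname = 'ci_job_artifacts'
ORDER  BY idx_blks_read DESC;
```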

Impact & Metrics

Start with the following:

  • What was the impact? General performance degradation and timeouts for some actions.
  • Who was impacted? All users of GitLab SaaS.
  • How did this impact customers? Slow site performance; unable to merge changes through the normal merge button.

Patroni Apdex Index During Incident

[Chart: patroni-apdex]

Detection & Response

Start with the following:

  • When was the incident detected? 2021-11-22 12:52.
  • How was the incident detected? Team members began experiencing slow response times and Bad Gateway errors.
  • Did alarming work as expected? Yes.
  • How long did it take from the start of the incident to its detection? Unclear, but it was within 30 minutes.
  • How long did it take from detection to remediation? 219 minutes.
  • What steps were taken to remediate? Blocked a heavy endpoint via Cloudflare, set the ci_destroy_unlocked_job_artifacts feature flag to false, and ran VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts; (see the monitoring sketch below).
  • Were there any issues with the response? No.
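
Since the remediation hinged on a manual VACUUM that ran for over an hour, here is a hedged sketch of how such a run can be tracked from another session; pg_stat_progress_vacuum is a standard PostgreSQL view, but the exact monitoring used during the incident is not recorded here:

```sql
-- Sketch: monitor a long-running VACUUM from another session.
-- 'phase' moves through scanning heap / vacuuming indexes / vacuuming heap,
-- and index_vacuum_count increments once per index-cleanup pass.
SELECT p.pid,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total,
       p.index_vacuum_count
FROM   pg_stat_progress_vacuum AS p
JOIN   pg_class AS c ON c.oid = p.relid
WHERE  c.relname = 'ci_job_artifacts';
```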

MR Checklist

Consider these questions if a code change introduced the issue.

  • Was the MR acceptance checklist marked as reviewed in the MR? Yes.
  • Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? No; there are other corrective actions that should be taken instead. The MR that triggered the incident mainly stepped on an already existing landmine that would have been found sooner or later.

Timeline

All times UTC.

2021-11-22

  • 13:52 - @rehab declared the incident in Slack.
  • 14:11 - @igorwwwwwwwwwwwwwwwwwwww blocked one of the heavy endpoints via Cloudflare.
  • 14:12 - @rehab reverted the jupyter_clean_diffs feature flag to false.
  • 14:47 - @igorwwwwwwwwwwwwwwwwwwww enabled the refresh_widget route for the gitlab-com/runbooks project.
  • 16:09 - Feature flag ci_destroy_unlocked_job_artifacts set to false.
  • 16:35 - @abrandl started running VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts;.
  • 17:14 - VACUUM finished on index_ci_job_artifacts_on_job_id_and_file_type, which receives most of the read traffic among the indexes on this table.
  • 17:31 - Operations appeared to be back to normal for all customers.
  • 17:45 - VACUUM finished completely.
  • 20:00 - Auto-deploy confirmed working again and the incident was marked as resolved. Work to re-enable the disabled feature flags continued.

2021-11-23

  • 16:13 - Canary stage was undrained.
  • 23:30 - The Sidekiq job ExpireBuildArtifactsWorker was scheduled not to run: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1368 (merged).
  • 23:36 - Increased auto_explain sampling to 10% of queries, with a >1s duration threshold (see the config sketch after the timeline).

2021-11-24

  • 00:46 - Production deploy completed.
  • 01:17 - Post-deploy migrations in production complete.
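
The auto_explain change in the 2021-11-23 23:36 entry corresponds to two extension settings. A minimal sketch, assuming auto_explain is already loaded via shared_preload_libraries and that these are the parameters that were adjusted:

```sql
-- Sketch of the auto_explain settings implied by the timeline entry:
-- sample 10% of eligible queries, and only those running longer than 1s.
-- Values mirror the entry; the actual production config is not confirmed here.
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
ALTER SYSTEM SET auto_explain.sample_rate = 0.1;
SELECT pg_reload_conf();
```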

Root Cause Analysis

What went well

  • Because we had a feature flag on the MR that triggered the incident, we were able to stop the initial bleeding.
  • After investigation, we were able to vacuum the dead items successfully even though we weren't completely certain of the root cause.
  • We had recently added the debugging symbols back to Postgres, which allowed us to efficiently diagnose that the query planner itself (rather than the query executor) was running the expensive index scans (see the sketch below for another way planner-side reads can be surfaced).
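
As a complementary, hypothetical illustration of the planner-versus-executor distinction: on PostgreSQL 13 and later, EXPLAIN with the BUFFERS option reports buffer usage incurred during planning separately from executor buffer usage. The query below is made up for illustration and is not one captured during the incident:

```sql
-- Planning-time buffer reads (e.g. the planner probing an index for
-- actual min/max values) show up in a separate "Planning:" block,
-- distinct from the per-node executor "Buffers:" lines.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM   ci_job_artifacts
WHERE  job_id = 12345   -- hypothetical job_id
  AND  file_type = 1;   -- hypothetical file_type
```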

What can be improved

Tuning autovacuum, partitioning large tables, and pacing bulk data changes are the main proposed preventative measures.
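
A minimal sketch of what per-table autovacuum tuning can look like; the numbers are placeholders for illustration, not the thresholds chosen in the corrective-action issue:

```sql
-- Make autovacuum trigger much earlier on this large, busy table than
-- the global defaults would; values here are illustrative only.
ALTER TABLE ci_job_artifacts SET (
  autovacuum_vacuum_scale_factor  = 0.005,  -- ~0.5% dead tuples instead of the 20% default
  autovacuum_vacuum_threshold     = 1000,
  autovacuum_analyze_scale_factor = 0.005
);
```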

A future release of Postgres may give standby databases the same hint-based index scan efficiency that the primary uses, but that is a far-future benefit.

Several other ideas were discussed and ultimately discarded because they were infeasible, unreliable, or had undesirable side effects.

Corrective actions

  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14723 - Lower the autovacuum thresholds for the ci_job_artifacts table. Timeline: short term. Owner: infrastructure.
  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13891 - Improve our observability with pgwatch2. Timeline: short term. Owner: infrastructure.
  • &6211 - Partition big CI tables. Timeline: mid/long term. Owner: group::pipeline execution or group::testing.
  • #346980 (moved) - Review the autovacuum strategy for all types of tables holistically. Timeline: mid/long term. Owner: infrastructure.
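
For the partitioning item, a purely hypothetical sketch of declarative range partitioning on a CI table; the real partitioning scheme, key, and column set are decided in the linked epic, not here:

```sql
-- Hypothetical example of declarative range partitioning; the column
-- list and partition key are illustrative, not GitLab's actual design.
CREATE TABLE ci_job_artifacts_partitioned (
  id         bigint       NOT NULL,
  job_id     bigint       NOT NULL,
  file_type  smallint     NOT NULL,
  created_at timestamptz  NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE ci_job_artifacts_2021_11
  PARTITION OF ci_job_artifacts_partitioned
  FOR VALUES FROM ('2021-11-01') TO ('2021-12-01');
```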