RCA: Excessive buffer read in replicas for ci_job_artifacts

Summary

Service(s) affected: SaaS general performance, MR page errors, timeouts for some actions
Team attribution: group::testing
Minutes downtime or degradation: 219

Problem

A performance regression was caused by a frequently used index on the ci_job_artifacts table that had accumulated a large number of dead items which had not yet been vacuumed away. The dead items were produced by bulk updates on ci_job_artifacts. Index scans on the primary can mark dead index entries so that later scans skip them, but replicas cannot set these hints, so their index scans kept reading the dead items on every lookup until VACUUM removed them, which is why the excessive buffer reads showed up on the replicas.
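
As a hypothetical illustration of the kind of check involved (not the exact query used during the incident), dead-item accumulation and vacuum history for the table can be read from pg_stat_user_tables:

```sql
-- Sketch: inspect dead-tuple buildup and (auto)vacuum history for
-- ci_job_artifacts. Illustrative only; not the query used during
-- the incident.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup, 1), 2) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'ci_job_artifacts';
```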

Fix

  • Because we had a feature flag on the MR that triggered the incident, we were able to stop the initial bleeding.
  • After investigation, we were able to vacuum the dead items successfully even though we weren't completely certain of the root cause. Running VACUUM on ci_job_artifacts on the primary database cleaned up the dead index items; once the replicas replayed that change, their index scans immediately became efficient again and the replicas returned to a normal health state (see the sketch below).
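
For reference, a minimal sketch of the cleanup command (as quoted later in this review) plus an illustrative way to watch index buffer reads settle afterwards; the monitoring query is an assumption, not something recorded in the incident:

```sql
-- The cleanup run on the primary; replicas benefit once the cleaned-up
-- index pages are replicated.
VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts;

-- Illustrative follow-up: cumulative block reads for the table's
-- indexes. On the replicas, the growth rate of idx_blks_read should
-- drop sharply once the dead items are gone.
SELECT indexrelname, idx_blks_read, idx_blks_hit
FROM   pg_statio_user_indexes
WHERE  relname = 'ci_job_artifacts'
ORDER  BY idx_blks_read DESC;
```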

Impact & Metrics

Start with the following:

  • What was the impact? General performance degradation and timeouts for some actions.
  • Who was impacted? All users of GitLab SaaS.
  • How did this impact customers? Slow site performance; unable to merge changes through the normal merge button.

Patroni Apdex Index During Incident

[Chart: patroni-apdex]

Detection & Response

Start with the following:

  • When was the incident detected? 2021-11-22 12:52.
  • How was the incident detected? Team members began experiencing slow response times and Bad Gateway errors.
  • Did alarming work as expected? Yes.
  • How long did it take from the start of the incident to its detection? Unclear, but it was within 30 minutes.
  • How long did it take from detection to remediation? 219 minutes.
  • What steps were taken to remediate? Blocked a heavy endpoint via Cloudflare, set the ci_destroy_unlocked_job_artifacts feature flag to false, and ran VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts; (see the monitoring sketch below).
  • Were there any issues with the response? No.
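
Since the remediation hinged on a manual VACUUM that ran for over an hour, here is a hedged sketch of how such a run can be tracked from another session; pg_stat_progress_vacuum is a standard PostgreSQL view, but the exact monitoring used during the incident is not recorded here:

```sql
-- Sketch: monitor a long-running VACUUM from another session.
-- 'phase' moves through scanning heap / vacuuming indexes / vacuuming heap,
-- and index_vacuum_count increments once per index-cleanup pass.
SELECT p.pid,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total,
       p.index_vacuum_count
FROM   pg_stat_progress_vacuum AS p
JOIN   pg_class AS c ON c.oid = p.relid
WHERE  c.relname = 'ci_job_artifacts';
```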

MR Checklist

Consider these questions if a code change introduced the issue.

  • Was the MR acceptance checklist marked as reviewed in the MR? Yes.
  • Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? No; there are other corrective actions that should be taken instead. The MR that triggered the incident mainly stepped on an already existing landmine that would have been found sooner or later.

Timeline

All times UTC.

2021-11-22

  • 13:52 - @rehab declared the incident in Slack.
  • 14:11 - @igorwwwwwwwwwwwwwwwwwwww blocked one of the heavy endpoints via Cloudflare.
  • 14:12 - @rehab reverted the jupyter_clean_diffs feature flag to false.
  • 14:47 - @igorwwwwwwwwwwwwwwwwwwww enabled the refresh_widget route for the gitlab-com/runbooks project.
  • 16:09 - Feature flag ci_destroy_unlocked_job_artifacts set to false.
  • 16:35 - @abrandl started running VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts;.
  • 17:14 - VACUUM finished on index_ci_job_artifacts_on_job_id_and_file_type, which receives most of the read traffic among the indexes on this table.
  • 17:31 - Operations appeared to be back to normal for all customers.
  • 17:45 - VACUUM finished completely.
  • 20:00 - Auto-deploy confirmed working again and the incident was marked as resolved. Work to re-enable the disabled feature flags continued.

2021-11-23

  • 16:13 - Canary stage was undrained.
  • 23:30 - The Sidekiq job ExpireBuildArtifactsWorker was scheduled not to run: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1368 (merged).
  • 23:36 - Increased auto_explain sampling to 10% of queries, with a >1s duration threshold (see the config sketch after the timeline).

2021-11-24

  • 00:46 - Production deploy completed.
  • 01:17 - Post-deploy migrations in production complete.
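
The auto_explain change in the 2021-11-23 23:36 entry corresponds to two extension settings. A minimal sketch, assuming auto_explain is already loaded via shared_preload_libraries and that these are the parameters that were adjusted:

```sql
-- Sketch of the auto_explain settings implied by the timeline entry:
-- sample 10% of eligible queries, and only those running longer than 1s.
-- Values mirror the entry; the actual production config is not confirmed here.
ALTER SYSTEM SET auto_explain.log_min_duration = '1s';
ALTER SYSTEM SET auto_explain.sample_rate = 0.1;
SELECT pg_reload_conf();
```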

Root Cause Analysis

What went well

  • Because we had a feature flag on the MR that triggered the incident, we were able to stop the initial bleeding.
  • After investigation, we were able to vacuum the dead items successfully even though we weren't completely certain of the root cause.
  • We had recently added the debugging symbols back to Postgres, which allowed us to efficiently diagnose that the query planner itself (rather than the query executor) was running the expensive index scans (see the sketch below for another way planner-side reads can be surfaced).
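
As a complementary, hypothetical illustration of the planner-versus-executor distinction: on PostgreSQL 13 and later, EXPLAIN with the BUFFERS option reports buffer usage incurred during planning separately from executor buffer usage. The query below is made up for illustration and is not one captured during the incident:

```sql
-- Planning-time buffer reads (e.g. the planner probing an index for
-- actual min/max values) show up in a separate "Planning:" block,
-- distinct from the per-node executor "Buffers:" lines.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM   ci_job_artifacts
WHERE  job_id = 12345   -- hypothetical job_id
  AND  file_type = 1;   -- hypothetical file_type
```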

What can be improved

Tuning autovacuum, partitioning large tables, and pacing bulk data changes are the main proposed preventative measures.
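
A minimal sketch of what per-table autovacuum tuning can look like; the numbers are placeholders for illustration, not the thresholds chosen in the corrective-action issue:

```sql
-- Make autovacuum trigger much earlier on this large, busy table than
-- the global defaults would; values here are illustrative only.
ALTER TABLE ci_job_artifacts SET (
  autovacuum_vacuum_scale_factor  = 0.005,  -- ~0.5% dead tuples instead of the 20% default
  autovacuum_vacuum_threshold     = 1000,
  autovacuum_analyze_scale_factor = 0.005
);
```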

A future release of Postgres may give standby databases the same hint-based index scan efficiency that the primary uses, but that is a far-future benefit.

Several other ideas were discussed and ultimately discarded because they were infeasible, unreliable, or had undesirable side effects.

Corrective actions

  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14723 - Lower the autovacuum thresholds for the ci_job_artifacts table. Timeline: short term. Owner: infrastructure.
  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13891 - Improve our observability with pgwatch2. Timeline: short term. Owner: infrastructure.
  • &6211 - Partition big CI tables. Timeline: mid/long term. Owner: group::pipeline execution or group::testing.
  • #346980 (moved) - Review the autovacuum strategy for all types of tables holistically. Timeline: mid/long term. Owner: infrastructure.
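
For the partitioning item, a purely hypothetical sketch of declarative range partitioning on a CI table; the real partitioning scheme, key, and column set are decided in the linked epic, not here:

```sql
-- Hypothetical example of declarative range partitioning; the column
-- list and partition key are illustrative, not GitLab's actual design.
CREATE TABLE ci_job_artifacts_partitioned (
  id         bigint       NOT NULL,
  job_id     bigint       NOT NULL,
  file_type  smallint     NOT NULL,
  created_at timestamptz  NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE ci_job_artifacts_2021_11
  PARTITION OF ci_job_artifacts_partitioned
  FOR VALUES FROM ('2021-11-01') TO ('2021-12-01');
```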