Skip to content

Investigate buffer_mapping contention issue from 2021-11-22 incident

We had a database incident that caused an severity1 outage in gitlab-com/gl-infra/production#5952 (closed).

This issue attempts to discuss possible theories and track reproduction steps/experiments for that issue.

Participants

Below is a list of team members who will be participating in this investigation and identification of root cause.

role name region notes
Management lead @shampton AMER DRI of overall effort. Provide updates in R&S standup, etc.
Development lead @grzesiek EMEA Investigate technical root cause
Development lead @stanhu AMER Investigate technical root cause
Development lead @engwan APAC Investigate technical root cause
Development lead @abrandl EMEA Investigate technical root cause. Andreas is evaluating the extent of his involvement with his manager.
2021-11-25 R&S standup update @grzesiek EMEA
2021-11-30 R&S standup update @cdu1 AMER
2021-12-02 R&S standup update @alexives AMER
after 2021-12-02 R&S standup updates @shampton AMER

Summary

  1. Many old rows in ci_job_artifacts were deleted, but since there are over 1.4 billion rows in that table, the autovacuum threshold (7+ million rows) was not reached. As a result, there were a lot of dead tuples with low job_id values.
  2. These old rows need to be scanned when the query planner calculates min(job_id).
  3. We now understand why the excessive buffer read only happens to replicas. When planning a min(job_id) query on primaries, the primary opportunistically marks dead tuples with a LP_DEAD hint bit. This significantly reduces future buffer reads, but these hint bits are ignored by replicas at the moment. There is a PostgreSQL patch in review that should fix this, but the earliest it could land is PostgreSQL 15.
  4. Based on the analysis above, an index vacuum would avoid the excessive buffer read.
  5. Suggested corrective actions -
    1. Short term: Lower the autovacuum thresholds for ci_job_artifacts table: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14723
    2. Short term: Improve our observability with pgwatch2: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13891#note_741922183
    3. Mid/Long term: Partition big CI tables (related links below)
      • Draft: Add the CI/CD data time-decay blueprint - !70052 (merged)
      • Reduce tables sizes to < 100 GB per physical table - &6211
    4. Mid/Long term: Review autovacuum strategy for all types of tables holistically: #346980 (moved)
Edited by Alex Ives