Detect and react to imminent PostgreSQL transaction ID exhaustion
Release notes
Problem to solve
It's possible for a self-managed GitLab PostgreSQL database to exhaust the ~2 billion available transaction IDs, at which point the database goes read-only.
This was the root cause of a customer emergency (GitLab team members can read more in the ticket):
- An index was corrupted on the `ci_builds` table.
- This table has about 12 million dead tuples.
- When ordering tables by dead tuples `desc`, the next table in the list had orders of magnitude fewer dead tuples.
- We assume that this table had managed to pin the majority of the 2 billion transaction IDs across its live and dead tuples.
- We identified the issue by manually running `vacuum freeze verbose public.ci_builds;` shortly after starting up the database, before `autovacuum` had acquired the lock.
- PostgreSQL prioritised processing the indexes over the table itself (see output below) but terminated when the corrupt index was found. I've modified the output to substitute `1234` for the actual number in case this exposes information about the customer's use of the product. The ticket has the real number.
- A `vacuum` of the table itself was therefore never occurring.
- We suspect this had been running in a loop since the corruption occurred, and that nothing is logged.
```
gitlabhq_production=# vacuum freeze verbose public.ci_builds;
INFO: aggressively vacuuming "public.ci_builds"
INFO: scanned index "index_ci_builds_on_auto_canceled_by_id" to remove 1234 row versions
DETAIL: CPU: user: 0.87 s, system: 0.39 s, elapsed: 6.41 s
INFO: scanned index "index_ci_builds_on_commit_id_and_stage_idx_and_created_at" to remove 1234 row versions
DETAIL: CPU: user: 0.67 s, system: 0.20 s, elapsed: 4.68 s
INFO: scanned index "index_ci_builds_on_commit_id_and_status_and_type" to remove 1234 row versions
DETAIL: CPU: user: 0.82 s, system: 0.60 s, elapsed: 12.00 s
INFO: scanned index "index_ci_builds_on_commit_id_and_type_and_name_and_ref" to remove 1234 row versions
DETAIL: CPU: user: 1.10 s, system: 1.13 s, elapsed: 25.79 s
INFO: scanned index "index_ci_builds_on_commit_id_and_type_and_ref" to remove 1234 row versions
DETAIL: CPU: user: 0.80 s, system: 0.57 s, elapsed: 27.54 s
INFO: scanned index "index_ci_builds_on_status_and_type_and_runner_id" to remove 1234 row versions
DETAIL: CPU: user: 0.95 s, system: 0.37 s, elapsed: 18.68 s
INFO: scanned index "index_ci_builds_on_token_encrypted" to remove 1234 row versions
DETAIL: CPU: user: 1.54 s, system: 0.39 s, elapsed: 19.64 s
INFO: scanned index "index_ci_builds_on_updated_at" to remove 1234 row versions
DETAIL: CPU: user: 0.92 s, system: 0.35 s, elapsed: 20.12 s
INFO: scanned index "index_ci_builds_on_user_id" to remove 1234 row versions
DETAIL: CPU: user: 0.99 s, system: 0.35 s, elapsed: 18.32 s
INFO: scanned index "index_ci_builds_project_id_and_status_for_live_jobs_partial2" to remove 1234 row versions
DETAIL: CPU: user: 0.18 s, system: 0.08 s, elapsed: 3.87 s
INFO: scanned index "partial_index_ci_builds_on_scheduled_at_with_scheduled_jobs" to remove 1234 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "index_ci_builds_on_upstream_pipeline_id" to remove 1234 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "ci_builds_gitlab_monitor_metrics" to remove 1234 row versions
DETAIL: CPU: user: 0.89 s, system: 0.46 s, elapsed: 20.94 s
ERROR: failed to re-find parent key in index "index_ci_builds_on_project_id_and_name_and_ref" for deletion target page 13652
```
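The underlying risk can be quantified: PostgreSQL's transaction ID counter wraps at 2^31 (~2.1 billion), and `age(datfrozenxid)` reports how much of that space a database has consumed since its last aggressive vacuum. A minimal sketch of how the headroom check might look (the SQL in the comment is standard PostgreSQL; the Ruby helper and its name are illustrative, not an existing GitLab API):

```ruby
# The transaction ID space PostgreSQL can address before wraparound.
# PostgreSQL intervenes before this point; the hard stop comes roughly
# one million transactions short of the limit.
XID_LIMIT = 2**31

# Returns the fraction of the transaction ID space consumed, given the
# result of running (via psql or ActiveRecord):
#   SELECT max(age(datfrozenxid)) FROM pg_database;
def wraparound_fraction(max_datfrozenxid_age)
  max_datfrozenxid_age.to_f / XID_LIMIT
end

# Example: a database whose oldest datfrozenxid is 1.7 billion
# transactions old has consumed about 79% of the ID space.
wraparound_fraction(1_700_000_000).round(2) # => 0.79
```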
Proposal
Rails should
- Check periodically, or at startup, whether the database is approaching transaction ID wraparound
- A periodic check is probably better, as this check can then also turn off the warnings once the danger has passed
If a threshold is passed, then prior to PostgreSQL intervening, GitLab should take steps automatically, such as:
- Generate a message in the UI to administrators alerting them to the problem (eg: at 80%)
- Generate a message in the UI to all users alerting them to the problem (eg: at 90%)
- Switch the instance to read-only (eg: at 95%)
- Additionally, GitLab should log the issue at each of these steps.
I suggest this feature should not operate on GitLab.com or, if it does, only generate the administrator warning.
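The escalation above amounts to a simple mapping from the consumed fraction of the transaction ID space to an action. A sketch, using the percentages proposed above (the method name and action symbols are illustrative, not an existing GitLab API):

```ruby
# Maps the consumed fraction of PostgreSQL's ~2 billion transaction ID
# space to the escalating responses proposed in this issue. Thresholds
# follow the 80%/90%/95% examples above.
def wraparound_action(fraction_consumed)
  if fraction_consumed >= 0.95
    :switch_to_read_only   # last resort, before PostgreSQL forces it
  elsif fraction_consumed >= 0.90
    :warn_all_users
  elsif fraction_consumed >= 0.80
    :warn_administrators
  else
    :none                  # also clears any previously shown warnings
  end
end

wraparound_action(0.85) # => :warn_administrators
wraparound_action(0.97) # => :switch_to_read_only
```

Keeping the mapping a pure function of the measured fraction makes it straightforward for the periodic check to both raise and clear warnings as the value moves across thresholds.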
GitLab should take these actions before PostgreSQL comes within one million transactions of wraparound, at which point it emits:
```
WARNING: database "gitlabhq_production" must be vacuumed within 1000000 transactions
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
```
Intended users
Feature Usage Metrics
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.