Draft: Procedure for fixing index corruption caused by glibc collation change
- Warnings and issues
- Summary
- Root cause
- Geo considerations
- Concurrent reindexing, and NOT VALID _ccnew indexes.
- Fix non-unique issues and scope the problem
- Fix unique indexes
- Table specific procedures
Warnings and issues
- Use of the procedures in this issue requires knowledge of the GitLab database schema, how GitLab stores its data, and the relationships between tables. The procedures are not exhaustive.
- This issue is written for GitLab Support, as guidance for how to resolve this issue.
- Customers with subscriptions must not follow this process independently, since direct modification of the GitLab database and data is out of scope for support.
- To do: `pm_packages` and its related tables are too large for the FK update queries to complete in a reasonable timeframe. It looks like PostgreSQL picks a table scan query plan. A batching strategy needs to be devised.
- To do: handle duplicates on `deployment_merge_requests` (environments procedure). `deployment_merge_requests` has a composite primary key, not an `id` column, so the `create temp table duplicates_two` operation fails, preventing the work from completing.
Summary
If you are up to speed with this problem and just want the key steps:
- Reindex the database ('Code for reindex table by table')
- Analyse the output and identify which unique indexes failed
- Fix the affected tables. ('Table specific procedures')
Problem description
`glibc` collation changed in version 2.28. That is, things started sorting differently.
Consequence: indexes are structured differently on servers running earlier versions compared to later versions.
For example: Ubuntu 18.04 and RHEL 7/CentOS 7 used the earlier version; all later releases use the later version.
Customers should follow our documented procedure for upgrading. Ideally, a dump and restore of PostgreSQL should be performed, but if that isn't an option, a full reindex MUST be performed immediately. GitLab must not be allowed to write to the database until this is done.
In-place upgrades, or any other mechanism that preserves the on-disk binary PostgreSQL data files, will otherwise result in index corruption, intermittent issues with data not being accessible, and other problems such as incorrect state because the original records can't be found.
Analogy: It's like the order in the index at the back of a book has changed.
- In earlier versions, indexes started at `J` and ran to `Z`, followed by `A` to `I`. In later releases, indexes are written `A` to `Z`.
- When PostgreSQL on a newer system accesses an old index, either to read or write any of the index entries from `A` to `I`, it expects to find them at the front of the index.
- The index entries for `A` to `I` continue to exist at the end of the index, but they're no longer used because they're in the wrong place.
- Queries for data that use the index therefore find no results. Hypothetically, this could mean that all users whose names start with `A` to `I` cease to exist, for example. (A sketch illustrating this effect follows the list.)
- When new entries are written to the table, the front of the index is updated.
- Unique indexes are the biggest issue since they place a constraint on the table that assures uniqueness. However, because `A` to `I` are not in the right place in the index, new duplicate records are inserted into the table, and indexed in the right place. The original records are still present in the table.
  - Non-unique indexes can be rebuilt later on: all the hidden `A` to `I` rows will be found and correctly indexed alongside the new rows.
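As a hedged illustration of the effect (table and column names are placeholders, not a specific GitLab table), comparing an index lookup with a forced sequential scan can reveal rows the corrupt index no longer returns; the same planner settings are used later in this procedure.

```sql
-- Hedged sketch: if the two counts differ, the index is returning
-- incomplete results. Table and column names are placeholders.
SELECT count(*) FROM environments WHERE name = 'production';  -- may use the corrupt index

SET enable_indexscan TO off;
SET enable_bitmapscan TO off;
SELECT count(*) FROM environments WHERE name = 'production';  -- forced sequential scan

RESET enable_indexscan;
RESET enable_bitmapscan;
```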
This issue will address both non-unique indexes and unique indexes.
Unique indexes
Unique indexes will fail to rebuild until the duplicate records are removed.
The main unique indexes are table primary keys: most of the 500+ GitLab tables have one. These are mostly or all numeric, and are not affected.
Generally, the affected unique indexes are for strings like email addresses, environment names, branches etc.
Foreign key relationships.
The non-unique records propagate to other tables. This is the hardest part to deal with, since data loss can occur if it's not handled correctly.
For example: when a customer is using environments in their CI, the primary key for the record on the `environments` table is written into one of the fields in the `pipelines` table.
Example
This is simplified: the affected index on the `pipelines` table includes additional columns, and other columns in these tables are not shown.
environments
id | environment_name | state |
---|---|---|
77 | production | down |
1239 | production | up |
pipelines
id | project_id | environment_id |
---|---|---|
58 | 25 | 77 |
66 | 25 | 77 |
89 | 25 | 77 |
96 | 25 | 77 |
257 | 25 | 1239 |
263 | 25 | 1239 |
To ensure that all environment names are unique, one of the two records needs to be deleted.
For the purposes of this example, there's a foreign key constraint between the two tables that specifies:
- Any `pipelines.environment_id` value must be present in the `environments.id` column.
- If an environment record is deleted, the corresponding pipeline records must also be deleted.
It's therefore necessary to:
- 'pick a winner' on the `environments` table, for example `1239`
- update the `pipelines` table to replace references to `77` with `1239`
- then `77` can be deleted (a SQL sketch of these steps follows this list).
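For illustration only, a minimal SQL sketch of those three steps using the simplified example above; real GitLab tables have more columns and more foreign keys, so the table-specific procedures later in this issue should be followed instead.

```sql
-- Hedged sketch for the simplified example only: keep environment 1239,
-- repoint the pipelines that reference the duplicate 77, then delete 77.
BEGIN;

UPDATE pipelines
   SET environment_id = 1239
 WHERE environment_id = 77;

DELETE FROM environments
 WHERE id = 77;

COMMIT;
```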
Some of the tables have multiple foreign keys, and consideration also has to be given to other data stored in the records. In the example, state information is stored about environments: picking `77` would cause the environment to switch from `up` to `down`.
So, generally the most recent record is picked:
- The current row, and all its attributes, is preserved.
- Analysis for this change will take some time, and while that's happening more records get written.
- If the change is made with GitLab up, more records get written. It may be impossible to stop new records from being written to other tables, and so the more recent record (the one the index knows about) is picked.
- Downside: this can cause a lot of writes to large tables, causing disk IO and other potential issues.
Root cause
- In-place upgrade of Linux from Ubuntu 18.04 to a later release, or CentOS 7 to a later release, or of another distribution from a release that used `glibc` 2.27 or earlier to a subsequent release.
- Switching containerised PostgreSQL between base container images that use incompatible `glibc` releases.
- Using Geo, Patroni, or any other PostgreSQL replication to migrate between Linux distribution versions. PostgreSQL replication is binary, and so the original `glibc` collation/sort order is preserved.
Geo considerations
PostgreSQL replication that underpins Geo will replicate these changes: if secondary sites are not also upgraded to the later `glibc` version, this should be done ASAP, since the older `glibc` is equally incompatible with the new collation order.
Concurrent reindexing, and NOT VALID _ccnew indexes.
When GitLab is running, indexes must be (re)created `CONCURRENTLY`.
If this isn't done, writes to the table are blocked while the index is rebuilt.
Internally, PostgreSQL achieves this by:
- Creating a new index in parallel, named the same as the original index with the suffix `_ccnew` (if this already exists, a number is appended).
- When the new index is built, the table is very briefly made read only to swap the indexes around.
- The old index is deleted.
If the operation is cancelled because the session is terminated (eg: SSH timeout) or because one of the indexes fails to build, the `_ccnew` indexes are left behind in a NOT VALID state.
They need to be removed.
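A minimal sketch for finding the leftovers, assuming they show up as invalid indexes in `pg_index` with the `_ccnew` suffix; the index name in the `DROP` statement is a placeholder.

```sql
-- Hedged sketch: list indexes PostgreSQL has marked invalid, which should
-- include any _ccnew leftovers from interrupted concurrent (re)indexing.
SELECT n.nspname AS schema_name, c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid
  AND c.relname LIKE '%_ccnew%';

-- Each leftover can then be dropped; the name below is a placeholder.
DROP INDEX CONCURRENTLY public.some_index_ccnew;
```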
Fix non-unique issues and scope the problem
The first step is to re-index the whole database; this can be done immediately.
This will:
- Rebuild all non-unique indexes, addressing the issue of records not being returned by those indexes.
- Rebuild any unique indexes that can be rebuilt.
- Identify which unique indexes cannot be rebuilt, scoping what other work needs doing.
On the face of it, the following would be sufficient, but this command fails on the first invalid index.
REINDEX DATABASE CONCURRENTLY gitlabhq_production;
Instead, use the code below, which re-indexes each table in sequence; `reindex.out` can then be analysed to identify failures.
- Run this on the server that runs the PostgreSQL instance: the GitLab server for a single-node install, or the Patroni leader otherwise. If the environment uses Geo, run it on the primary site. All changes will be replicated by PostgreSQL to Patroni replicas and Geo secondary sites.
- The output will list the tables as the reindex progresses, with errors for each table when they occur.
- This operation must be protected and allowed to complete. Use `screen` or some other mechanism to ensure your session is not terminated by SSH timeouts, network glitches etc. If the session drops, you'll need to `DROP` the partially recreated `_ccnew` indexes it will leave behind.
Code for reindex table by table
Run this in a `screen` or `tmux` session to ensure it completes.
echo "SELECT concat_ws('.',schemaname,tablename) FROM pg_catalog.pg_tables where schemaname in ('public','gitlab_partitions_static','gitlab_partitions_dynamic')" | gitlab-psql | \
egrep '^ public|^ gitlab' | \
while read t
do
echo "#### TABLE: $t"
echo "SET statement_timeout = 0; REINDEX TABLE CONCURRENTLY $t"| gitlab-psql 2>&1
done | \
tee -a reindex.out
Fix unique indexes
There are detailed steps for each affected index/table, to account for the different foreign key relationships and for how the product features that use that data behave.
Reads over the affected table need to bypass the affected index, since it is unable to return accurate results, so index scanning is disabled as needed:
set enable_indexscan TO off;
set enable_bitmapscan TO off;
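To confirm the planner is actually bypassing the index after these settings, a hedged check with EXPLAIN; the query uses the simplified environments example from earlier.

```sql
-- The plan should show a Seq Scan on the table rather than an
-- Index Scan / Bitmap Heap Scan driven by the corrupt index.
EXPLAIN SELECT * FROM environments WHERE name = 'production';
```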
The general process is:
- Figure out the scale of duplication on the affected table (a query sketch follows this list).
- Figure out which foreign keys are also affected; that is, how many records in other tables will need modifying first, and which tables (if any) have no affected records.
- Fix the records in the other tables.
  - The non-uniqueness is expected to comprise pairs of rows, the old record and the new one: one created when the old `glibc` was in use, and one for the new `glibc`.
  - If there are additional duplicates, analysis is required to work out why. If the unique index shows as NOT VALID then it might not be enforcing uniqueness at all.
- Verify no more records on other tables need fixing.
- Delete records on the table to restore uniqueness.
- Reindex the table.
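As a hedged sketch of the first step, assuming the affected unique index is over a single text column called `name` (real affected indexes are often composite, so adjust the `GROUP BY` accordingly; the table name is a placeholder):

```sql
-- Hedged sketch: count duplicate values that should have been unique,
-- bypassing the corrupt index so the hidden rows are included.
SET enable_indexscan TO off;
SET enable_bitmapscan TO off;

SELECT name, count(*) AS copies, array_agg(id ORDER BY id) AS ids
FROM environments
GROUP BY name
HAVING count(*) > 1
ORDER BY copies DESC;
```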
Zooming out, there are tables which are easier to fix, for example because they have no foreign keys, and other tables with a lot of foreign keys that are harder to fix.
Reduce the number of affected tables as quickly as possible by doing the easy tables first.
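To see which tables reference an affected table via foreign keys, and so would need fixing first, a sketch against the system catalogs; the table name is an example.

```sql
-- Hedged sketch: list foreign keys that reference the affected table.
-- confdeltype is the ON DELETE behaviour: 'c' = cascade, 'a' = no action,
-- 'r' = restrict, 'n' = set null, 'd' = set default.
SELECT conrelid::regclass AS referencing_table,
       conname            AS constraint_name,
       confdeltype        AS on_delete
FROM pg_constraint
WHERE contype = 'f'
  AND confrelid = 'environments'::regclass;
```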
Table specific procedures
`merge_request_diff_commit_users` table
Follow thread below for instructions for fixing this table.
`topics` table
Follow thread below for instructions for fixing this table.
`ci_refs` table
Follow thread below for instructions for fixing this table.
`ci_resource_groups` table
Follow thread below for instructions for fixing this table.
`environments` table
Follow thread below for instructions for fixing this table.
`pm_packages` table
Follow thread below for instructions for fixing this table.
`pm_affected_packages` table
Follow thread below for instructions for fixing this table.
`sbom_components` table
Follow thread below for instructions for fixing this table.
`container_registries` table
Follow thread below for instructions for fixing this table.