Draft: Procedure for fixing index corruption caused by glibc collation change
- Warnings and issues
- Summary
- Root cause
- Geo considerations
- Concurrent reindexing, and NOT VALID _ccnew indexes.
- Fix non-unique issues and scope the problem
- Fix unique indexes
- Table specific procedures
Warnings and issues
- The procedures in this issue require knowledge of the GitLab database schema, how GitLab stores its data, and the relationships between tables; they are not exhaustive.
- This issue is written for GitLab Support, as guidance for how to resolve this issue.
- Customers with subscriptions must not follow this process independently, since direct modification of the GitLab database and data is out of scope for support.
To do:
- `pm_packages` and its related tables are too large for the FK update queries to complete in a reasonable timeframe. It looks like PostgreSQL picks a table scan query plan. A batching strategy needs to be devised.
- Handle duplicates on `deployment_merge_requests` in the environments procedure: `deployment_merge_requests` has a composite primary key, not an `id` column, so the `create temp table duplicates_two` operation fails, preventing the work from completing.
Summary
If you are up to speed with this problem and just want the key steps:
- Reindex the database ('Code for reindex table by table')
- Analyse the output and identify which unique indexes failed
- Fix the affected tables. ('Table specific procedures')
Problem description
glibc collation changed in version 2.28; that is, strings started sorting differently.
Consequence: indexes are structured differently on servers running earlier glibc versions compared to later versions.
For example: Ubuntu 18.04 and RHEL 7/CentOS 7 used the earlier version; all later releases use the later version.
Customers should follow our documented procedure for upgrading. Ideally, a dump and restore of PostgreSQL should be performed, but if that isn't an option, a full reindex MUST be performed immediately. GitLab must not be allowed to write to the database until this is done.
In-place upgrades, or any other mechanism that preserves the on-disk binary PostgreSQL data files, will otherwise result in index corruption, intermittent issues with data not being accessible, and other issues such as incorrect states because the original records can't be found.
Analogy: it's like the order of the index at the back of a book has changed.
- In earlier versions, indexes started at `J` and ran to `Z`, followed by `A` to `I`. In later releases, indexes are written `A` to `Z`.
- When PostgreSQL on a newer system accesses an old index, either to read or write any of the index entries from `A` to `I`, it expects to find them at the front of the index.
- The index entries for `A` to `I` continue to exist at the end of the index, but they're no longer used because they're in the wrong place.
- Queries for data that use the index therefore find no results. Hypothetically, this could mean that all users starting `A` to `I` cease to exist, for example.
- When new entries are written to the table, the front of the index is updated.
  - Unique indexes are the biggest issue, since they place a constraint on the table that assures uniqueness. However, because `A` to `I` are not in the right place in the index, new duplicate records are inserted into the table and indexes in the right place. The original records are still present in the table.
  - Non-unique indexes can be rebuilt later on: all the hidden `A` to `I` rows will be found and correctly indexed alongside the new rows.
This issue will address both non-unique indexes and unique indexes.
Unique indexes
Unique indexes will fail to rebuild until the duplicate records are removed.
The main unique indexes are table primary keys: most of the 500+ GitLab tables have one. These are mostly or all numeric, and are not affected.
Generally, the affected unique indexes are for strings like email addresses, environment names, branches etc.
Foreign key relationships.
The duplicate (non-unique) records propagate to other tables, and this is the hardest part to deal with, since data loss can occur if it's not handled correctly.
For example: when a customer is using environments in their CI, the primary key for the record on the environments table is written into one of the fields in the pipelines table.
Example
This is simplified: the actual affected index on the pipelines table includes other columns as well, and other columns are also not shown.
environments
| id | environment_name | state |
|---|---|---|
| 77 | production | down |
| 1239 | production | up |
pipelines
| id | project_id | environment_id |
|---|---|---|
| 58 | 25 | 77 |
| 66 | 25 | 77 |
| 89 | 25 | 77 |
| 96 | 25 | 77 |
| 257 | 25 | 1239 |
| 263 | 25 | 1239 |
To ensure that all environment names are unique, one of the two records needs to be deleted.
For the purposes of this example, there's a foreign key constraint between the two tables that specifies:
- Any `pipelines.environment_id` value must be present in the `environments.id` column.
- If an environment record is deleted, the corresponding pipeline must also be deleted.

It's therefore necessary to:
- 'pick a winner' on the `environments` table, for example: `1239`
- update the `pipelines` table to replace references to `77` with `1239`
- then delete `77`.
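Sketched in SQL, the three steps for the worked example above might look like this. The table and column names match the simplified example only; real GitLab tables have more columns and more foreign keys, so this is illustrative, not a ready-to-run fix.

```sql
BEGIN;

-- 1. 'Pick a winner': keep the most recent record, id 1239.

-- 2. Re-point child rows that reference the losing duplicate.
UPDATE pipelines
SET environment_id = 1239
WHERE environment_id = 77;

-- 3. With nothing referencing it, the losing duplicate can go.
DELETE FROM environments
WHERE id = 77;

COMMIT;
```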
Some of the tables have multiple foreign keys, and consideration also has to be given to other data stored in the records. In the example, state information is stored about environments. Picking 77 would cause the environment to switch from up to down.
So, generally the most recent record is picked:
- The current row, and all its attributes, is preserved.
- Because analysis for this change will take some time, and while that's happening more records get written.
- If the change is made with GitLab up, more records get written. It may be impossible to stop the new records from being written to other tables, and so the more recent record (that the index knows about) is picked.
- Downside: this can cause a lot of writes to large tables - causing disk IO and other potential issues.
Root cause
- In-place upgrade of Linux from Ubuntu 18.04 to a later release, or CentOS 7 to a later release, or any other distribution move from a release that used `glibc` 2.27 or earlier to a subsequent release.
- Switching containerised PostgreSQL between base container images that use incompatible `glibc` releases.
- Using Geo, Patroni, or any other PostgreSQL replication to migrate between Linux distribution versions. PostgreSQL replication is binary, and so the original `glibc` collation/sort order is preserved.
Geo considerations
PostgreSQL replication, which underpins Geo, will replicate these changes. If secondary sites are not also upgraded to the later glibc version, this should be done as soon as possible, since the older glibc is equally incompatible with indexes built under the new version.
Concurrent reindexing, and NOT VALID _ccnew indexes.
When GitLab is running, indexes must be (re)created CONCURRENTLY.
Writes to the table are prevented if this isn't done.
Internally, PostgreSQL achieves this by
- Creating a new index in parallel, named the same as the original index with the suffix `_ccnew` (if that name already exists, a number is appended).
- When the new index is built, the table is made read-only very briefly to swap the indexes around.
- The old index is deleted.
If the operation is cancelled because the session is terminated (eg: SSH timeout) or because one of the indexes fails, the _ccnew indexes are left behind in NOT VALID state.
They need to be removed.
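Leftover invalid indexes can be located via the PostgreSQL catalogs; the following is a sketch (the index name in the final `DROP INDEX` is hypothetical, substitute each name the query returns):

```sql
-- Find leftover invalid indexes carrying the _ccnew suffix
SELECT n.nspname AS schema, c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid
  AND c.relname LIKE '%\_ccnew%';

-- Drop each one; CONCURRENTLY avoids blocking writes.
-- (hypothetical index name for illustration)
DROP INDEX CONCURRENTLY public.index_environments_on_name_ccnew;
```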
Fix non-unique issues and scope the problem
The first step is to reindex the whole database; this can be done immediately.
This will:
- Rebuild all non-unique indexes, addressing the issue of records not being returned by those indexes.
- Rebuild any unique indexes that can be rebuilt.
- Identify which unique indexes cannot be rebuilt, scoping what other work needs doing.
On the face of it, the following would be sufficient, but this command fails on the first invalid index.
REINDEX DATABASE CONCURRENTLY gitlabhq_production;
Instead, use the code below, which reindexes each table in sequence; reindex.out is then analysed to identify failures.
- Run this on the server that runs the PostgreSQL instance: the GitLab server if it's a single-node install, or the Patroni leader otherwise. If the environment uses Geo, run it on the primary site. All changes will be replicated by PostgreSQL to Patroni replicas and Geo secondary sites.
- The output will list the tables as the reindex progresses, with errors for each table when they occur.
- This operation must be protected and allowed to complete. Use `screen` or some other mechanism to ensure your session is not terminated by SSH timeouts, network glitches, etc. If the session drops, you'll need to `DROP` the partially recreated `_ccnew` indexes it leaves behind.
Code for reindex table by table
Run this in a screen or tmux session to ensure it completes.
echo "SELECT concat_ws('.',schemaname,tablename) FROM pg_catalog.pg_tables where schemaname in ('public','gitlab_partitions_static','gitlab_partitions_dynamic')" | gitlab-psql | \
egrep '^ public|^ gitlab' | \
while read t
do
echo "#### TABLE: $t"
echo "SET statement_timeout = 0; REINDEX TABLE CONCURRENTLY $t"| gitlab-psql 2>&1
done | \
tee -a reindex.out
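As a cross-check on reindex.out, the catalogs can also be queried for unique indexes that are still invalid after the run; a sketch:

```sql
-- Unique indexes that failed to rebuild remain flagged invalid
SELECT n.nspname AS schema,
       t.relname AS table_name,
       c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_class t ON t.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid
  AND i.indisunique
ORDER BY 1, 2;
```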
Fix unique indexes
There are detailed steps for each affected index/table, to account for different foreign key relationships and for how the product features that use the data behave.
Reads over the affected table need to bypass the affected index, since it cannot return accurate results, so index scanning is disabled as needed:
set enable_indexscan TO off;
set enable_bitmapscan TO off;
The general process is:
- Figure out the scale of duplication on the affected table.
- Figure out which foreign keys are also affected; that is, how many records in other tables will need modifying first, and which tables (if any) have no affected records.
- Fix the records in the other tables.
  - The non-uniqueness is expected to comprise pairs of rows: the old record and the new one, one created when the old `glibc` was in use and one under the new `glibc`.
  - If there are additional duplicates, analysis is required to work out why. If the unique index shows as `NOT VALID` then it might not be enforcing uniqueness at all.
- Verify no more records on other tables need fixing.
- Delete records on the table to restore uniqueness.
- Reindex the table.
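For the first step, the scale of duplication can be measured with a grouping query over the columns of the affected unique index. A sketch for the environments example used earlier; it assumes the unique index covers `project_id` and `name`, so adapt the column list to the actual affected index:

```sql
-- Bypass the corrupt index so the hidden old rows are visible
SET enable_indexscan TO off;
SET enable_bitmapscan TO off;

-- Group by the unique index's columns; any group with more than
-- one row is a duplicate set that must be resolved
SELECT project_id, name, count(*) AS copies,
       array_agg(id ORDER BY id) AS ids
FROM environments
GROUP BY project_id, name
HAVING count(*) > 1
ORDER BY copies DESC;
```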
Zooming out, there are tables which are easier to fix, for example because there are no foreign keys, and other tables with many foreign keys that are harder to fix.
Reduce the number of affected tables as quickly as possible by doing the easy tables first.
Table specific procedures
merge_request_diff_commit_users table
Follow thread below for instructions for fixing this table.
topics table
Follow thread below for instructions for fixing this table.
ci_refs table
Follow thread below for instructions for fixing this table.
ci_resource_groups table
Follow thread below for instructions for fixing this table.
environments table
Follow thread below for instructions for fixing this table.
pm_packages table
Follow thread below for instructions for fixing this table.
pm_affected_packages table
Follow thread below for instructions for fixing this table.
sbom_components table
Follow thread below for instructions for fixing this table.
container_registries table
Follow thread below for instructions for fixing this table.