Draft: Procedure for fixing PostgreSQL table corruption
Summary
If the PostgreSQL log (usually `/var/opt/gitlab/postgresql/current`) contains errors indicating that database pages are not accessible, cannot be read, or are corrupted, it may be necessary to instruct PostgreSQL to discard those pages, and so formally discard some of the data in the table.
```
ERROR: invalid page in block 2499 of relation base/16401/1730
ERROR: could not read block 67179 in file "base/16401/18053": read only 0 of 8192 bytes
```
Additionally, you may see the following if a file is actually missing. PostgreSQL cannot fix this using the `SET zero_damaged_pages = on` option.
```
ERROR: could not open file "base/16401/20717": No such file or directory
```
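To confirm which of these errors are present, it may help to search the current PostgreSQL log; the pattern below is only a suggestion and assumes the default Omnibus log location:

```shell
# Search the live PostgreSQL log for the corruption errors described above.
sudo grep -E 'invalid page|could not read block|could not open file' \
  /var/opt/gitlab/postgresql/current
```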
This procedure WILL cause data loss (or, at least, crystallize data loss) and should be performed only under the guidance of GitLab Support AND as a last resort.
Deployments with repmgr, Patroni, or GitLab Geo
This procedure should NOT be performed on a deployment of GitLab that has database replication - via repmgr, Patroni, GitLab Geo, or any replication capability provided by your cloud vendor. Check the status of the table on the other replicas first, and promote one of the other replicas to primary. Then re-initialize the corrupted database instance from a complete primary.
This is because the corruption will have occurred while the pages were being written to disk, or while they were at rest, and the likely causes are unlikely to be replicated: each replica writes these pages out independently, using the write-ahead log (WAL) as its single source of truth.
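For example, on an Omnibus cluster managed by Patroni, the replication topology and current leader can be reviewed before deciding which node to promote; repmgr and Geo deployments have their own equivalent status commands:

```shell
# List the Patroni cluster members, their roles, and their replication state.
sudo gitlab-ctl patroni members
```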
References
Ticket references for GitLab team members. Internal use only:
- https://gitlab.zendesk.com/agent/tickets/217060 (13.12 / PG12.6)
- https://gitlab.zendesk.com/agent/tickets/225997 (13.9.1 / PG12.5)
- https://gitlab.zendesk.com/agent/tickets/230820 (caused by hardware failure)
- https://gitlab.zendesk.com/agent/tickets/268670 (filesystem issues corrupted an index; 14.4.1)
- https://gitlab.zendesk.com/agent/tickets/274740 (caused after GitLab disk was 100% full)
- #388878 (closed) (missing `product_analytics_events_experimental` table files)
Procedure
- Check that it's not file permissions.
- In a default Omnibus installation, the files are stored under `/var/opt/gitlab/postgresql/data`.
- Check permission bits and ownership to make sure these aren't the cause of the errors.
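One way to do this, assuming the default Omnibus layout where the database files are owned by the `gitlab-psql` user, is:

```shell
# The data directory and its files should be owned by gitlab-psql and not world-readable.
sudo ls -ld /var/opt/gitlab/postgresql/data
sudo ls -l /var/opt/gitlab/postgresql/data/base | head
```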
- Check the server syslog for filesystem errors, and resolve these.
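A quick, non-exhaustive way to look for recent storage and filesystem errors in the kernel log (the exact messages vary by filesystem and driver) is:

```shell
# Look for I/O and filesystem errors reported by the kernel.
sudo dmesg -T | grep -iE 'i/o error|ext4|xfs' | tail -n 50
```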
- If the database is on NFS, note that PostgreSQL's guidance on NFS in short assumes that NFS behaves the same as block storage. In practice, many NFS performance optimisations can sacrifice data integrity in some situations. See also our guidance on using NFS with Gitaly, which has similar challenges.
- Verify the issue exists, if necessary, by attempting a database backup.
```
# in one session
sudo gitlab-ctl tail postgresql

# in another
sudo gitlab-backup create SKIP=uploads,builds,artifacts,lfs,registry,pages,repositories
```
- Make some sort of backup. Snapshot the disk, the virtual machine, or the filesystem. Or shut down all of GitLab, including PostgreSQL, and tar up `/var/opt/gitlab/postgresql/data`, if that is where the instance's database files are.
- Store this backup completely separately from the PostgreSQL data. PostgreSQL file corruption should not occur, so ask appropriate questions about the integrity of all elements of the storage subsystem.
- If you have file-level backups of `/var/opt/gitlab/postgresql/data`, such as a corporate backup solution, this may be a better option for recovering the database than the following procedures.
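As one possible sketch of a cold file-level backup, assuming the default Omnibus data directory and a hypothetical `/mnt/backup` destination on separate storage:

```shell
# Stop all of GitLab, including PostgreSQL, so the data directory is not changing.
sudo gitlab-ctl stop
# Archive the PostgreSQL data directory to storage separate from the affected disk.
sudo tar -czf /mnt/backup/postgresql-data-$(date +%F).tar.gz \
  -C /var/opt/gitlab/postgresql data
# Start PostgreSQL again; the following steps only need PostgreSQL running.
sudo gitlab-ctl start postgresql
```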
Scope the problem
Given errors about specific file references (for example, `base/16401/17305`), query which objects (table or index) are impacted:
```
# for:
ERROR: invalid page in block 2499 of relation base/16401/17305
# run in SQL:
SELECT pg_filenode_relation(0, 17305);
```
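On Omnibus, this lookup can be run through the bundled `psql` wrapper; `17305` here is the filenode taken from the error message, so substitute your own:

```shell
# Resolve the filenode from the error message to a table or index name.
sudo gitlab-psql -c "SELECT pg_filenode_relation(0, 17305);"
```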
- Attempt to dump the database or the affected table, and check for errors in the PostgreSQL log (using `gitlab-ctl tail` as above). Substitute the table name for `tablename`; if you have multiple affected tables, add more `-t tablename` parameters.
```
## running as root (on GitLab Omnibus)
/opt/gitlab/embedded/bin/chpst -u gitlab-psql:gitlab-psql -U gitlab-psql /usr/bin/env PGSSLCOMPRESSION=0 \
  /opt/gitlab/embedded/bin/pg_dump -p 5432 -h /var/opt/gitlab/postgresql \
  -d gitlabhq_production >/dev/null

## or
## running as the PostgreSQL unix user
/opt/gitlab/embedded/bin/pg_dump -p 5432 -h /var/opt/gitlab/postgresql \
  -d gitlabhq_production >/dev/null
```
- When it fails, identify the affected table from the output, amend the `pg_dump` command, and exclude the affected table with `-T tablename`.
```
pg_dump #other parameters# -T ci_builds -T projects >/dev/null
```
- Repeat, adding more tables with `-T`, until the dump succeeds. You then know which are the affected tables.
pg_dump procedure after attempting fixes
Later in the process, after you have tried to fix the broken tables, use `pg_dump` to check whether a table is now readable. Do this by dumping just the specific table(s), with `-t tablename`:
```
pg_dump #other parameters# -t ci_builds -t projects >/dev/null
```
Then, once all the tables are fixed, `pg_dump` with no tables specified should successfully dump the whole database.
Fix an index
Do not perform this procedure on a deployment of GitLab that has database replication (or, only as a last resort) - see above
- It may just be a broken index, which is why PostgreSQL might suggest:
```
ActiveRecord::StatementInvalid (PG::IndexCorrupted: ERROR: index "foo" contains unexpected zero page at block 0
HINT: Please REINDEX it.
):
```
- Before attempting any other steps, even if you don't see that error, try re-indexing the table:
```
SET statement_timeout = 0;
REINDEX TABLE tablename;
```
- If the data isn't accessible to rebuild the index, this is likely to fail.
- Recheck with a `pg_dump` after doing this.
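As a sketch only, on Omnibus the reindex can also be run non-interactively through `gitlab-psql`; `projects` is just a placeholder table name here:

```shell
# Remove the statement timeout for this session and rebuild the table's indexes.
sudo gitlab-psql -c "SET statement_timeout = 0; REINDEX TABLE projects;"
```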
Crystallize data loss on a table
Do not perform this procedure on a deployment of GitLab that has database replication (or, only as a last resort) - see above
The procedure for scanning a table and discarding affected pages is as follows:
- Shut down GitLab, leaving only PostgreSQL running (`gitlab-ctl stop; gitlab-ctl start postgresql`).
- Via `gitlab-psql`, run the following, substituting `tablename`:
```
SET statement_timeout = 0;
SET zero_damaged_pages = on;
VACUUM FULL ANALYZE tablename;
REINDEX TABLE tablename;
\q
```
- Now repeat the database backup procedure from earlier, while monitoring logs, to verify no other issues are outstanding.
- Run `gitlab-ctl restart`.
Fix missing files (`No such file or directory`)
Do not perform this procedure on a deployment of GitLab that has database replication (or, only as a last resort) - see above
More than one incident has involved missing files.
In one case, the file got recreated by PostgreSQL before other intervention was attempted. Data loss occurred on that table.
In the other cases, the files were restored from backup.
In this issue, the missing files were recreated with `touch` and the permissions set correctly.
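As a hypothetical sketch only, reusing the filenode from the error message above and the default Omnibus data directory (only do this under Support guidance; the recreated file is empty, so any data it held is lost):

```shell
# Recreate the missing relation file, then match the ownership and mode of the
# other PostgreSQL data files.
sudo touch /var/opt/gitlab/postgresql/data/base/16401/20717
sudo chown gitlab-psql:gitlab-psql /var/opt/gitlab/postgresql/data/base/16401/20717
sudo chmod 600 /var/opt/gitlab/postgresql/data/base/16401/20717
```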
Then use the `pg_dump` procedure to verify access.
It was found that some records in one of the recovered tables were missing column values. The records were fixed by setting the values to `NULL`. GitLab team members can find more detail in the ticket.