Draft: Procedure for fixing PostgreSQL table corruption
Summary
If the PostgreSQL log (usually `/var/opt/gitlab/postgresql/current`) contains errors indicating that database pages are not accessible, cannot be read, or are corrupted, it may be necessary to instruct PostgreSQL to discard those pages, and so formally discard some of the data in the table.
```
ERROR: invalid page in block 2499 of relation base/16401/1730
ERROR: could not read block 67179 in file "base/16401/18053": read only 0 of 8192 bytes
```
Additionally, you may see the following if a file is actually missing. PostgreSQL cannot fix this using the `SET zero_damaged_pages = on` option.
```
ERROR: could not open file "base/16401/20717": No such file or directory
```
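To confirm which of these errors are present, it may help to search the current PostgreSQL log; the pattern below is only a suggestion and assumes the default Omnibus log location:

```shell
# Search the live PostgreSQL log for the corruption errors described above.
sudo grep -E 'invalid page|could not read block|could not open file' \
  /var/opt/gitlab/postgresql/current
```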
This procedure WILL cause data loss (or, at least, crystallize data loss) and should be performed only under the guidance of GitLab Support AND as a last resort.
Deployments with repmgr, Patroni, or GitLab Geo
This procedure should NOT be performed on a deployment of GitLab that has database replication - via repmgr, Patroni, GitLab Geo, or any replication capability provided by your cloud vendor. Check the status of the table on the other replicas first, and promote one of the other replicas to primary. Then re-initialize the corrupted database instance from a complete primary.
This is because the corruption will have occurred while the pages were being written to disk, or while they were at rest, and the likely causes are unlikely to be replicated: each replica writes these pages out independently, using the write-ahead log (WAL) as its single source of truth.
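For example, on an Omnibus cluster managed by Patroni, the replication topology and current leader can be reviewed before deciding which node to promote; repmgr and Geo deployments have their own equivalent status commands:

```shell
# List the Patroni cluster members, their roles, and their replication state.
sudo gitlab-ctl patroni members
```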
References
Ticket references for GitLab team members. Internal use only:
- https://gitlab.zendesk.com/agent/tickets/217060 (13.12 / PG12.6)
- https://gitlab.zendesk.com/agent/tickets/225997 (13.9.1 / PG12.5)
- https://gitlab.zendesk.com/agent/tickets/230820 (caused by hardware failure)
- https://gitlab.zendesk.com/agent/tickets/268670 (filesystem issues corrupted an index; 14.4.1)
- https://gitlab.zendesk.com/agent/tickets/274740 (caused after GitLab disk was 100% full)
- #388878 (closed) (missing `product_analytics_events_experimental` table files)
Procedure
- Check that it's not file permissions.
- In a default Omnibus installation, the files are stored under `/var/opt/gitlab/postgresql/data`.
- Check permission bits and ownership to make sure these aren't the cause of the errors.
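One way to do this, assuming the default Omnibus layout where the database files are owned by the `gitlab-psql` user, is:

```shell
# The data directory and its files should be owned by gitlab-psql and not world-readable.
sudo ls -ld /var/opt/gitlab/postgresql/data
sudo ls -l /var/opt/gitlab/postgresql/data/base | head
```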
- Check the server syslog for filesystem errors, and resolve these.
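A quick, non-exhaustive way to look for recent storage and filesystem errors in the kernel log (the exact messages vary by filesystem and driver) is:

```shell
# Look for I/O and filesystem errors reported by the kernel.
sudo dmesg -T | grep -iE 'i/o error|ext4|xfs' | tail -n 50
```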
- If the database is on NFS, note that PostgreSQL's guidance on NFS in short assumes that NFS behaves the same as block storage. In practice, many NFS performance optimisations can sacrifice data integrity in some situations. See also our guidance on using NFS with Gitaly, which has similar challenges.
- Verify the issue exists, if necessary, by attempting a database backup.
```
# in one session
sudo gitlab-ctl tail postgresql

# in another
sudo gitlab-backup create SKIP=uploads,builds,artifacts,lfs,registry,pages,repositories
```
- Make some sort of backup. Snapshot the disk, the virtual machine, or the filesystem. Or shut down all of GitLab, including PostgreSQL, and tar up `/var/opt/gitlab/postgresql/data`, if that is where the instance's database files are.
- Store this backup completely separately from the PostgreSQL data. PostgreSQL file corruption should not occur, so ask appropriate questions about the integrity of all elements of the storage subsystem.
- If you have file-level backups of `/var/opt/gitlab/postgresql/data`, such as a corporate backup solution, this may be a better option for recovering the database than the following procedures.
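As one possible sketch of a cold file-level backup, assuming the default Omnibus data directory and a hypothetical `/mnt/backup` destination on separate storage:

```shell
# Stop all of GitLab, including PostgreSQL, so the data directory is not changing.
sudo gitlab-ctl stop
# Archive the PostgreSQL data directory to storage separate from the affected disk.
sudo tar -czf /mnt/backup/postgresql-data-$(date +%F).tar.gz \
  -C /var/opt/gitlab/postgresql data
# Start PostgreSQL again; the following steps only need PostgreSQL running.
sudo gitlab-ctl start postgresql
```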
Scope the problem
Given errors about specific file references (for example, `base/16401/17305`), query which objects (table or index) are impacted:
```
# for:
ERROR: invalid page in block 2499 of relation base/16401/17305
# run in SQL:
SELECT pg_filenode_relation(0, 17305);
```
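On Omnibus, this lookup can be run through the bundled `psql` wrapper; `17305` here is the filenode taken from the error message, so substitute your own:

```shell
# Resolve the filenode from the error message to a table or index name.
sudo gitlab-psql -c "SELECT pg_filenode_relation(0, 17305);"
```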
- Attempt to dump the database or the affected table, and check for errors in the PostgreSQL log (using `gitlab-ctl tail` as above). Substitute the table name for `tablename`; if you have multiple affected tables, add more `-t tablename` parameters.
```
## running as root (on GitLab Omnibus)
/opt/gitlab/embedded/bin/chpst -u gitlab-psql:gitlab-psql -U gitlab-psql /usr/bin/env PGSSLCOMPRESSION=0 \
  /opt/gitlab/embedded/bin/pg_dump -p 5432 -h /var/opt/gitlab/postgresql \
  -d gitlabhq_production >/dev/null

## or
## running as the PostgreSQL unix user
/opt/gitlab/embedded/bin/pg_dump -p 5432 -h /var/opt/gitlab/postgresql \
  -d gitlabhq_production >/dev/null
```
- When it fails, identify the affected table from the output, amend the `pg_dump` command, and exclude the affected table with `-T tablename`.
```
pg_dump #other parameters# -T ci_builds -T projects >/dev/null
```
- Repeat, adding more tables with `-T`, until the dump succeeds. You then know which are the affected tables.
pg_dump procedure after attempting fixes
Later in the process, after you have tried to fix the broken tables, use `pg_dump` to check whether a table is now readable. Do this by dumping just the specific table(s), with `-t tablename`:
```
pg_dump #other parameters# -t ci_builds -t projects >/dev/null
```
Then, once all the tables are fixed, `pg_dump` with no tables specified should successfully dump the whole database.
Fix an index
Do not perform this procedure on a deployment of GitLab that has database replication (or, only as a last resort) - see above
- It may just be a broken index, which is why PostgreSQL might suggest:
```
ActiveRecord::StatementInvalid (PG::IndexCorrupted: ERROR: index "foo" contains unexpected zero page at block 0
HINT: Please REINDEX it.
):
```
- Before attempting any other steps, even if you don't see that error, try re-indexing the table:
```
SET statement_timeout = 0;
REINDEX TABLE tablename;
```
- If the data isn't accessible to rebuild the index, this is likely to fail.
- Recheck with a `pg_dump` after doing this.
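As a sketch only, on Omnibus the reindex can also be run non-interactively through `gitlab-psql`; `projects` is just a placeholder table name here:

```shell
# Remove the statement timeout for this session and rebuild the table's indexes.
sudo gitlab-psql -c "SET statement_timeout = 0; REINDEX TABLE projects;"
```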
Crystallize data loss on a table
Do not perform this procedure on a deployment of GitLab that has database replication (or, only as a last resort) - see above
The procedure for scanning a table and discarding affected pages is as follows:
- Shut down GitLab, leaving only PostgreSQL running (`gitlab-ctl stop; gitlab-ctl start postgresql`).
- Via `gitlab-psql`, run the following, substituting `tablename`:
```
SET statement_timeout = 0;
SET zero_damaged_pages = on;
VACUUM FULL ANALYZE tablename;
REINDEX TABLE tablename;
\q
```
- Now repeat the database backup procedure from earlier, while monitoring logs, to verify no other issues are outstanding.
- Run `gitlab-ctl restart`.
Fix missing files (`No such file or directory`)
Do not perform this procedure on a deployment of GitLab that has database replication (or, only as a last resort) - see above
More than one incident has involved missing files.
In one case, the file got recreated by PostgreSQL before other intervention was attempted. Data loss occurred on that table.
In the other cases, the files were restored from backup.
In this issue, the missing files were recreated with `touch` and the permissions set correctly.
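As a hypothetical sketch only, reusing the filenode from the error message above and the default Omnibus data directory (only do this under Support guidance; the recreated file is empty, so any data it held is lost):

```shell
# Recreate the missing relation file, then match the ownership and mode of the
# other PostgreSQL data files.
sudo touch /var/opt/gitlab/postgresql/data/base/16401/20717
sudo chown gitlab-psql:gitlab-psql /var/opt/gitlab/postgresql/data/base/16401/20717
sudo chmod 600 /var/opt/gitlab/postgresql/data/base/16401/20717
```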
Then use the `pg_dump` procedure to verify access.
It was found that some records in one of the recovered tables were missing column values. The records were fixed by setting the values to `NULL`. GitLab team members can find more detail in the ticket.