gitlab-ctl pg-upgrade -V <version> doesn't clean up the backup directory on rollback, causing the command to fail when run again

Summary

This issue was noticed while upgrading PostgreSQL from 13.11 to 14.8.

When gitlab-ctl pg-upgrade fails partway through and reverts the upgrade, running the command again fails with an error.

Steps to reproduce

  • Create a 3k Omnibus instance with the GitLab Environment Toolkit (GET)
  • Once the instance is up and running, upgrade PostgreSQL by following the upgrade process mentioned here
  • Run gitlab-ctl pg-upgrade -V 14 on a replica first, before the master, so that the script fails partway through and the rollback is triggered
  • Run gitlab-ctl pg-upgrade -V 14 again on the same replica

What is the current bug behavior?

  • The upgrade process creates a /var/opt/gitlab/postgresql/data.13 directory. When gitlab-ctl pg-upgrade -V 14 fails after creating this directory and the user retries the command, the upgrade fails with errors such as:
   /opt/gitlab/embedded/lib/ruby/3.0.0/fileutils.rb:1415:in `initialize': No such file or directory @ rb_sysopen - /var/opt/gitlab/postgresql/data.14/patroni.dynamic.json (Errno::ENOENT)
        from /opt/gitlab/embedded/lib/ruby/3.0.0/fileutils.rb:1415:in `open'
        from /opt/gitlab/embedded/lib/ruby/3.0.0/fileutils.rb:1415:in `block in copy_file'
        from /opt/gitlab/embedded/lib/ruby/3.0.0/fileutils.rb:1414:in `open'
        from /opt/gitlab/embedded/lib/ruby/3.0.0/fileutils.rb:1414:in `copy_file'
        from /opt/gitlab/embedded/lib/ruby/3.0.0/fileutils.rb:514:in `copy_file'
        from /opt/gitlab/embedded/service/omnibus-ctl/pg-upgrade.rb:732:in `copy_patroni_dynamic_config'
        from /opt/gitlab/embedded/service/omnibus-ctl/pg-upgrade.rb:318:in `common_post_upgrade'
        from /opt/gitlab/embedded/service/omnibus-ctl/pg-upgrade.rb:369:in `patroni_replica_upgrade'
        from /opt/gitlab/embedded/service/omnibus-ctl/pg-upgrade.rb:282:in `block in load_file'
        from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:204:in `block in add_command_under_category'
        from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:746:in `run'
        from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/omnibus-ctl-0.6.0/bin/omnibus-ctl:31:in `<top (required)>'
        from /opt/gitlab/embedded/bin/omnibus-ctl:25:in `load'
        from /opt/gitlab/embedded/bin/omnibus-ctl:25:in `<main>'

What is the expected correct behavior?

  • Any directory created during the upgrade process, such as /var/opt/gitlab/postgresql/data.13, should be cleaned up during the rollback
    • Alternatively, the upgrade process should validate before it starts that no leftover data.x directories exist
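
The validation suggested above could look roughly like the following shell sketch. It is not part of gitlab-ctl: the function names find_stale_data_dirs and preflight_check are hypothetical, and the only path it assumes is the /var/opt/gitlab/postgresql base directory quoted in this report. It only reports leftover data.<N> directories and never deletes anything.

```shell
#!/bin/sh
# Hypothetical pre-flight check; NOT part of gitlab-ctl.
# Lists leftover versioned data directories (data.<N>) under a base directory.

find_stale_data_dirs() {
    # $1: base directory; defaults to the Omnibus path quoted in this report.
    base_dir="${1:-/var/opt/gitlab/postgresql}"
    for dir in "$base_dir"/data.*; do
        # Skip the unexpanded glob and anything that is not a directory.
        [ -d "$dir" ] || continue
        # Keep only purely numeric suffixes such as data.13.
        case "${dir##*/data.}" in
            ''|*[!0-9]*) continue ;;
        esac
        printf '%s\n' "$dir"
    done
}

preflight_check() {
    # Returns 1 and prints the offending paths if any data.<N> directory exists.
    stale="$(find_stale_data_dirs "$1")"
    if [ -n "$stale" ]; then
        echo "Leftover data directories found; inspect before retrying pg-upgrade:" >&2
        printf '%s\n' "$stale" >&2
        return 1
    fi
    return 0
}
```

An operator could source this file and run preflight_check before retrying the upgrade, only proceeding when it returns 0. Whether a leftover directory is safe to remove still has to be decided by hand.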

Relevant logs

The following errors appear in the Patroni log at /var/log/gitlab/patroni/current:

2023-09-28_02:37:10.50626 2023-09-28 02:37:10,505 INFO: doing crash recovery in a single user mode
2023-09-28_02:37:10.50684 2023-09-28 02:37:10,506 ERROR: Error when reading postmaster.opts
2023-09-28_02:37:10.50685 Traceback (most recent call last):
2023-09-28_02:37:10.50686   File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/postgresql/rewind.py", line 383, in read_postmaster_opts
2023-09-28_02:37:10.50686     with open(os.path.join(self._postgresql.data_dir, 'postmaster.opts')) as f:
2023-09-28_02:37:10.50687 FileNotFoundError: [Errno 2] No such file or directory: '/var/opt/gitlab/postgresql/data/postmaster.opts'
2023-09-28_02:37:10.51588 2023-09-28 02:37:10,514 ERROR: Crash recovery finished with code=1
2023-09-28_02:37:10.51592 2023-09-28 02:37:10,515 INFO:  stdout=
2023-09-28_02:37:10.51602 2023-09-28 02:37:10,515 INFO:  stderr=FATAL:  database files are incompatible with server
2023-09-28_02:37:10.51602 DETAIL:  The data directory was initialized by PostgreSQL version 13, which is not compatible with this version 14.8.
2023-09-28_02:37:10.51603 
2023-09-28_02:37:11.55687 2023-09-28 02:37:11,555 INFO: Deregister service postgresql-ha/omnibus3k-postgres-1.c.vpatel-d32322c6.internal
2023-09-28_02:37:13.10551 2023-09-28 02:37:13,105 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-09-28_02:37:13.12405 2023-09-28 02:37:13,110 INFO: Lock owner: omnibus3k-postgres-2.c.vpatel-d32322c6.internal; I am omnibus3k-postgres-1.c.vpatel-d32322c6.internal
2023-09-28_02:37:13.12408 2023-09-28 02:37:13,123 INFO: Deregister service postgresql-ha/omnibus3k-postgres-1.c.vpatel-d32322c6.internal
2023-09-28_02:37:13.12544 2023-09-28 02:37:13,125 INFO: trying to bootstrap from leader 'omnibus3k-postgres-2.c.vpatel-d32322c6.internal'
2023-09-28_02:37:13.14349 2023-09-28 02:37:13,127 INFO: Lock owner: omnibus3k-postgres-2.c.vpatel-d32322c6.internal; I am omnibus3k-postgres-1.c.vpatel-d32322c6.internal
2023-09-28_02:37:13.14351 2023-09-28 02:37:13,142 WARNING: Could not register service: unknown role type uninitialized
2023-09-28_02:37:13.14352 2023-09-28 02:37:13,142 INFO: bootstrap from leader 'omnibus3k-postgres-2.c.vpatel-d32322c6.internal' in progress
2023-09-28_02:37:23.13159 2023-09-28 02:37:23,128 INFO: Lock owner: omnibus3k-postgres-2.c.vpatel-d32322c6.internal; I am omnibus3k-postgres-1.c.vpatel-d32322c6.internal
2023-09-28_02:37:23.13162 2023-09-28 02:37:23,131 INFO: bootstrap from leader 'omnibus3k-postgres-2.c.vpatel-d32322c6.internal' in progress
2023-09-28_02:37:33.13035 2023-09-28 02:37:33,128 INFO: Lock owner: omnibus3k-postgres-2.c.vpatel-d32322c6.internal; I am omnibus3k-postgres-1.c.vpatel-d32322c6.internal
2023-09-28_02:37:33.13040 2023-09-28 02:37:33,130 INFO: bootstrap from leader 'omnibus3k-postgres-2.c.vpatel-d32322c6.internal' in progress
2023-09-28_02:37:43.13081 2023-09-28 02:37:43,128 INFO: Lock owner: omnibus3k-postgres-2.c.vpatel-d32322c6.internal; I am omnibus3k-postgres-1.c.vpatel-d32322c6.internal
2023-09-28_02:37:43.13084 2023-09-28 02:37:43,130 INFO: bootstrap from leader 'omnibus3k-postgres-2.c.vpatel-d32322c6.internal' in progress
2023-09-28_02:37:51.17533 pg_basebackup: error: could not create directory "/var/opt/gitlab/postgresql/data/pg_wal": Permission denied
2023-09-28_02:37:51.17578 pg_basebackup: removing contents of data directory "/var/opt/gitlab/postgresql/data"
2023-09-28_02:37:51.17701 2023-09-28 02:37:51,176 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2023-09-28_02:37:51.17707 2023-09-28 02:37:51,176 WARNING: Trying again in 5 seconds
2023-09-28_02:37:53.13049 2023-09-28 02:37:53,128 INFO: Lock owner: omnibus3k-postgres-2.c.vpatel-d32322c6.internal; I am omnibus3k-postgres-1.c.vpatel-d32322c6.internal
2023-09-28_02:37:53.13053 2023-09-28 02:37:53,130 INFO: bootstrap from leader 'omnibus3k-postgres-2.c.vpatel-d32322c6.internal' in progress

Workaround

The workaround was to:

  1. Stop Patroni on the master
  2. Run gitlab-ctl pg-upgrade on the master
  3. Delete /var/opt/gitlab/postgresql/data.x on replica 1
  4. Run gitlab-ctl pg-upgrade on replica 1
  5. Repeat steps 3 and 4 on the other replicas
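
The order of the workaround steps can be sketched as a small shell helper. This is an illustration under assumptions, not a supported procedure: the run, upgrade_master and upgrade_replica names and the DRY_RUN switch are hypothetical, the target version and the leftover directory are passed in by the operator because the report only names it as data.x, and by default the script prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the workaround order from this report; helper names are hypothetical.
# Prints commands by default. Set DRY_RUN=0 to actually execute them.

DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Steps 1 and 2, to be run on the master. $1: target major version.
upgrade_master() {
    run gitlab-ctl stop patroni
    run gitlab-ctl pg-upgrade -V "$1"
}

# Steps 3 and 4, to be run on each replica.
# $1: target major version, $2: leftover data.<N> directory to delete.
upgrade_replica() {
    target="$1"
    stale_dir="$2"
    # Guard (an added assumption): only accept paths ending in data.<digits...>.
    case "$stale_dir" in
        */data.[0-9]*) ;;
        *) echo "refusing to remove unexpected path: $stale_dir" >&2; return 1 ;;
    esac
    run rm -rf "$stale_dir"
    run gitlab-ctl pg-upgrade -V "$target"
}
```

For example, sourcing the file on a replica and calling upgrade_replica 14 /var/opt/gitlab/postgresql/data.13 with the default DRY_RUN=1 prints the two commands for review. Deleting the directory is destructive, so its contents should be checked first.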

Details of package version

Provide the package version installation details