pg-upgrade fails silently when running analyze_new_cluster.sh
Summary
Two Large Premium customers have reported that, after upgrading to PG11, they saw very high CPU usage, high load and many timeouts. These are symptoms of an unvacuumed/unanalyzed database, which is especially likely right after a major version upgrade. Both customers ended up reverting the PG upgrade because of this.
What actually happens is that we try to run the `analyze_new_cluster.sh` script that `pg_upgrade` generates and catch any errors. This step matters because `pg_upgrade` does not carry optimizer statistics over to the new cluster.
An initial test run of `gitlab-ctl pg-upgrade` (with `-V 11`, as I was testing on 12.8) doesn't show any errors. However, the following query shows that no table has `last_analyze` set either:
```
gitlabhq_production=# SELECT count(*) FROM pg_stat_user_tables WHERE last_analyze IS NOT NULL;
 count
-------
     0
(1 row)
```
Next attempt: revert the pg-upgrade and modify `/opt/gitlab/embedded/service/omnibus-ctl/lib/gitlab_ctl/util.rb` as follows to log the output:
```diff
@@ -11,6 +11,8 @@ module GitlabCtl
       def get_command_output(command, user = nil, timeout = nil)
         begin
           shell_out = run_command(command, live: false, user: user, timeout: timeout)
+          $stdout.print shell_out.stdout
+          $stdout.print shell_out.stderr
           shell_out.error!
         rescue Mixlib::ShellOut::ShellCommandFailed
           raise GitlabCtl::Errors::ExecutionError.new(
```
And ta-da:
```
vacuumdb: could not connect to database template1: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
```
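It looks like `analyze_new_cluster.sh` runs `vacuumdb` with the default connection settings, so it tries the default socket in `/tmp` instead of the Omnibus socket directory. `vacuumdb` never runs against the database, the script fails silently, and we don't catch this.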
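The patch above only prints the output. Below is a minimal sketch of how the call could fail loudly instead. It assumes that the generated script can still exit 0 after `vacuumdb` itself has failed, which is what the silent failure here suggests. `Open3.capture3` stands in for `Mixlib::ShellOut`/`run_command`. `AnalyzeFailed`, `VACUUMDB_ERROR` and `run_analyze_script` are hypothetical names, not omnibus-ctl code.

```ruby
require 'open3'

# Hypothetical stand-in for GitlabCtl::Errors::ExecutionError (not the real class).
# Keeps stdout/stderr so the real cause is not lost.
class AnalyzeFailed < StandardError
  attr_reader :stdout, :stderr, :exitstatus

  def initialize(command, stdout, stderr, exitstatus)
    @stdout = stdout
    @stderr = stderr
    @exitstatus = exitstatus
    super("#{command} failed (exit status #{exitstatus.inspect})\nSTDOUT:\n#{stdout}\nSTDERR:\n#{stderr}")
  end
end

# Assumption: vacuumdb error lines on stderr start with "vacuumdb: ", as in the
# "could not connect" message quoted in this issue.
VACUUMDB_ERROR = /^vacuumdb: /

# Runs the command and returns its stdout. Raises if the command exits non-zero
# OR if stderr contains a vacuumdb error line, because the wrapper script may
# still exit 0 after vacuumdb failed.
def run_analyze_script(command)
  stdout, stderr, status = Open3.capture3(command)
  if !status.success? || stderr.match?(VACUUMDB_ERROR)
    raise AnalyzeFailed.new(command, stdout, stderr, status.exitstatus)
  end

  stdout
end
```

Matching on stderr is a heuristic. A more robust fix would be to make the script's exit status reflect `vacuumdb`'s, but that depends on how the script is generated.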
Running `vacuumdb` manually with the added `-h` option works:
```
su - gitlab-psql -c '"/opt/gitlab/embedded/postgresql/11/bin/vacuumdb" --all --analyze-in-stages -h /var/opt/gitlab/postgresql'
```
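The manual command can also be built from Ruby. This is a minimal sketch that assumes the binary path and socket directory from the manual command (PostgreSQL 11, default Omnibus paths). `vacuumdb_argv` and `su_command` are hypothetical helper names, not existing omnibus-ctl functions. `Shellwords.join` quotes each argument so the inner command survives the extra shell that `su -c` starts.

```ruby
require 'shellwords'

# Paths taken from the manual command in this issue (assumed defaults for PG 11).
VACUUMDB = '/opt/gitlab/embedded/postgresql/11/bin/vacuumdb'
SOCKET_DIR = '/var/opt/gitlab/postgresql'

# Builds the vacuumdb argv with an explicit socket directory (-h), so vacuumdb
# does not fall back to the default /tmp socket.
def vacuumdb_argv(bin: VACUUMDB, host: SOCKET_DIR)
  [bin, '--all', '--analyze-in-stages', '-h', host]
end

# Wraps an argv for `su - USER -c '...'`, shell-quoting every argument of the
# inner command.
def su_command(user, argv)
  ['su', '-', user, '-c', Shellwords.join(argv)]
end
```

For example, `system(*su_command('gitlab-psql', vacuumdb_argv))` passes the array straight to `su` without an outer shell, so only the `su -c` layer parses the quoted string.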
Afterwards, the same query shows that the tables were analyzed:
```
gitlabhq_production=# SELECT count(*) FROM pg_stat_user_tables WHERE last_analyze IS NOT NULL;
 count
-------
   317
(1 row)
```
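A post-upgrade check built around this query could turn the silent failure into a hard error. This sketch assumes the query is run through psql with `-t -A` (tuples only, unaligned output), so that only the number is printed. `ANALYZED_TABLES_SQL`, `parse_count` and `analyzed_after_upgrade?` are hypothetical names.

```ruby
# Query from this issue: user tables whose last_analyze timestamp is set.
ANALYZED_TABLES_SQL = 'SELECT count(*) FROM pg_stat_user_tables WHERE last_analyze IS NOT NULL;'

# Parses the single value printed by `psql -t -A`.
# Integer() raises ArgumentError on empty or non-numeric output instead of
# silently treating it as 0.
def parse_count(output)
  Integer(output.strip)
end

# Coarse signal only: a count above zero shows ANALYZE ran on at least one
# table, not that every table was analyzed.
def analyzed_after_upgrade?(output)
  parse_count(output) > 0
end
```

The check could be fed the output of running `ANALYZED_TABLES_SQL` through `gitlab-psql` with `-t -A`, assuming `gitlab-psql` passes extra arguments through to `psql`.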
Steps to reproduce
Run `gitlab-ctl pg-upgrade`.
Details of package version
I reproduced this on Debian 9 with the following versions, in order, reverting the upgrade between versions, each time with the same outcome:
- 12.8.10-ee.0
- upgrade to 12.9.5-ee.0
- upgrade to 12.10.3-ee.0 with `/etc/gitlab/disable-postgresql-upgrade` in place (so that I could manually add the logging and test)
Environment details
Tested on a single Omnibus node in GCP, fresh install on Debian 9.
Configuration details
Default setup; nothing configured except `external_url`.
References
- Large Premium customer in ZD emergency ticket 155091 (internal-only)
- Large Premium customer in ZD ticket 156558 (internal-only)