Geo promote-to-primary-node invalid options in 13.1-13.4 on XFS
Summary
Before !4646 (merged), get_ctl_options was defined globally in two places, so whichever file loaded last overrode the other's definition.
This was noticed by a Premium customer during a Geo failover on GitLab v13.3.6-ee.
We couldn't reproduce it initially on an Ubuntu instance, but a base CentOS GCP install reproduced the error:
[root@cat-geotest2 ~]# gitlab-ctl promote-to-primary-node --skip-preflight-checks
Traceback (most recent call last):
6: from /opt/gitlab/embedded/bin/omnibus-ctl:23:in `<main>'
5: from /opt/gitlab/embedded/bin/omnibus-ctl:23:in `load'
4: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/omnibus-ctl-0.6.0/bin/omnibus-ctl:31:in `<top (required)>'
3: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:746:in `run'
2: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:204:in `block in add_command_under_category'
1: from /opt/gitlab/embedded/service/omnibus-ctl-ee/promote_to_primary_node.rb:21:in `block in load_file'
/opt/gitlab/embedded/service/omnibus-ctl-ee/promotion_preflight_checks.rb:15:in `get_ctl_options': invalid option: --skip-preflight-checks (OptionParser::InvalidOption)
The difference wasn't obvious at first, but comparing the straces shows that the files load in a different order:
Ubuntu:
8418 15:55:12.040236 execve("/opt/gitlab/embedded/bin/omnibus-ctl", ["/opt/gitlab/embedded/bin/omnibus-ctl", "gitlab", "/opt/gitlab/embedded/service/omnibus-ctl*", "promote-to-primary-node", "--skip-preflight-checks"], 0x5635d71bb050 /* 21 vars */ <unfinished ...>
8418 15:55:20.725007 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/promotion_preflight_checks.rb>
8418 15:55:20.785545 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/lib/geo/promotion_preflight_checks.rb>
8418 15:55:20.850128 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/promote_to_primary_node.rb>,
8418 15:55:21.134812 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/lib/geo/promote_to_primary_node.rb>
CentOS:
137642 15:51:03.809356 execve("/opt/gitlab/embedded/bin/omnibus-ctl", ["/opt/gitlab/embedded/bin/omnibus-ctl", "gitlab", "/opt/gitlab/embedded/service/omnibus-ctl*", "promote-to-primary-node", "--skip-preflight-checks"], 0x55844f39e550 /* 21 vars */) = 0 <0.000309>
137642 15:51:11.188369 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/promote_to_primary_node.rb>
137642 15:51:11.220625 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/lib/geo/promote_to_primary_node.rb>
137642 15:51:11.241745 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/promotion_preflight_checks.rb>
137642 15:51:11.279924 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/lib/geo/promotion_preflight_checks.rb>
It's clear at this point that on CentOS the promotion_preflight_checks file loads last, so its get_ctl_options overrides the one defined in promote_to_primary_node. It's not yet clear why the order differs, though.
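To illustrate the override, here is a minimal Ruby sketch. The two file bodies, the argv parameter, and the --verbose flag are hypothetical stand-ins, not the real omnibus-ctl-ee sources; the only thing it assumes is that both files define a top-level get_ctl_options and that the later load wins.

```ruby
require "optparse"
require "tmpdir"

# Hypothetical stand-ins for the two omnibus-ctl-ee files. Each defines a
# top-level get_ctl_options that accepts a different set of flags.
SOURCES = {
  "promote_to_primary_node" => <<~RUBY,
    def get_ctl_options(argv)
      options = {}
      OptionParser.new do |opts|
        opts.on("--force") { options[:force] = true }
        opts.on("--skip-preflight-checks") { options[:skip_preflight_checks] = true }
      end.parse!(argv.dup)
      options
    end
  RUBY
  "promotion_preflight_checks" => <<~RUBY
    def get_ctl_options(argv)
      options = {}
      OptionParser.new do |opts|
        opts.on("--verbose") { options[:verbose] = true } # hypothetical flag
      end.parse!(argv.dup)
      options
    end
  RUBY
}.freeze

# Write the files to a temp dir, load them in the given order, then parse argv
# with whichever get_ctl_options definition was loaded last.
def parse_after_loading(order, argv)
  Dir.mktmpdir do |dir|
    order.each do |name|
      path = File.join(dir, "#{name}.rb")
      File.write(path, SOURCES.fetch(name))
      load path # a top-level def replaces any earlier get_ctl_options
    end
    get_ctl_options(argv)
  end
rescue OptionParser::InvalidOption => e
  e.message
end

args = ["--skip-preflight-checks"]
# ext4-like order from the Ubuntu strace: promote_to_primary_node loads last.
p parse_after_loading(%w[promotion_preflight_checks promote_to_primary_node], args)
# XFS-like order from the CentOS strace: promotion_preflight_checks loads last.
p parse_after_loading(%w[promote_to_primary_node promotion_preflight_checks], args)
```

With the first order the flag parses; with the second, the parser that survives does not know the flag and raises OptionParser::InvalidOption, as in the traceback above.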
Calling the getdents syscall directly shows that the directory entries are returned in a different order on the two systems:
Ubuntu:
root@cat-geotest1:/opt/gitlab/embedded/service/omnibus-ctl-ee# /tmp/getdents
--------------- nread=488 ---------------
inode# file type d_reclen d_off d_name
537181 directory 24 1352499008977483534 ..
1052591 regular 56 2504495246694028984 promotion_preflight_checks.rb
1052590 regular 48 2944828851944531620 promote_to_primary_node.rb
1052588 regular 32 3855910538002281433 patroni.rb
1052580 regular 32 5380097782668342599 consul.rb
1052579 directory 24 5675545248176283607 .
1052593 regular 32 6203997677802995374 repmgr.rb
1052594 regular 48 6478499198533784667 set_geo_primary_node.rb
1052582 regular 48 6794371002025137889 get_postgresql_primary.rb
1052589 regular 32 6808389166771206748 pgbouncer.rb
1052583 directory 24 7253071941143058311 lib
1052581 regular 40 7719487038616488496 geo_replication.rb
1052592 regular 48 9223372036854775807 replicate_geo_database.rb
CentOS:
[root@cat-geotest2 omnibus-ctl-ee]# /tmp/getdents
--------------- nread=488 ---------------
inode# file type d_reclen d_off d_name
51032921 directory 24 10 .
1437530 directory 24 16 ..
51032922 regular 32 24 consul.rb
51032923 regular 40 34 geo_replication.rb
51032924 regular 48 39 get_postgresql_primary.rb
2961359 directory 24 45 lib
51032925 regular 32 53 patroni.rb
51032926 regular 32 62 pgbouncer.rb
51032927 regular 48 74 promote_to_primary_node.rb
51032928 regular 56 86 promotion_preflight_checks.rb
51032929 regular 48 95 replicate_geo_database.rb
51032930 regular 32 104 repmgr.rb
51032931 regular 48 512 set_geo_primary_node.rb
So this explains the different load order, but why is the order different? It turns out the default filesystems differ: ext4 on Ubuntu, XFS on CentOS.
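The same effect can be observed from Ruby without a compiled helper. The sketch below assumes the loader enumerates the directory without sorting (not verified against the omnibus-ctl source here); ruby_files is a hypothetical helper name. Dir.children returns names in whatever order the filesystem hands them back, and Dir.glob is likewise unsorted before Ruby 3.0 (the traceback shows the embedded Ruby is 2.6).

```ruby
require "tmpdir"

# Return the *.rb files in dir twice: in the order the filesystem returns
# them, and sorted by name. Only the sorted list is the same on ext4 and XFS.
def ruby_files(dir)
  raw = Dir.children(dir).select { |name| name.end_with?(".rb") }
  { raw: raw, sorted: raw.sort }
end

Dir.mktmpdir do |dir|
  # Create the two colliding file names from this issue (empty files).
  %w[promotion_preflight_checks.rb promote_to_primary_node.rb].each do |name|
    File.write(File.join(dir, name), "")
  end
  listing = ruby_files(dir)
  puts "raw:    #{listing[:raw].inspect}"    # filesystem-dependent
  puts "sorted: #{listing[:sorted].inspect}" # deterministic
end
```

Sorting makes the load order deterministic, but it does not remove the underlying name collision; it only decides which definition wins on every filesystem.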
Creating a test XFS partition locally on the Ubuntu instance reproduces the issue as well:
Steps to reproduce
dd if=/dev/zero of=xfstest count=999999
/sbin/mkfs -t xfs -q xfstest
mkdir testmount
mount -o loop=/dev/loop4 /tmp/xfstest /tmp/testmount
cp -pr /opt/gitlab/embedded/service/omnibus-ctl-ee testmount/
# now, to see the difference:
cd /tmp/testmount/omnibus-ctl-ee && /tmp/getdents
cd /opt/gitlab/embedded/service/omnibus-ctl-ee && /tmp/getdents
Here /tmp/getdents is a compiled C program that calls the getdents syscall, for example the one in the man-pages.
What is the current bug behavior?
Files get loaded in a different order, so the --force and --skip-preflight-checks options do not work on XFS (and possibly other) filesystems.
What is the expected correct behavior?
The --force and --skip-preflight-checks options should work.
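One way to make option parsing independent of load order is to scope each parser to its own module, so the two definitions can no longer replace each other. The sketch below is an illustration only and is not necessarily how !4646 fixed it; the module names, the argv parameter, and the --verbose flag are hypothetical.

```ruby
require "optparse"

# Each command keeps its own parser under its own namespace, so loading the
# files in either order leaves both definitions intact.
module PromoteToPrimaryNode
  def self.get_ctl_options(argv)
    options = {}
    OptionParser.new do |opts|
      opts.on("--force") { options[:force] = true }
      opts.on("--skip-preflight-checks") { options[:skip_preflight_checks] = true }
    end.parse!(argv.dup)
    options
  end
end

module PromotionPreflightChecks
  def self.get_ctl_options(argv)
    options = {}
    OptionParser.new do |opts|
      opts.on("--verbose") { options[:verbose] = true } # hypothetical flag
    end.parse!(argv.dup)
    options
  end
end

p PromoteToPrimaryNode.get_ctl_options(["--force", "--skip-preflight-checks"])
p PromotionPreflightChecks.get_ctl_options(["--verbose"])
```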
Note: this was fixed in !4646 (merged) in 13.5, but we should add it to the docs. The current workaround is to run the preflight checks manually and promote the DB, i.e.:
gitlab-ctl promotion-preflight-checks
/opt/gitlab/embedded/bin/gitlab-pg-ctl promote
gitlab-ctl reconfigure
gitlab-rake geo:set_secondary_as_primary