Geo promote-to-primary-node invalid options in 13.1-13.4 on XFS

Summary

Before !4646 (merged), get_ctl_options was defined globally in two places, so whichever file loaded last overrode the other's definition.

This was noticed by a Premium customer during a Geo failover on GitLab v13.3.6-ee.

We couldn't reproduce initially on an Ubuntu instance, but a base CentOS GCP install reproduced the error:

[root@cat-geotest2 ~]# gitlab-ctl promote-to-primary-node --skip-preflight-checks
Traceback (most recent call last):
        6: from /opt/gitlab/embedded/bin/omnibus-ctl:23:in `<main>'
        5: from /opt/gitlab/embedded/bin/omnibus-ctl:23:in `load'
        4: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/omnibus-ctl-0.6.0/bin/omnibus-ctl:31:in `<top (required)>'
        3: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:746:in `run'
        2: from /opt/gitlab/embedded/lib/ruby/gems/2.6.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:204:in `block in add_command_under_category'
        1: from /opt/gitlab/embedded/service/omnibus-ctl-ee/promote_to_primary_node.rb:21:in `block in load_file'
/opt/gitlab/embedded/service/omnibus-ctl-ee/promotion_preflight_checks.rb:15:in `get_ctl_options': invalid option: --skip-preflight-checks (OptionParser::InvalidOption)
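
The exception above is what Ruby's OptionParser raises when it is handed a flag it has no definition for. A minimal sketch of that behavior, assuming a parser that only knows a made-up `--confirm` flag (the real option list in promotion_preflight_checks.rb is not shown in this issue, and `parse_preflight_options` and `try_parse` are hypothetical names):

```ruby
require 'optparse'

# Hypothetical parser: it defines --confirm but NOT --skip-preflight-checks.
def parse_preflight_options(argv)
  options = {}
  OptionParser.new do |opts|
    opts.on('--confirm', 'Made-up flag, for illustration only') do
      options[:confirm] = true
    end
  end.parse!(argv.dup) # dup so the caller's array is not modified
  options
end

# Return the parsed options, or a description of the parse error.
def try_parse(argv)
  parse_preflight_options(argv)
rescue OptionParser::InvalidOption => e
  "#{e.class}: #{e.message}"
end
```

Calling `try_parse(['--skip-preflight-checks'])` returns a string starting with `OptionParser::InvalidOption: invalid option: --skip-preflight-checks`, which matches the last line of the traceback.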

The difference wasn't obvious, but comparing the straces shows that the files load in a different order:

Ubuntu:

8418  15:55:12.040236 execve("/opt/gitlab/embedded/bin/omnibus-ctl", ["/opt/gitlab/embedded/bin/omnibus-ctl", "gitlab", "/opt/gitlab/embedded/service/omnibus-ctl*", "promote-to-primary-node", "--skip-preflight-checks"], 0x5635d71bb050 /* 21 vars */ <unfinished ...>
8418  15:55:20.725007 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/promotion_preflight_checks.rb>
8418  15:55:20.785545 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/lib/geo/promotion_preflight_checks.rb>
8418  15:55:20.850128 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/promote_to_primary_node.rb>,
8418  15:55:21.134812 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/lib/geo/promote_to_primary_node.rb>

CentOS:

137642 15:51:03.809356 execve("/opt/gitlab/embedded/bin/omnibus-ctl", ["/opt/gitlab/embedded/bin/omnibus-ctl", "gitlab", "/opt/gitlab/embedded/service/omnibus-ctl*", "promote-to-primary-node", "--skip-preflight-checks"], 0x55844f39e550 /* 21 vars */) = 0 <0.000309>
137642 15:51:11.188369 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/promote_to_primary_node.rb>
137642 15:51:11.220625 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/lib/geo/promote_to_primary_node.rb>
137642 15:51:11.241745 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/promotion_preflight_checks.rb>
137642 15:51:11.279924 read(6</opt/gitlab/embedded/service/omnibus-ctl-ee/lib/geo/promotion_preflight_checks.rb>

At this point it's clear that on CentOS the promotion_preflight_checks file loads last, so its get_ctl_options overrides the one from promote_to_primary_node. Why the order differs is not yet clear.
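
The effect of load order on a globally defined method can be reproduced in isolation. The sketch below is an illustration only: the file contents are stand-ins, `stub_files` and `winner_after_loading` are hypothetical names, and it uses Kernel#load, which may not be how omnibus-ctl loads its command files.

```ruby
require 'tmpdir'

# Stand-ins for the two command files. Each defines a top-level
# get_ctl_options, as described in the summary.
def stub_files
  {
    'promote_to_primary_node.rb' =>
      "def get_ctl_options; :promote_options; end\n",
    'promotion_preflight_checks.rb' =>
      "def get_ctl_options; :preflight_options; end\n"
  }
end

# Write the stub files, load them in the given order, and report
# which definition of get_ctl_options is live afterwards.
def winner_after_loading(order)
  Dir.mktmpdir do |dir|
    order.each do |name|
      path = File.join(dir, name)
      File.write(path, stub_files.fetch(name))
      load path # a later definition silently replaces an earlier one
    end
  end
  get_ctl_options
end
```

With the Ubuntu order from the strace (`promotion_preflight_checks.rb` first), `winner_after_loading` returns `:promote_options`. With the CentOS order (`promote_to_primary_node.rb` first), it returns `:preflight_options`, so the promote command ends up parsing its arguments with the wrong option set.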

Calling the getdents syscall directly shows that the two systems return the directory entries in a different order:

Ubuntu:

root@cat-geotest1:/opt/gitlab/embedded/service/omnibus-ctl-ee# /tmp/getdents
--------------- nread=488 ---------------
inode#    file type  d_reclen  d_off   d_name
  537181  directory    24 1352499008977483534  ..
 1052591  regular      56 2504495246694028984  promotion_preflight_checks.rb
 1052590  regular      48 2944828851944531620  promote_to_primary_node.rb
 1052588  regular      32 3855910538002281433  patroni.rb
 1052580  regular      32 5380097782668342599  consul.rb
 1052579  directory    24 5675545248176283607  .
 1052593  regular      32 6203997677802995374  repmgr.rb
 1052594  regular      48 6478499198533784667  set_geo_primary_node.rb
 1052582  regular      48 6794371002025137889  get_postgresql_primary.rb
 1052589  regular      32 6808389166771206748  pgbouncer.rb
 1052583  directory    24 7253071941143058311  lib
 1052581  regular      40 7719487038616488496  geo_replication.rb
 1052592  regular      48 9223372036854775807  replicate_geo_database.rb

CentOS:

[root@cat-geotest2 omnibus-ctl-ee]# /tmp/getdents
--------------- nread=488 ---------------
inode#    file type  d_reclen  d_off   d_name
51032921  directory    24         10  .
 1437530  directory    24         16  ..
51032922  regular      32         24  consul.rb
51032923  regular      40         34  geo_replication.rb
51032924  regular      48         39  get_postgresql_primary.rb
 2961359  directory    24         45  lib
51032925  regular      32         53  patroni.rb
51032926  regular      32         62  pgbouncer.rb
51032927  regular      48         74  promote_to_primary_node.rb
51032928  regular      56         86  promotion_preflight_checks.rb
51032929  regular      48         95  replicate_geo_database.rb
51032930  regular      32        104  repmgr.rb
51032931  regular      48        512  set_geo_primary_node.rb

This explains the different load order, but why is the directory order different? It turns out the default file systems differ: ext4 on Ubuntu, XFS on CentOS.
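
Since directory enumeration order is a file system detail, any loader that relies on it behaves differently across systems. A minimal sketch of enumerating files in a stable order, assuming the loader currently takes names in whatever order the OS returns them (`ruby_files_in_stable_order` is a hypothetical helper, not part of omnibus-ctl):

```ruby
# Return the .rb files directly under dir in a stable,
# file-system-independent order.
def ruby_files_in_stable_order(dir)
  Dir.children(dir)                          # order is whatever the OS returns
     .select { |name| name.end_with?('.rb') }
     .sort                                   # make the order deterministic
     .map { |name| File.join(dir, name) }
end
```

Sorting only makes the behavior consistent; it does not remove the name collision. Alphabetically, promote_to_primary_node.rb sorts before promotion_preflight_checks.rb, which is the same order as the XFS listing above, so a sorted loader would hit the error on every file system.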

Creating a test XFS partition locally on the Ubuntu instance reproduces the issue as well:

Steps to reproduce

# run from /tmp; creates an image file of roughly 512 MB
cd /tmp
dd if=/dev/zero of=xfstest count=999999
/sbin/mkfs -t xfs -q xfstest
mkdir testmount
# /dev/loop4 must be a free loop device
mount -o loop=/dev/loop4 /tmp/xfstest /tmp/testmount
cp -pr /opt/gitlab/embedded/service/omnibus-ctl-ee testmount/

# now, to see the difference:
cd /tmp/testmount/omnibus-ctl-ee && /tmp/getdents
cd /opt/gitlab/embedded/service/omnibus-ctl-ee && /tmp/getdents

Here /tmp/getdents is a compiled C program that calls the getdents syscall, for example the one in the getdents(2) man page.
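
If compiling the C example is not convenient, a rough equivalent can be sketched in Ruby. This is an approximation, not a replacement: it relies on Dir.each_child yielding entries in the order the OS returns them, it omits `.` and `..`, and it cannot show d_reclen or d_off. The names `raw_listing` and `print_listing` are hypothetical.

```ruby
# Collect [inode, file type, name] for each entry of dir, in the
# order the operating system returns them (no sorting applied).
def raw_listing(dir)
  Dir.each_child(dir).map do |name|
    stat = File.lstat(File.join(dir, name))
    [stat.ino, stat.ftype, name]
  end
end

# Print the listing in a layout loosely similar to the getdents output.
def print_listing(dir)
  puts format('%10s  %-10s %s', 'inode#', 'file type', 'd_name')
  raw_listing(dir).each do |ino, type, name|
    puts format('%10d  %-10s %s', ino, type, name)
  end
end
```

Running `print_listing('/opt/gitlab/embedded/service/omnibus-ctl-ee')` on both copies of the directory should show whether the entry order differs between the two file systems.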

What is the current bug behavior?

Files get loaded in a different order, so the --force and --skip-preflight-checks options do not work on XFS (and possibly other) file systems.

What is the expected correct behavior?

--force and --skip-preflight-checks options should work
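
One general way to make option parsing independent of load order is to scope each command's parser instead of sharing a global method name. The sketch below illustrates that idea only; it is not a description of what !4646 does, the module names are hypothetical, and the `--confirm` flag is made up. Only `--force` and `--skip-preflight-checks` come from this issue.

```ruby
require 'optparse'

# Hypothetical namespace for the promote command's options.
module PromoteToPrimaryOptions
  def self.parse(argv)
    options = { force: false, skip_preflight_checks: false }
    OptionParser.new do |opts|
      opts.on('--force') { options[:force] = true }
      opts.on('--skip-preflight-checks') do
        options[:skip_preflight_checks] = true
      end
    end.parse!(argv.dup)
    options
  end
end

# Hypothetical namespace for the preflight command's options.
module PreflightCheckOptions
  def self.parse(argv)
    options = { confirm: false }
    OptionParser.new do |opts|
      opts.on('--confirm') { options[:confirm] = true } # made-up flag
    end.parse!(argv.dup)
    options
  end
end
```

Because the two `parse` methods live in different modules, defining them in either order leaves both intact: `PromoteToPrimaryOptions.parse(['--skip-preflight-checks'])` accepts the flag regardless of which file was loaded last.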

Note: this was fixed in 13.5 by !4646 (merged), but we should still add it to the docs. The current workaround is to run the preflight checks manually and promote the DB, i.e.:

gitlab-ctl promotion-preflight-checks
/opt/gitlab/embedded/bin/gitlab-pg-ctl promote
gitlab-ctl reconfigure
gitlab-rake geo:set_secondary_as_primary