Investigate the Geo update process using Omnibus

TL;DR

The purpose of this effort is to follow the Geo update documentation from a single entry point through to a successful update, and to note any pain points along the way.

To do so, I upgraded a simple Geo setup from 11.11.5 to 12.0.3.

The process is cumbersome and involves multiple steps; however, it is linear and I executed it without error on the first try.

Next steps

Determine a technical solution to reduce the number of individual steps needed for a Geo upgrade. Currently, 8 steps are needed for the primary and secondary: https://gitlab.com/gitlab-org/gitlab-ee/issues/12746
Create an issue to improve the ambiguous parts of the documentation.
Determine where the instructions for an HA and/or zero-downtime deployment for Geo are located.
Determine whether I accidentally followed the HA/zero-downtime instructions.

Setup scenario

Simple deployment with a primary and a single secondary (I used @vsizov's Ansible scripts)

graph LR;
     fz-geo-primary --> fz-geo-secondary

No HA
Updates with downtime (not zero-downtime)
Using Omnibus packages from a repository (not downloaded manually) on Ubuntu

Following the documentation

I entered the documentation by browsing to https://docs.gitlab.com/ and searching for "Geo update" in the central search bar. The first relevant hit brought me to https://docs.gitlab.com/ee/administration/geo/replication/updating_the_geo_nodes.html#general-update-steps which was good, because I could continue immediately.

Given that my installed GitLab version is 11.11.5, I ignored the long list of old instructions.

all you need to do is update GitLab itself:

This is not actually correct. Currently, I also have to update the tracking database on secondary nodes (see https://gitlab.com/gitlab-org/gitlab-ee/issues/12270) and test the result. We may consider rephrasing this.

Log into each node (primary and secondary nodes).

I assume this means "to a console on your node". This can mean I have to open as many terminal sessions as there are Geo nodes. For large installations, this suggestion may cause confusion. Maybe this can be rephrased or removed, as it is repeated later.

Update GitLab.

I followed the link to the documentation here and arrived at https://docs.gitlab.com/ee/update/README.html and from there followed the Omnibus package installation instructions, which live elsewhere: https://docs.gitlab.com/omnibus/update/README.html

Here I was confused by the structure of the page. There are separate instructions for Zero downtime updates, a single deployment, Multi-node HA and Geo deployment.

Question: What about a combination of these things? What is the path for an HA deployment with Geo enabled and Zero downtime? Is this even supported?

Answer: The indentation in the documentation was wrong. These instructions are for zero-downtime deployments. There is currently no dedicated documentation for HA Geo (it means different things to different people).

Resolution: Fixed in: omnibus-gitlab!3431 (merged)

I assumed that in my setup (see above) the Geo deployment instructions were what I was looking for. This felt strange, as the statement in the Geo documentation suggested that I just have to update GitLab, but now this is Geo-specific?

Question: Is this true? I have a feeling these instructions may be for HA/Zero downtime but I am unsure because it is a top-level documentation entry.

Answer: These instructions are only relevant for zero downtime deployments. You can otherwise use the instructions for https://docs.gitlab.com/omnibus/update/README.html#updating-using-the-official-repositories in a simple setup.

While browsing the instructions I realized that the first step is to update the primary and then a secondary. I don't think there was a need to open all terminal windows in parallel?

Resolution: Opened https://gitlab.com/gitlab-org/gitlab-ee/issues/12773; it will be addressed in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/32212

Primary

1. Ensure that gitlab_rails['auto_migrate'] = false is set in /etc/gitlab/gitlab.rb

Why? Maybe a link to https://docs.gitlab.com/omnibus/settings/database.html#disabling-automatic-database-migration?

Question: If this is always required for updates, why not change it automatically? (not Geo-specific)

Answer: (Toon) As stated above, this is only required for zero-downtime updates.

Resolution: Also clarified in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/32212.

Question: What about after the installation? Do I need to reset this setting?

Answer: ~~Unclear at this moment.~~

Answer: (Toon) It does not reset itself, because it assumes you'll want to do zero-downtime upgrades every time.

Resolution: To be addressed in omnibus-gitlab#4649 (closed)

2. Create an empty file at /etc/gitlab/skip-auto-reconfigure. During software installation only, this will prevent the upgrade from running gitlab-ctl reconfigure and automatically running database migrations.

I think this is outdated as of %12.0 and needs updating: https://docs.gitlab.com/omnibus/update/gitlab_12_changes.html#removal-of-support-for-etcgitlabskip-auto-migrations-file

Answer: (Toon) Not true, they just renamed the file because its purpose became slightly different.

Resolution: NOOP
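
For reference, the two preparation steps above can be sketched as a small shell helper. This is only a sketch under assumptions: the function names and the GITLAB_RB / SKIP_FILE variables are hypothetical and not part of Omnibus, and the check only looks for an uncommented gitlab_rails['auto_migrate'] = false line.

```shell
# Paths default to the ones named in the documentation; both variables are
# hypothetical and exist only so the helper can be pointed at other files.
GITLAB_RB="${GITLAB_RB:-/etc/gitlab/gitlab.rb}"
SKIP_FILE="${SKIP_FILE:-/etc/gitlab/skip-auto-reconfigure}"

auto_migrate_disabled() {
  # Succeeds if an uncommented line sets gitlab_rails['auto_migrate'] = false
  grep -Eq "^[[:space:]]*gitlab_rails\['auto_migrate'\][[:space:]]*=[[:space:]]*false" "$GITLAB_RB"
}

prepare_node() {
  # Refuse to continue until automatic migrations are disabled
  if ! auto_migrate_disabled; then
    echo "set gitlab_rails['auto_migrate'] = false in $GITLAB_RB first" >&2
    return 1
  fi
  # Create the empty marker file that makes the package skip reconfigure
  touch "$SKIP_FILE"
}
```

On a real node, prepare_node would have to run as root, since both default paths live under /etc/gitlab.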

3. Update the GitLab package

No questions, this looks simple enough.
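
On Ubuntu with the official repository, this step amounts to roughly the following. A sketch, assuming the package is named gitlab-ee and that a version can be pinned with the apt package=version syntax; run and update_gitlab_package are hypothetical helper names, and the commands are only printed unless RUN=1 is set.

```shell
# Dry run by default: print each command; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

update_gitlab_package() {
  # $1: optional package version to pin, e.g. 12.0.3-ee.0 (assumed format)
  pkg="gitlab-ee"
  if [ -n "${1:-}" ]; then
    pkg="gitlab-ee=$1"
  fi
  run sudo apt-get update
  run sudo apt-get install -y "$pkg"
}
```

Calling update_gitlab_package 12.0.3-ee.0 prints the two apt-get commands for the target version used in this investigation.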

4. To get the database migrations in place, run SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-ctl reconfigure

Question: Why do I need SKIP_POST_DEPLOYMENT_MIGRATIONS?

Answer: (Toon) Because in a zero-downtime upgrade you want to have control over when things happen.

Resolution: NOOP

5. Run non-post-deployment database migrations: SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-rake db:migrate

Question: Does this command do something different depending on whether this variable is passed? See step 6.

Answer: (Toon) There are 2 types of migrations: those you can run while an older version is still running, and those that should run while or after the new version is running.

Resolution: NOOP

6. Run post-deployment database migrations: sudo gitlab-rake db:migrate

This feels repetitive, and I had to read carefully to notice that step 6 is the post-deployment one (now SKIP_POST_DEPLOYMENT_MIGRATIONS makes sense).

Question: Could steps 4-6 be simplified into one step?

Answer: Probably yes. See https://gitlab.com/gitlab-org/gitlab-ee/issues/12746

Answer: (Toon) See the answer at step 5.

Resolution: NOOP
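
As an illustration of what a single step could look like, the three migration commands quoted above can be chained so that each one runs only if the previous one succeeded. This is a sketch, not a proposal for the actual fix: run and migrate_primary are hypothetical names, the commands are only printed unless RUN=1 is set, and env VAR=value sudo ... is just the quoted VAR=value sudo ... form spelled out. Whether the variable survives sudo depends on the local sudoers policy, which this sketch does not address.

```shell
# Dry run by default: print each command; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Order: reconfigure and regular migrations with post-deployment migrations
# held back, then the post-deployment migrations. Stops at the first failure.
migrate_primary() {
  run env SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-ctl reconfigure &&
    run env SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-rake db:migrate &&
    run sudo gitlab-rake db:migrate
}
```

Keeping the phases as separate commands inside one helper preserves the control over timing that Toon describes for zero-downtime upgrades, while removing the copy-paste.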

7. Hot reload unicorn and sidekiq services

I don't have context on this, but I am just doing it :)

Question: (Toon) I'm not sure why this happens after doing the post-deploy migrations?

Resolution: Opened omnibus-gitlab#4637 (closed)

8. Verify Geo configuration and dependencies

Great tool! Can't we run this automatically every time Geo is updated? I would like to see the Geo role included in the rake task!
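
Running the check automatically could look something like the sketch below. It assumes that the command behind this step is sudo gitlab-rake gitlab:geo:check and that its exit status reflects the result, neither of which is confirmed in these notes; GEO_CHECK_CMD and verify_geo_after_update are hypothetical names.

```shell
# Command to run after an update. The default is an assumption about which
# rake task implements the "Verify Geo configuration" step.
GEO_CHECK_CMD="${GEO_CHECK_CMD:-sudo gitlab-rake gitlab:geo:check}"

verify_geo_after_update() {
  # Intentionally unquoted so the command string splits into words
  if $GEO_CHECK_CMD; then
    echo "Geo check passed"
  else
    echo "Geo check failed; review the output above" >&2
    return 1
  fi
}
```

A package post-install hook or a wrapper script could call verify_geo_after_update as its last step and abort on a non-zero return.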

Secondary

Almost all of the comments above apply here as well.

Question: Why do I not need to run non-post-deployment migrations? Why are they not called just deployment migrations?

Answer: (Toon) I think SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-rake db:migrate on the primary is not needed. It should be done by SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-ctl reconfigure.

Resolution: Will be addressed in https://gitlab.com/gitlab-org/gitlab-ee/issues/12746

By now I had almost forgotten that I still have to do things for the secondary. sudo gitlab-rake geo:status already reports healthy, too?

I looked at the initial documentation again and realised that I had already run that command. The same applies to checking the Geo status.

Question: What does "you might need to run migrations on the tracking database" mean? When do I need to? How do I know?

Answer: (Toon) A primary has only one db: the main one. It has read-write access to that db, so it should perform the migrations. A secondary has a read-only replica of that database, so when the primary runs the migrations, they get replicated, and thus the secondary doesn't have to migrate the main db. But it has a second db, the tracking database. That db is specific to that secondary only, and therefore the secondary needs to make sure all migrations on that db are done. The process is very similar to that on the primary; it's just a db with different data. The Omnibus package should know by itself whether the node is primary or secondary, and take the right steps based on that.

Resolution: Addressed in https://gitlab.com/gitlab-org/gitlab-ee/issues/12270
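
Toon's explanation can be summarised as a mapping from node role to migration command. A sketch, assuming geo:db:migrate is the rake task that migrates the tracking database (it is not named in the text above); migration_command_for is a hypothetical helper that only prints the command and does not detect the role itself.

```shell
migration_command_for() {
  # $1: node role, either "primary" or "secondary"
  case "$1" in
    # The primary owns the main database and migrates it
    primary)   echo "sudo gitlab-rake db:migrate" ;;
    # The secondary only migrates its own tracking database (assumed task name)
    secondary) echo "sudo gitlab-rake geo:db:migrate" ;;
    *)         echo "unknown role: $1" >&2; return 1 ;;
  esac
}
```

For example, migration_command_for secondary prints the tracking database command, while an unrecognised role returns a non-zero status.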

Results

The process overall is pretty linear and, other than being a lot of copy-paste, relatively easy to follow. The documentation flow should be improved, and the structure confused me at times. In particular, I am not sure whether the Geo deployment instructions are only relevant to HA/zero-downtime deployments. Given that the steps listed in General update steps appear to be already executed when following the Geo deployment instructions, this might be the case.

Edited Aug 28, 2019 by Toon Claes