Geo: Some questions on the order of running commands in zero-downtime upgrades

@fzimmer The plot, it finally thickens... 🤔

In https://gitlab.com/gitlab-org/gitlab-ee/issues/12625 we didn't fully understand why we needed to run gitlab-rake db:migrate on the primary. But after reading the multi-node / HA deployment instructions again I understand why: 🔜

Because the user was asked to add gitlab_rails['auto_migrate'] = false to /etc/gitlab/gitlab.rb. 💡

But I do not understand why this is needed in the first place. 😕

To understand this, let's analyze what happens in HA setups. 🔬

In multi-node / HA deployment

You pick one deploy node. This node is used to decide when migrations run. 👑

On that deploy node, you have to create /etc/gitlab/skip-auto-reconfigure to make sure the migrations are not run during apt upgrade gitlab (indirectly through gitlab-ctl reconfigure). ✔

On the other nodes, you need to set gitlab_rails['auto_migrate'] = false in /etc/gitlab/gitlab.rb. From a migrations point of view this has the same effect: migrations do not run automatically on apt upgrade gitlab. ✔

Next you run apt upgrade gitlab and SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-ctl reconfigure on the deploy node. This triggers only the pre-deployment migrations. 👍

Then upgrade all the other nodes with apt upgrade gitlab. These nodes run gitlab-ctl reconfigure (not sure why the command needs to be triggered manually), but do not perform any migrations. 😶

The last step is doing post-deployment migrations on the deploy node. 🏁
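
The HA order above can be sketched as a small shell script. This is only a rough sketch, not the official procedure: the function names and the DRY_RUN switch are made up for illustration, it assumes it runs as root (so no sudo), and it assumes the post-deployment step is gitlab-rake db:migrate, which the steps above do not name explicitly. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the multi-node / HA upgrade order described in this issue.
# DRY_RUN and the function names are hypothetical helpers, not GitLab tooling.
# Dry run is the default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Deploy node, first pass: upgrade and run only the pre-deployment migrations.
deploy_node_pre() {
  run touch /etc/gitlab/skip-auto-reconfigure
  run apt upgrade gitlab   # package name as written in this issue
  run env SKIP_POST_DEPLOYMENT_MIGRATIONS=true gitlab-ctl reconfigure
}

# Every other node: assumes gitlab_rails['auto_migrate'] = false is already
# set in /etc/gitlab/gitlab.rb, so reconfigure performs no migrations.
other_node() {
  run apt upgrade gitlab
  run gitlab-ctl reconfigure
}

# Deploy node, last pass: post-deployment migrations (assumed command).
deploy_node_post() {
  run gitlab-rake db:migrate
}
```

Sourcing the file and calling deploy_node_pre on the deploy node, other_node on each remaining node, and finally deploy_node_post on the deploy node prints the sequence; setting DRY_RUN=0 would execute it instead.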

In Geo (single-machine) deployment

One can think of Geo as a sort of HA setup. So you need a deploy node: the Geo Primary node. 🌎

I understand you do not want the migrations to run automatically on apt upgrade gitlab, but IMHO only one of the two is needed: 🎛

Create /etc/gitlab/skip-auto-reconfigure
Set gitlab_rails['auto_migrate'] = false in /etc/gitlab/gitlab.rb

Since the primary node is considered the equivalent of a deploy node, I'd say only the former. 👑

On the secondary nodes, you might think you need to add gitlab_rails['auto_migrate'] = false in /etc/gitlab/gitlab.rb, but you don't: Omnibus already knows it's a Geo secondary, so that setting is implied. 🔮

Questions

Why does `/etc/gitlab/skip-auto-reconfigure` exist anyway?

On the deploy node we could also:

Set gitlab_rails['auto_migrate'] = false in /etc/gitlab/gitlab.rb
Run apt upgrade gitlab
Run SKIP_POST_DEPLOYMENT_MIGRATIONS=true gitlab-rake db:migrate (instead of SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-ctl reconfigure)

Seems like an equal number of steps? It might be related to when gitlab-ctl reconfigure stops the unicorn and sidekiq processes, but I'm not sure. 🐙

If /etc/gitlab/skip-auto-reconfigure is the way to go for HA, Geo should do the same. 💯

Why hot reload `unicorn` and `sidekiq` after post-deployment migrations and not before?

In my understanding, pre-deployment migrations can be done while the old code is still running. And post-deployment migrations are done while the new code is running.

For example, look at the documentation on dropping columns. In step one we add ignore_column to make sure the code will not look at the column, and at the same time a post-deployment migration is added to actually DROP the column. But in my understanding that means the code with the ignore_column statement needs to be running while the post-deployment migration is dropping the column. And by doing sudo gitlab-ctl hup unicorn after running the post-deployment migrations, that is not the case.
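
To make the ordering concern concrete, here is a sketch of the order I would expect, with the hot reload before the post-deployment migrations. It only prints the commands and runs nothing. The show helper and the function name are made up for illustration, and whether this order is actually safe is exactly the open question.

```shell
#!/bin/sh
# Print-only sketch: show is a hypothetical helper that echoes a command
# instead of running it.
show() {
  echo "+ $*"
}

# Order argued for in this issue (an assumption, not the documented order):
expected_order() {
  # 1. Pre-deployment migrations; the old code may still be serving requests.
  show env SKIP_POST_DEPLOYMENT_MIGRATIONS=true gitlab-ctl reconfigure
  # 2. Hot reload, so the new code (with ignore_column) is the one running.
  show gitlab-ctl hup unicorn
  # 3. Post-deployment migrations, e.g. the one that actually DROPs the column.
  show gitlab-rake db:migrate
}

expected_order
```

The current instructions effectively swap steps 2 and 3, which is where the old code could still be looking at a column that is being dropped.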

Proposed actions

Remove gitlab_rails['auto_migrate'] = false from the Geo instructions
Only ask the user to run SKIP_POST_DEPLOYMENT_MIGRATIONS=true sudo gitlab-ctl reconfigure on the primary; gitlab-rake db:migrate is not needed

Edited Sep 30, 2019 by Toon Claes