5-minute GitLab Pages outage during deploy

As part of deployment we issue a number of restarts:

bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-api sudo apt-get install -y -q --force-yes gitlab-ee=1 && sudo gitlab-ctl hup unicorn
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-web sudo apt-get install -y -q --force-yes gitlab-ee=1 && sudo gitlab-ctl hup unicorn
bundle exec knife ssh -e -a fqdn roles:gprd-base-be-sidekiq sudo apt-get install -y -q --force-yes gitlab-ee=1 && sudo gitlab-ctl hup unicorn
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-registry sudo apt-get install -y -q --force-yes gitlab-ee=1
bundle exec knife ssh -e -a fqdn roles:gprd-base-be-mailroom sudo apt-get install -y -q --force-yes gitlab-ee=1
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-git sudo apt-get install -y -q --force-yes gitlab-ee=1 && sudo gitlab-ctl hup unicorn
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-api sudo gitlab-ctl restart gitlab-workhorse
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-web sudo gitlab-ctl restart deploy-page
bundle exec knife ssh -e -a fqdn roles:gprd-base-be-mailroom sudo gitlab-ctl restart
bundle exec knife ssh -e -a fqdn roles:gprd-base-be-sidekiq sudo gitlab-ctl restart nginx
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-git sudo gitlab-ctl restart gitlab-workhorse
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-api sudo gitlab-ctl restart nginx
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-web sudo gitlab-ctl restart gitlab-pages
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-git sudo gitlab-ctl restart nginx
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-api sudo gitlab-ctl restart registry
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-web sudo gitlab-ctl restart gitlab-workhorse
bundle exec knife ssh -e -a fqdn roles:gprd-base-fe-web sudo gitlab-ctl restart nginx

I'm sure there are historical reasons for adding these restarts at the end of deployment, but are they still necessary? ~~Why can we not let the omnibus handle restarts when they are required?~~ Edit: restarts are only handled by takeoff.

Sep  5 09:06:02 web-01-sv-gprd sudo:      bvl : TTY=pts/0 ; PWD=/home/bvl ; USER=root ; COMMAND=/usr/bin/gitlab-ctl restart gitlab-pages

In this particular case, restarting pages means that it fails the healthcheck until all pages on disk are loaded, which can take a while. Because the restart was issued to every web node at once, the pages service went down for 5 minutes.
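
One possible mitigation, sketched below, is to restart pages one node at a time and wait for each node to pass its healthcheck before moving on. This is only a rough sketch, not how takeoff works today: the `wait_until_healthy` helper, the host list, the timeout values and the health URL are all hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: poll a health check command until it succeeds or a timeout
# elapses. wait_until_healthy is a hypothetical helper, not part of takeoff
# or gitlab-ctl.
#
# Usage: wait_until_healthy TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND exits 0, or 1 once the timeout has passed.
wait_until_healthy() {
  wuh_interval="$2"
  wuh_deadline=$(( $(date +%s) + $1 ))
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$wuh_deadline" ]; then
      return 1
    fi
    sleep "$wuh_interval"
  done
  return 0
}

# Hypothetical rolling restart. PAGES_HOSTS and HEALTH_URL_TEMPLATE are
# placeholders, so the loop is left commented out:
#
# for host in $PAGES_HOSTS; do
#   ssh "$host" sudo gitlab-ctl restart gitlab-pages
#   url=$(printf "$HEALTH_URL_TEMPLATE" "$host")
#   wait_until_healthy 600 5 curl -fsS "$url" || exit 1
# done
```

With something like this, only one node is out of rotation at any time, and the deploy stops if a node does not come back within the timeout instead of carrying on to the next one.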

Edited Sep 05, 2018 by John Jarvis