2018-05-17: staging failover attempt
T minus 1 day (2018-05-16)
- Create the QA testing issue using the template: #441 (closed)
- Perform Pages Azure-to-GCP rsync
  - Validate lsyncd state
- Perform Preflight Checklist: #439 (closed)
- PRODUCTION ONLY UNTESTED Update GitLab shared runners to expire jobs after 1 hour
T minus 1 hour (2018-05-17 12:00 UTC)
STAGING FAILOVER TESTING ONLY: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the maintenance window will fail, and the artifacts may be lost. To avoid this as much as possible, we'll block runners from picking up any new jobs, starting an hour before the scheduled maintenance window.
- Stop any new GitLab CI jobs from being executed
  - Block `POST /api/v4/jobs/request`
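Once the block is in place, it can be spot-checked from outside the VPN. A minimal sketch; the expected status code is an assumption about how the block is implemented:

```bash
# Sketch only: expect a denied status (e.g. 403) or a connection failure,
# rather than a normal API response, once new job requests are blocked.
curl -s -o /dev/null -w '%{http_code}\n' \
  --request POST \
  --header 'Content-Type: application/json' \
  --data '{}' \
  https://staging.gitlab.com/api/v4/jobs/request
```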
T minus zero (failover day) (Date TBD)
Failover Procedure
Notify users of scheduled maintenance
- Create a broadcast message
  - Navigate to https://staging.gitlab.com/admin/broadcast_messages
  - Text: staging.gitlab.com is moving to a new home! Hold onto your hats, we’re going dark for approximately 1 hour from XX:XX on 2018-XX-YY
  - Start date: now. End date: expected end of the failover window
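If preferred, the same message can be created via the API instead of the admin UI. A minimal sketch, assuming `$GITLAB_ADMIN_TOKEN` holds an admin-scoped personal access token and that the real window is substituted for the placeholder times:

```bash
# Sketch only: substitute the real maintenance window for the placeholder dates.
curl --request POST \
  --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
  --data-urlencode "message=staging.gitlab.com is moving to a new home! Hold onto your hats, we're going dark for approximately 1 hour from XX:XX on 2018-XX-YY" \
  --data "starts_at=2018-XX-YYT12:00:00Z" \
  --data "ends_at=2018-XX-YYT14:00:00Z" \
  "https://staging.gitlab.com/api/v4/broadcast_messages"
```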
STAGING FAILOVER TESTING ONLY Snapshot staging machines
Staging is a multi-use environment, and we want to practice failover multiple times with as little friction as possible. Taking a backup of the database will allow us to recover from the most likely errors without having to rebuild the whole environment.
- Snapshot the database disks on Azure
- Snapshot the database disks on GCP
We should do this asynchronously, but as close to the start of the failover window as possible. E.g., if the failover is planned for 1pm UTC, perhaps at 12:50PM UTC.
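A hedged sketch of the two snapshot commands; the resource group, disk, and zone names below are placeholders:

```bash
# Sketch only: resource group, disk, and zone names are placeholders.
# Azure: snapshot the managed data disk attached to the staging DB node.
az snapshot create \
  --resource-group staging-rg \
  --name staging-db-data-pre-failover \
  --source staging-db-data-disk

# GCP: snapshot the corresponding persistent disk on the gstg DB node.
gcloud compute disks snapshot staging-db-data-disk \
  --zone us-east1-c \
  --snapshot-names staging-db-data-pre-failover
```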
Prevent updates to the primary
- Update Azure NSG (network security groups) to drop non-VPN traffic:
  - Ensure traffic from a non-VPN IP is blocked
    - SSH: `gitlab-rake gitlab:tcp_check[staging.gitlab.com,22]`
    - HTTP: `gitlab-rake gitlab:tcp_check[staging.gitlab.com,80]`
    - HTTPS: `gitlab-rake gitlab:tcp_check[staging.gitlab.com,443]`
    - PRODUCTION ONLY UNTESTED AltSSH: `gitlab-rake gitlab:tcp_check[altssh.gitlab.com,443]`
  - Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
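For reference, a hedged sketch of the kind of deny rule involved, using the Azure CLI; the resource group, NSG, and rule names are placeholders, and the real change should follow however the environment's NSGs are normally managed:

```bash
# Sketch only: resource group and NSG names are placeholders.
# Add a high-priority inbound rule denying internet traffic to the public ports.
az network nsg rule create \
  --resource-group staging-rg \
  --nsg-name staging-lb-nsg \
  --name deny-non-vpn-inbound \
  --priority 100 \
  --direction Inbound \
  --access Deny \
  --protocol '*' \
  --source-address-prefixes Internet \
  --destination-port-ranges 22 80 443
```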
- Disable Sidekiq crons that may cause updates on the primary
  - Navigate to https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
  - Press `Cron` -> `Disable all`
  - Enable `geo_metrics_update_worker`, `geo_prune_event_log_worker` and `geo_repository_verification_primary_batch_worker`
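The same toggles can be applied from a shell on the primary. A minimal sketch, assuming `gitlab-rails runner` is available there and that the crons are registered under the names above (sidekiq-cron's `Job` API):

```bash
# Sketch only: disable every sidekiq-cron job, then re-enable the three
# Geo workers that must keep running on the primary.
sudo gitlab-rails runner '
  keep = %w[geo_metrics_update_worker geo_prune_event_log_worker geo_repository_verification_primary_batch_worker]
  Sidekiq::Cron::Job.all.each(&:disable!)
  keep.each { |name| Sidekiq::Cron::Job.find(name)&.enable! }
'
```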
- Wait for all Sidekiq jobs to complete on the primary
  - Navigate to https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
  - Press `Queues` -> `Live Poll`
  - Wait for all queues not mentioned above to reach 0
  - Wait for the number of `Busy` jobs to reach 0
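The same numbers can be read from a shell if the admin UI is slow. A minimal sketch using Sidekiq's stats API via `gitlab-rails runner` (assumed to be available on the primary):

```bash
# Sketch only: both counts should reach 0 before proceeding.
sudo gitlab-rails runner '
  stats = Sidekiq::Stats.new
  puts "enqueued: #{stats.enqueued}"
  puts "busy:     #{stats.workers_size}"
'
```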
Finish replicating and verifying all data
- Ensure any data not replicated by Geo is replicated manually. We know about these:
  - Container Registry: hopefully this is a shared object storage bucket, in which case this can be removed
  - GitLab Pages: check that lsyncd is up to date? Run the rsync command? (see the sketch after this list)
  - CI traces in Redis: run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
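A hedged sketch of that final Pages rsync, assuming the default omnibus Pages directory on both sides; the destination host is a placeholder:

```bash
# Sketch only: destination host is a placeholder and the default omnibus Pages
# path is assumed on both sides. --delete makes the GCP copy mirror Azure, so
# only run it in this direction while Azure is still the canonical side.
sudo rsync -avH --delete \
  /var/opt/gitlab/gitlab-rails/shared/pages/ \
  pages-sync@gstg-pages.example.internal:/var/opt/gitlab/gitlab-rails/shared/pages/
```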
- Navigate to https://gstg.gitlab.com/admin/geo_nodes or https://gprd.gitlab.com/admin/geo_nodes
- Wait for all repositories and wikis to become synchronized
  - Press "Sync Information"
  - Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
  - If failures appear, see the Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
- Wait for all repositories and wikis to become verified
  - Press "Verification Information"
  - Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
- In "Sync Information", wait for "Data replication lag" to read `1m` or less
- In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
- Wait for all Sidekiq jobs to complete on the secondary
  - Navigate to https://gstg.gitlab.com/admin/background_jobs / https://gprd.gitlab.com/admin/background_jobs
  - Press `Queues` -> `Live Poll`
  - Wait for all queues to reach 0
  - Wait for the number of `Busy` jobs to reach 0
- Now disable all sidekiq-cron jobs on the secondary
  - Navigate to https://gstg.gitlab.com/admin/background_jobs / https://gprd.gitlab.com/admin/background_jobs
  - Press `Cron`
  - Press `Disable all`
At this point, all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in Sidekiq on the primary or secondary, and if we fail over now, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run background synchronization operations against the primary, reducing the chance of errors while it is being promoted.
Promote the secondary
- Gracefully turn off the Azure PostgreSQL instances. Keep everything; just ensure the service is stopped:
  - `gitlab-ctl stop postgresql`
- Trigger the PostgreSQL failover, making the read-only replica in gstg/gprd read-writeable
  - Staging: `sudo /opt/gitlab/embedded/bin/gitlab-pg-ctl promote`
  - Production:
    - Repmgr process for production: #349 (closed)
    - This should not turn the old primary PostgreSQL servers into followers
- Staging:
  - Check the database is now read-write
    - SQL, looking for `f` as the result: `select * from pg_is_in_recovery();`
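The same check can be run from a shell on the promoted database node. A minimal sketch, assuming the omnibus `gitlab-psql` wrapper is present there:

```bash
# Sketch only: pg_is_in_recovery() should return "f" once the replica has been
# promoted and is accepting writes.
sudo gitlab-psql -d gitlabhq_production -c 'SELECT pg_is_in_recovery();'
```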
- Run `gitlab-rake geo:set_secondary_as_primary` on one of the gstg/gprd nodes
- Update the chef configuration according to https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
- Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
  - STAGING: `knife ssh roles:gstg-base 'sudo chef-client'`
  - PRODUCTION UNTESTED: `knife ssh roles:gprd-base 'sudo chef-client'`
STAGING Pages
- DNS CHANGES: In Route 53, point *.githost.io to the new GCP Pages LB
- STOP SERVING PAGES AT OLD IP AND START PROXY SERVICE TO NEW IP: Modify the Azure Pages LB nodes
  - Complete the MR at https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
  - Complete a chef-client run on the `gitlab-base-lb-pages` role
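After the Route 53 change, resolution can be spot-checked from a shell. A minimal sketch; the hostname is a placeholder under the *.githost.io wildcard:

```bash
# Sketch only: query an external resolver and compare the answer with the
# GCP Pages LB address; repeat until cached records expire (bounded by the TTL).
dig +short example.githost.io @8.8.8.8
```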
During-Blackout QA

- All "during the blackout" QA automated tests have succeeded - @meks
- All "during the blackout" QA manual tests have succeeded - @meks
Complete the Migration
- Update the staging.gitlab.com DNS entries to refer to the GCP load-balancer
- Remove the broadcast message (see the API sketch after this list)
- PRODUCTION ONLY UNTESTED Re-enable mailing queues on sidekiq-asap (revert chef-repo!1922)
  - `admin_emails` queue
  - `emails_on_push` queue
  - `mailers` queue
- Ensure all "after the blackout" QA automated tests have succeeded - @meks
- Ensure all "after the blackout" QA manual tests have succeeded - @meks
STAGING ONLY Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times anyway, we need these steps.
In the event of discovering a problem doing the failover on GitLab.com "for real" (i.e. before opening it up to the public), it will also be super-useful to have this documented and tested.
- Re-add `geo_secondary_role['enable'] = true` on every gstg node
- Run `gitlab-ctl reconfigure` on every changed gstg node
- Update the staging.gitlab.com DNS entries to refer to the Azure load-balancer
- Failover went well:
  - Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
  - Delete the snapshots (see the sketch after this list)
- Failover went badly:
  - Restore the database nodes from snapshot
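A hedged sketch of the snapshot cleanup, reusing the placeholder names from the snapshot step above:

```bash
# Sketch only: names match the hypothetical snapshots taken before the failover.
az snapshot delete \
  --resource-group staging-rg \
  --name staging-db-data-pre-failover

gcloud compute snapshots delete staging-db-data-pre-failover --quiet
```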
Turn on the azure environment -
Enable access to the azure environment from the outside world -
Re-enable cronjobs on the primary - Navigate to https://staging.gitlab.com/admin/background_jobs, press "Cron"
- Find the
geo_sidekiq_cron_config_workerrow and press "Enable" on it - All but the Geo-secondary-only queues will be re-enabled
-
Ensure rake gitlab:geo:checkpasses on both primary and secondary