2018-05-17: staging failover attempt
T minus 1 day (2018-05-16)
- Create the QA testing issue using the template: #441 (closed)
- Perform Pages Azure-to-GCP rsync
  - Validate lsyncd state
- Perform Preflight Checklist: #439 (closed)
- PRODUCTION ONLY UNTESTED Update GitLab shared runners to expire jobs after 1 hour
T minus 1 hour (2018-05-17 12:00 UTC)
STAGING FAILOVER TESTING ONLY: to speed up testing, this step can be done less than 1 hour before failover
GitLab runners attempting to post artifacts back to GitLab.com during the maintenance window will fail, and the artifacts may be lost. To avoid this as much as possible, we'll block runners from picking up any new jobs, starting an hour before the scheduled maintenance window.
- Stop any new GitLab CI jobs from being executed
  - Block `POST /api/v4/jobs/request`
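Once the block is in place, it can be spot-checked from outside the VPN. A minimal sketch; the expected status code is an assumption about how the block is implemented:

```bash
# Sketch only: expect a denied status (e.g. 403) or a connection failure,
# rather than a normal API response, once new job requests are blocked.
curl -s -o /dev/null -w '%{http_code}\n' \
  --request POST \
  --header 'Content-Type: application/json' \
  --data '{}' \
  https://staging.gitlab.com/api/v4/jobs/request
```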
T minus zero (failover day) (Date TBD)
Failover Procedure
Notify users of scheduled maintenance
- Create a broadcast message
  - Navigate to https://staging.gitlab.com/admin/broadcast_messages
  - Text: staging.gitlab.com is moving to a new home! Hold onto your hats, we’re going dark for approximately 1 hour from XX:XX on 2018-XX-YY
  - Start date: now. End date: expected end of the failover window
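If preferred, the same message can be created via the API instead of the admin UI. A minimal sketch, assuming `$GITLAB_ADMIN_TOKEN` holds an admin-scoped personal access token and that the real window is substituted for the placeholder times:

```bash
# Sketch only: substitute the real maintenance window for the placeholder dates.
curl --request POST \
  --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
  --data-urlencode "message=staging.gitlab.com is moving to a new home! Hold onto your hats, we're going dark for approximately 1 hour from XX:XX on 2018-XX-YY" \
  --data "starts_at=2018-XX-YYT12:00:00Z" \
  --data "ends_at=2018-XX-YYT14:00:00Z" \
  "https://staging.gitlab.com/api/v4/broadcast_messages"
```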
STAGING FAILOVER TESTING ONLY Snapshot staging machines
Staging is a multi-use environment, and we want to practice failover multiple times with as little friction as possible. Taking a backup of the database will allow us to recover from the most likely errors without having to rebuild the whole environment.
- Snapshot the database disks on Azure
- Snapshot the database disks on GCP
We should do this asynchronously, but as close to the start of the failover window as possible. E.g., if the failover is planned for 1pm UTC, perhaps at 12:50PM UTC.
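A hedged sketch of the two snapshot commands; the resource group, disk, and zone names below are placeholders:

```bash
# Sketch only: resource group, disk, and zone names are placeholders.
# Azure: snapshot the managed data disk attached to the staging DB node.
az snapshot create \
  --resource-group staging-rg \
  --name staging-db-data-pre-failover \
  --source staging-db-data-disk

# GCP: snapshot the corresponding persistent disk on the gstg DB node.
gcloud compute disks snapshot staging-db-data-disk \
  --zone us-east1-c \
  --snapshot-names staging-db-data-pre-failover
```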
Prevent updates to the primary
- Update Azure NSG (network security groups) to drop non-VPN traffic:
  - Ensure traffic from a non-VPN IP is blocked
    - SSH: `gitlab-rake gitlab:tcp_check[staging.gitlab.com,22]`
    - HTTP: `gitlab-rake gitlab:tcp_check[staging.gitlab.com,80]`
    - HTTPS: `gitlab-rake gitlab:tcp_check[staging.gitlab.com,443]`
    - PRODUCTION ONLY UNTESTED AltSSH: `gitlab-rake gitlab:tcp_check[altssh.gitlab.com,443]`
  - Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
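For reference, a hedged sketch of the kind of deny rule involved, using the Azure CLI; the resource group, NSG, and rule names are placeholders, and the real change should follow however the environment's NSGs are normally managed:

```bash
# Sketch only: resource group and NSG names are placeholders.
# Add a high-priority inbound rule denying internet traffic to the public ports.
az network nsg rule create \
  --resource-group staging-rg \
  --nsg-name staging-lb-nsg \
  --name deny-non-vpn-inbound \
  --priority 100 \
  --direction Inbound \
  --access Deny \
  --protocol '*' \
  --source-address-prefixes Internet \
  --destination-port-ranges 22 80 443
```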
- Disable Sidekiq crons that may cause updates on the primary
  - Navigate to https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
  - Press `Cron` -> `Disable all`
  - Enable `geo_metrics_update_worker`, `geo_prune_event_log_worker` and `geo_repository_verification_primary_batch_worker`
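The same toggles can be applied from a shell on the primary. A minimal sketch, assuming `gitlab-rails runner` is available there and that the crons are registered under the names above (sidekiq-cron's `Job` API):

```bash
# Sketch only: disable every sidekiq-cron job, then re-enable the three
# Geo workers that must keep running on the primary.
sudo gitlab-rails runner '
  keep = %w[geo_metrics_update_worker geo_prune_event_log_worker geo_repository_verification_primary_batch_worker]
  Sidekiq::Cron::Job.all.each(&:disable!)
  keep.each { |name| Sidekiq::Cron::Job.find(name)&.enable! }
'
```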
- Wait for all Sidekiq jobs to complete on the primary
  - Navigate to https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
  - Press `Queues` -> `Live Poll`
  - Wait for all queues not mentioned above to reach 0
  - Wait for the number of `Busy` jobs to reach 0
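The same numbers can be read from a shell if the admin UI is slow. A minimal sketch using Sidekiq's stats API via `gitlab-rails runner` (assumed to be available on the primary):

```bash
# Sketch only: both counts should reach 0 before proceeding.
sudo gitlab-rails runner '
  stats = Sidekiq::Stats.new
  puts "enqueued: #{stats.enqueued}"
  puts "busy:     #{stats.workers_size}"
'
```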
Finish replicating and verifying all data
- Ensure any data not replicated by Geo is replicated manually. We know about these:
  - Container Registry: hopefully this is a shared object storage bucket, in which case this can be removed
  - GitLab Pages: check that lsyncd is up to date? Run the rsync command? (see the sketch after this list)
  - CI traces in Redis: run `::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)`
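A hedged sketch of that final Pages rsync, assuming the default omnibus Pages directory on both sides; the destination host is a placeholder:

```bash
# Sketch only: destination host is a placeholder and the default omnibus Pages
# path is assumed on both sides. --delete makes the GCP copy mirror Azure, so
# only run it in this direction while Azure is still the canonical side.
sudo rsync -avH --delete \
  /var/opt/gitlab/gitlab-rails/shared/pages/ \
  pages-sync@gstg-pages.example.internal:/var/opt/gitlab/gitlab-rails/shared/pages/
```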
- Navigate to https://gstg.gitlab.com/admin/geo_nodes or https://gprd.gitlab.com/admin/geo_nodes
- Wait for all repositories and wikis to become synchronized
  - Press "Sync Information"
  - Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
  - If failures appear, see the Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
- Wait for all repositories and wikis to become verified
  - Press "Verification Information"
  - Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
- In "Sync Information", wait for "Data replication lag" to read `1m` or less
- In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
- Wait for all Sidekiq jobs to complete on the secondary
  - Navigate to https://gstg.gitlab.com/admin/background_jobs / https://gprd.gitlab.com/admin/background_jobs
  - Press `Queues` -> `Live Poll`
  - Wait for all queues to reach 0
  - Wait for the number of `Busy` jobs to reach 0
- Now disable all sidekiq-cron jobs on the secondary
  - Navigate to https://gstg.gitlab.com/admin/background_jobs / https://gprd.gitlab.com/admin/background_jobs
  - Press `Cron`
  - Press `Disable all`
At this point, all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in Sidekiq on the primary or secondary, and if we fail over now, no data will be lost.
Stopping all cronjobs on the secondary means it will no longer attempt to run background synchronization operations against the primary, reducing the chance of errors while it is being promoted.
Promote the secondary
- Gracefully turn off the Azure PostgreSQL instances. Keep everything; just ensure the service is stopped:
  - `gitlab-ctl stop postgresql`
- Trigger the PostgreSQL failover, making the read-only replica in gstg/gprd read-writeable
  - Staging: `sudo /opt/gitlab/embedded/bin/gitlab-pg-ctl promote`
  - Production:
    - Repmgr process for production: #349 (closed)
    - This should not turn the old primary PostgreSQL servers into followers
- Staging:
  - Check the database is now read-write
    - SQL, looking for `f` as the result: `select * from pg_is_in_recovery();`
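The same check can be run from a shell on the promoted database node. A minimal sketch, assuming the omnibus `gitlab-psql` wrapper is present there:

```bash
# Sketch only: pg_is_in_recovery() should return "f" once the replica has been
# promoted and is accepting writes.
sudo gitlab-psql -d gitlabhq_production -c 'SELECT pg_is_in_recovery();'
```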
- Run `gitlab-rake geo:set_secondary_as_primary` on one of the gstg/gprd nodes
- Update the chef configuration according to https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
- Run `chef-client` on every node to ensure Chef changes are applied and all Geo secondary services are stopped
  - STAGING: `knife ssh roles:gstg-base 'sudo chef-client'`
  - PRODUCTION UNTESTED: `knife ssh roles:gprd-base 'sudo chef-client'`
STAGING Pages
- DNS CHANGES: In Route 53, point *.githost.io to the new GCP Pages LB
- STOP SERVING PAGES AT OLD IP AND START PROXY SERVICE TO NEW IP: Modify the Azure Pages LB nodes
  - Complete the MR at https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
  - Complete a chef-client run on the `gitlab-base-lb-pages` role
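After the Route 53 change, resolution can be spot-checked from a shell. A minimal sketch; the hostname is a placeholder under the *.githost.io wildcard:

```bash
# Sketch only: query an external resolver and compare the answer with the
# GCP Pages LB address; repeat until cached records expire (bounded by the TTL).
dig +short example.githost.io @8.8.8.8
```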
During-Blackout QA

- All "during the blackout" QA automated tests have succeeded - @meks
- All "during the blackout" QA manual tests have succeeded - @meks
Complete the Migration
- Update the staging.gitlab.com DNS entries to refer to the GCP load-balancer
- Remove the broadcast message (see the API sketch after this list)
- PRODUCTION ONLY UNTESTED Re-enable mailing queues on sidekiq-asap (revert chef-repo!1922)
  - `admin_emails` queue
  - `emails_on_push` queue
  - `mailers` queue
- Ensure all "after the blackout" QA automated tests have succeeded - @meks
- Ensure all "after the blackout" QA manual tests have succeeded - @meks
STAGING ONLY Failback, discarding changes made to GCP
Since staging is multi-use and we want to run the failover multiple times anyway, we need these steps.
In the event of discovering a problem doing the failover on GitLab.com "for real" (i.e. before opening it up to the public), it will also be super-useful to have this documented and tested.
- Re-add `geo_secondary_role['enable'] = true` on every gstg node
- Run `gitlab-ctl reconfigure` on every changed gstg node
- Update the staging.gitlab.com DNS entries to refer to the Azure load-balancer
- Failover went well:
  - Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
  - Delete the snapshots (see the sketch after this list)
- Failover went badly:
  - Restore the database nodes from snapshot
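A hedged sketch of the snapshot cleanup, reusing the placeholder names from the snapshot step above:

```bash
# Sketch only: names match the hypothetical snapshots taken before the failover.
az snapshot delete \
  --resource-group staging-rg \
  --name staging-db-data-pre-failover

gcloud compute snapshots delete staging-db-data-pre-failover --quiet
```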
Turn on the azure environment -
Enable access to the azure environment from the outside world -
Re-enable cronjobs on the primary - Navigate to https://staging.gitlab.com/admin/background_jobs, press "Cron"
- Find the
geo_sidekiq_cron_config_workerrow and press "Enable" on it - All but the Geo-secondary-only queues will be re-enabled
-
Ensure rake gitlab:geo:checkpasses on both primary and secondary