2018-05-09: Staging failover attempt checklist
T minus 3 weeks (Date TBD) (PRODUCTION ONLY)
- [-] Notify content team of upcoming announcements to give them time to prepare blog post, email content. Link to blog MR.
T minus 1 week (Date TBD) (PRODUCTION ONLY)
- [-] Perform Preflight Checklist
- [-] Andrew/Eliran to communicate date to Google
- [-] Andrew to announce the failover date in the #general Slack channel and on the team call.
- [-] Marketing team publishes the blog post about the upcoming GCP failover
- [-] Marketing team sends out an email to all users notifying them that GitLab.com will be undergoing scheduled maintenance. The email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover.
- Details of our backup policies to assure users that their data is safe
T minus 1 day (2018-08-08)
- [ ] Create the QA testing issue using the template: #397 (closed)
- [ ] Perform Preflight Checklist: #394 (closed)
T minus zero (failover day) (2018-08-09)
Failover Procedure
Notify users of scheduled maintenance
- [ ] Create a broadcast message along the lines of "staging.gitlab.com is moving to a new home! Hold onto your hats, we're going dark for approximately 1 hour from XX:XX on 2018-XX-YY". On the secondary, clear the Redis cache to show the broadcast message there too: `sudo gitlab-rake cache:clear:redis`
  - (FIX template: cache clear is unnecessary)
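If the admin UI is awkward to use during the window, the broadcast message can also be created through the API. A minimal sketch, assuming an admin personal access token is available in `ADMIN_TOKEN`; the message text and start/end times are placeholders to fill in:

```shell
curl --request POST \
  --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  --data-urlencode "message=staging.gitlab.com is moving to a new home! We're going dark for approximately 1 hour from XX:XX on 2018-XX-YY" \
  --data "starts_at=2018-XX-YYTXX:00:00Z" \
  --data "ends_at=2018-XX-YYTXX:00:00Z" \
  "https://staging.gitlab.com/api/v4/broadcast_messages"
```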
Snapshot staging machines (STAGING FAILOVER TESTING ONLY)
- Staging is a multi-use environment, and we want to practice failover multiple times with as little friction as possible.
- [ ] To facilitate these cases, we should take a snapshot of at least the database disks on the secondary before starting work. This will allow us to return to the previous state following a failover with a minimum of effort.
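For example, a database disk snapshot could be taken along these lines (a sketch; the project, zone, disk, and snapshot names are placeholders for the real gstg secondary database disks):

```shell
gcloud compute disks snapshot gstg-postgres-01-data \
  --project=example-gstg-project \
  --zone=us-east1-c \
  --snapshot-names=gstg-postgres-01-data-pre-failover
```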
Prevent updates to the primary
- [ ] In Azure, prevent incoming traffic from everyone but those involved in the failover and the `gstg`/`gprd` GCP environment… (https://gitlab.com/gitlab-com/gitlab-com-infrastructure/merge_requests/349)
- [ ] Update Azure NSG (network security groups) to drop external traffic.
- [ ] Ensure git traffic from a third-party IP is blocked
- [ ] Ensure HTTPS traffic from a third-party IP is blocked (a command-line sketch for these checks follows this list)
- [ ] Navigate to https://staging.gitlab.com/admin/background_jobs or https://gitlab.com/admin/background_jobs, press "Cron", then "Disable all", then find the "geo_sidekiq_cron_config_worker" row and press "Enable" on it. A few more crons will self-enable after this point.
- [ ] Navigate to https://staging.gitlab.com/admin/background_jobs or https://gitlab.com/admin/background_jobs, press "Queues", then "Live Poll", and wait for all non-geo queues to reach 0 (geo queues have `geo` in the name)
- Running CI jobs will not be able to push updates at this point. If a job completes during the maintenance window, its status will be lost.
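A sketch of command-line spot checks for this section. The hostnames, the test repository path, and the use of `gitlab-rails runner` are assumptions; run the first two checks from an IP that is not on the failover allow-list, and the last on a Rails node of the primary:

```shell
# HTTPS from a third-party IP should fail rather than return a page
curl --max-time 10 -sS -o /dev/null -w '%{http_code}\n' https://staging.gitlab.com/ \
  || echo "HTTPS traffic blocked as expected"

# Git over SSH from a third-party IP should also fail (repository path is a placeholder)
GIT_SSH_COMMAND='ssh -o ConnectTimeout=10' \
  git ls-remote git@staging.gitlab.com:gitlab-org/gitlab-test.git \
  || echo "git traffic blocked as expected"

# Print Sidekiq queue sizes; all non-geo queues should reach 0 before proceeding
sudo gitlab-rails runner 'Sidekiq::Queue.all.sort_by(&:name).each { |q| puts "#{q.name}: #{q.size}" }'
```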
Finish replicating and verifying all data
- [ ] Navigate to https://gstg.gitlab.com/admin/geo_nodes or https://gprd.gitlab.com/admin/geo_nodes
- [ ] Wait for "repositories synced" and "wikis synced" to reach 100%
  - If needed, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
- [ ] Press the "advanced" tab and wait for "repositories verified" and "wikis verified" to reach 100%
  - If needed, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
- [ ] Wait for the database replication lag to reach 0 ms
- [ ] Wait for the Geo log cursor lag to reach 0 events
- [ ] Navigate to https://gstg.gitlab.com/admin/background_jobs or https://gprd.gitlab.com/admin/background_jobs, press "Queues", then "Live Poll", and wait for all the Geo queues (those with `geo` in the name) to reach 0
At this point all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in sidekiq on the primary or secondary.
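The sync, verification, and lag figures above can also be read from the command line on a secondary Rails node; a sketch, assuming the omnibus Geo rake task is available there:

```shell
# Shows repository/wiki sync and verification counts, database replication lag,
# and the last event processed by the Geo log cursor; repeat until everything is green
sudo gitlab-rake geo:status
```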
Promote the secondary
- [ ] Turn off the Azure environment. Keep everything, just ensure it's turned off
- [ ] Trigger the PostgreSQL failover, making the read-only replica in `gstg`/`gprd` read-writable - #349 (closed)
  - This should not turn the old primary PostgreSQL servers into followers
  - Maybe turn off all the Azure PostgreSQL servers before starting
- [ ] Remove `geo_secondary_role['enable'] = true` from gitlab.rb on every `gstg`/`gprd` node
- [ ] Run `gitlab-ctl reconfigure` on every changed `gstg`/`gprd` node
- [ ] Run `gitlab-rake geo:set_secondary_as_primary` on one of the `gstg`/`gprd` nodes
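Put together, the promotion sequence on a `gstg`/`gprd` node looks roughly like the sketch below. The sed edit of gitlab.rb is an assumption about how the role line is managed; in practice it may be controlled by chef:

```shell
# 1. Disable the Geo secondary role in /etc/gitlab/gitlab.rb (on every node)
sudo sed -i "s/^geo_secondary_role\['enable'\] = true/# geo_secondary_role['enable'] = true/" /etc/gitlab/gitlab.rb

# 2. Apply the configuration change (on every changed node)
sudo gitlab-ctl reconfigure

# 3. Promote the secondary to primary (on ONE node only)
sudo gitlab-rake geo:set_secondary_as_primary
```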
During-Blackout QA
- [ ] All "during the blackout" QA automated tests have succeeded - @meks
- [ ] All "during the blackout" QA manual tests have succeeded - @meks
Complete the Migration
- [ ] Update the `staging.gitlab.com` DNS entries to refer to the GCP load-balancer (a DNS check sketch follows this list)
- [ ] Remove the broadcast message
- [ ] Re-enable mailing queues on sidekiq-asap (revert chef-repo!1922)
  - [ ] `admin_emails` queue
  - [ ] `emails_on_push` queue
  - [ ] `mailers` queue
- [ ] Ensure all "after the blackout" QA automated tests have succeeded - @meks
- [ ] Ensure all "after the blackout" QA manual tests have succeeded - @meks
Failback, discarding changes made to GCP (STAGING ONLY)
Since staging is multi-use and we want to run the failover multiple times anyway, we need these steps.
In the event of discovering a problem while doing the failover on GitLab.com "for real" (i.e., before opening it up to the public), it will also be super-useful to have this procedure documented and tested.
- [ ] Revert the PostgreSQL failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
- [ ] Re-add `geo_secondary_role['enable'] = true` on every gstg node
- [ ] Run `gitlab-ctl reconfigure` on every changed gstg node
- [ ] Update the `staging.gitlab.com` DNS entries to refer to the Azure load-balancer
???
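A sketch of the configuration side of the failback on a gstg node (assumes the role line is simply appended to gitlab.rb rather than managed by chef):

```shell
# 1. Re-enable the Geo secondary role (on every gstg node)
echo "geo_secondary_role['enable'] = true" | sudo tee -a /etc/gitlab/gitlab.rb

# 2. Apply the change (on every changed node)
sudo gitlab-ctl reconfigure

# 3. After pointing staging.gitlab.com back at the Azure load-balancer, verify DNS
dig +short staging.gitlab.com
```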