2018-05-09: Staging failover attempt checklist
T minus 3 weeks (Date TBD) (PRODUCTION ONLY)
- [-] Notify content team of upcoming announcements to give them time to prepare blog post, email content. Link to blog MR.
T minus 1 week (Date TBD) (PRODUCTION ONLY)
- [-] Perform Preflight Checklist
- [-] Andrew/Eliran to communicate date to Google
- [-] Andrew to announce the failover date in the #general Slack channel and on the team call.
- [-] Marketing team publishes the blog post about the upcoming GCP failover
- [-] Marketing team sends out an email to all users notifying them that GitLab.com will be undergoing scheduled maintenance. The email should include points on:
- Users should expect to have to re-authenticate after the outage, as authentication cookies will be invalidated after the failover.
- Details of our backup policies to assure users that their data is safe
T minus 1 day (2018-08-08)
- [ ] Create the QA testing issue using the template: #397 (closed)
- [ ] Perform Preflight Checklist: #394 (closed)
T minus zero (failover day) (2018-08-09)
Failover Procedure
Notify users of scheduled maintenance
- [ ] Create a broadcast message along the lines of "staging.gitlab.com is moving to a new home! Hold onto your hats, we're going dark for approximately 1 hour from XX:XX on 2018-XX-YY". On the secondary, clear the Redis cache to show the broadcast message there too: `sudo gitlab-rake cache:clear:redis`
  - (FIX template: cache clear is unnecessary)
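If the admin UI is awkward to use during the window, the broadcast message can also be created through the API. A minimal sketch, assuming an admin personal access token is available in `ADMIN_TOKEN`; the message text and start/end times are placeholders to fill in:

```shell
curl --request POST \
  --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  --data-urlencode "message=staging.gitlab.com is moving to a new home! We're going dark for approximately 1 hour from XX:XX on 2018-XX-YY" \
  --data "starts_at=2018-XX-YYTXX:00:00Z" \
  --data "ends_at=2018-XX-YYTXX:00:00Z" \
  "https://staging.gitlab.com/api/v4/broadcast_messages"
```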
Snapshot staging machines (STAGING FAILOVER TESTING ONLY)
- Staging is a multi-use environment, and we want to practice failover multiple times with as little friction as possible.
- [ ] To facilitate these cases, we should take a snapshot of at least the database disks on the secondary before starting work. This will allow us to return to the previous state following a failover with a minimum of effort.
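For example, a database disk snapshot could be taken along these lines (a sketch; the project, zone, disk, and snapshot names are placeholders for the real gstg secondary database disks):

```shell
gcloud compute disks snapshot gstg-postgres-01-data \
  --project=example-gstg-project \
  --zone=us-east1-c \
  --snapshot-names=gstg-postgres-01-data-pre-failover
```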
Prevent updates to the primary
- [ ] In Azure, prevent incoming traffic from everyone but those involved in the failover and the `gstg`/`gprd` GCP environment… (https://gitlab.com/gitlab-com/gitlab-com-infrastructure/merge_requests/349)
- [ ] Update Azure NSG (network security groups) to drop external traffic.
- [ ] Ensure git traffic from a third-party IP is blocked
- [ ] Ensure HTTPS traffic from a third-party IP is blocked (a command-line sketch for these checks follows this list)
- [ ] Navigate to https://staging.gitlab.com/admin/background_jobs or https://gitlab.com/admin/background_jobs, press "Cron", then "Disable all", then find the "geo_sidekiq_cron_config_worker" row and press "Enable" on it. A few more crons will self-enable after this point.
- [ ] Navigate to https://staging.gitlab.com/admin/background_jobs or https://gitlab.com/admin/background_jobs, press "Queues", then "Live Poll", and wait for all non-geo queues to reach 0 (geo queues have `geo` in the name)
- Running CI jobs will not be able to push updates at this point. If a job completes during the maintenance window, its status will be lost.
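A sketch of command-line spot checks for this section. The hostnames, the test repository path, and the use of `gitlab-rails runner` are assumptions; run the first two checks from an IP that is not on the failover allow-list, and the last on a Rails node of the primary:

```shell
# HTTPS from a third-party IP should fail rather than return a page
curl --max-time 10 -sS -o /dev/null -w '%{http_code}\n' https://staging.gitlab.com/ \
  || echo "HTTPS traffic blocked as expected"

# Git over SSH from a third-party IP should also fail (repository path is a placeholder)
GIT_SSH_COMMAND='ssh -o ConnectTimeout=10' \
  git ls-remote git@staging.gitlab.com:gitlab-org/gitlab-test.git \
  || echo "git traffic blocked as expected"

# Print Sidekiq queue sizes; all non-geo queues should reach 0 before proceeding
sudo gitlab-rails runner 'Sidekiq::Queue.all.sort_by(&:name).each { |q| puts "#{q.name}: #{q.size}" }'
```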
Finish replicating and verifying all data
- [ ] Navigate to https://gstg.gitlab.com/admin/geo_nodes or https://gprd.gitlab.com/admin/geo_nodes
- [ ] Wait for "repositories synced" and "wikis synced" to reach 100%
  - If needed, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
- [ ] Press the "advanced" tab and wait for "repositories verified" and "wikis verified" to reach 100%
  - If needed, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
- [ ] Wait for the database replication lag to reach 0 ms
- [ ] Wait for the Geo log cursor lag to reach 0 events
- [ ] Navigate to https://gstg.gitlab.com/admin/background_jobs or https://gprd.gitlab.com/admin/background_jobs, press "Queues", then "Live Poll", and wait for all the Geo queues (those with `geo` in the name) to reach 0
At this point all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in sidekiq on the primary or secondary.
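The sync, verification, and lag figures above can also be read from the command line on a secondary Rails node; a sketch, assuming the omnibus Geo rake task is available there:

```shell
# Shows repository/wiki sync and verification counts, database replication lag,
# and the last event processed by the Geo log cursor; repeat until everything is green
sudo gitlab-rake geo:status
```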
Promote the secondary
- [ ] Turn off the Azure environment. Keep everything, just ensure it's turned off
- [ ] Trigger the PostgreSQL failover, making the read-only replica in `gstg`/`gprd` read-writable - #349 (closed)
  - This should not turn the old primary PostgreSQL servers into followers
  - Maybe turn off all the Azure PostgreSQL servers before starting
- [ ] Remove `geo_secondary_role['enable'] = true` from gitlab.rb on every `gstg`/`gprd` node
- [ ] Run `gitlab-ctl reconfigure` on every changed `gstg`/`gprd` node
- [ ] Run `gitlab-rake geo:set_secondary_as_primary` on one of the `gstg`/`gprd` nodes
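Put together, the promotion sequence on a `gstg`/`gprd` node looks roughly like the sketch below. The sed edit of gitlab.rb is an assumption about how the role line is managed; in practice it may be controlled by chef:

```shell
# 1. Disable the Geo secondary role in /etc/gitlab/gitlab.rb (on every node)
sudo sed -i "s/^geo_secondary_role\['enable'\] = true/# geo_secondary_role['enable'] = true/" /etc/gitlab/gitlab.rb

# 2. Apply the configuration change (on every changed node)
sudo gitlab-ctl reconfigure

# 3. Promote the secondary to primary (on ONE node only)
sudo gitlab-rake geo:set_secondary_as_primary
```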
During-Blackout QA
- [ ] All "during the blackout" QA automated tests have succeeded - @meks
- [ ] All "during the blackout" QA manual tests have succeeded - @meks
Complete the Migration
- [ ] Update the `staging.gitlab.com` DNS entries to refer to the GCP load-balancer (a DNS check sketch follows this list)
- [ ] Remove the broadcast message
- [ ] Re-enable mailing queues on sidekiq-asap (revert chef-repo!1922)
  - [ ] `admin_emails` queue
  - [ ] `emails_on_push` queue
  - [ ] `mailers` queue
- [ ] Ensure all "after the blackout" QA automated tests have succeeded - @meks
- [ ] Ensure all "after the blackout" QA manual tests have succeeded - @meks
Failback, discarding changes made to GCP (STAGING ONLY)
Since staging is multi-use and we want to run the failover multiple times anyway, we need these steps.
In the event of discovering a problem while doing the failover on GitLab.com "for real" (i.e., before opening it up to the public), it will also be super-useful to have this procedure documented and tested.
- [ ] Revert the PostgreSQL failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
- [ ] Re-add `geo_secondary_role['enable'] = true` on every gstg node
- [ ] Run `gitlab-ctl reconfigure` on every changed gstg node
- [ ] Update the `staging.gitlab.com` DNS entries to refer to the Azure load-balancer
???
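A sketch of the configuration side of the failback on a gstg node (assumes the role line is simply appended to gitlab.rb rather than managed by chef):

```shell
# 1. Re-enable the Geo secondary role (on every gstg node)
echo "geo_secondary_role['enable'] = true" | sudo tee -a /etc/gitlab/gitlab.rb

# 2. Apply the change (on every changed node)
sudo gitlab-ctl reconfigure

# 3. After pointing staging.gitlab.com back at the Azure load-balancer, verify DNS
dig +short staging.gitlab.com
```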