2018-05-17: staging failover attempt

T minus 1 day (2018-05-16)

  1. Create the QA testing issue using the template: #441 (closed)
  2. Perform the Pages Azure-to-GCP rsync (a command sketch follows this list)
    • Validate lsyncd state
  3. Perform Preflight Checklist: #439 (closed)
  4. PRODUCTION ONLY UNTESTED Update GitLab shared runners to expire jobs after 1 hour
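
A minimal sketch of the Pages sync in step 2, assuming the default omnibus Pages path and a hypothetical destination host; the real transfer may be driven by lsyncd rather than a one-off rsync:

  # Hypothetical one-off sync of Pages data from the Azure host to its GCP
  # counterpart; the path is the omnibus default, the hostname is a placeholder.
  sudo rsync -aHv /var/opt/gitlab/gitlab-rails/shared/pages/ \
    <gcp-pages-host>:/var/opt/gitlab/gitlab-rails/shared/pages/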

T minus 1 hour (2018-05-17 12:00 UTC)

STAGING FAILOVER TESTING ONLY: to speed up testing, this step can be done less than 1 hour before failover

GitLab runners attempting to post artifacts back to GitLab.com during the maintenance window will fail, and those artifacts may be lost. To minimize this, we'll block new runner jobs from being picked up, starting an hour before the scheduled maintenance window.

  1. Stop any new GitLab CI jobs from being executed
    • Block POST /api/v4/jobs/request (a quick verification sketch follows)
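
To confirm the block took effect, compare the response to a job request before and after enabling the rule; a sketch, assuming the block is enforced in front of the application and returns a distinctive status code:

  # Expect the status code to change once the path is blocked (the exact code
  # depends on how the blocking rule is implemented).
  curl -s -o /dev/null -w '%{http_code}\n' -X POST \
    https://staging.gitlab.com/api/v4/jobs/request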

T minus zero (failover day) (Date TBD)

Failover Procedure

Notify users of scheduled maintenance

  • Create a broadcast message (via the admin UI steps below; an API sketch follows this list)
    • Navigate to https://staging.gitlab.com/admin/broadcast_messages
    • Text: staging.gitlab.com is moving to a new home! Hold onto your hats, we’re going dark for approximately 1 hour from XX:XX on 2018-XX-YY
    • Start date: now. End date: expected end of the failover window
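
If we want to script this instead of using the admin UI, the broadcast messages API accepts the same fields; a sketch, where GITLAB_ADMIN_TOKEN and the timestamps are placeholders:

  # Create the maintenance banner via the API (admin personal access token required).
  curl --request POST \
    --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
    --data-urlencode "message=staging.gitlab.com is moving to a new home! Hold onto your hats, we're going dark for approximately 1 hour from XX:XX on 2018-XX-YY" \
    --data "starts_at=<window start, ISO 8601>" \
    --data "ends_at=<window end, ISO 8601>" \
    https://staging.gitlab.com/api/v4/broadcast_messages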

STAGING FAILOVER TESTING ONLY Snapshot staging machines

Staging is a multi-use environment, and we want to practice the failover multiple times with as little friction as possible. Taking a backup of the database will allow us to recover from the most likely errors without having to rebuild the whole environment. The snapshot commands are sketched below.

  1. Snapshot the database disks on Azure
  2. Snapshot the database disks on GCP

We should do this asynchronously, but as close to the start of the failover window as possible: e.g., if the failover is planned for 13:00 UTC, take the snapshots at around 12:50 UTC.
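
A sketch of the snapshot commands, with disk names, zones, and resource groups as placeholders for whatever the staging inventory actually uses:

  # GCP: snapshot a staging database data disk (disk, zone and snapshot names
  # are placeholders).
  gcloud compute disks snapshot <gstg-db-data-disk> --zone <zone> \
    --snapshot-names gstg-db-pre-failover-$(date +%Y%m%d)

  # Azure: snapshot the equivalent managed disk (resource group and disk are
  # placeholders).
  az snapshot create --resource-group <staging-rg> \
    --name staging-db-pre-failover-$(date +%Y%m%d) \
    --source <staging-db-data-disk>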

Prevent updates to the primary

  1. Update Azure NSG (network security groups) to drop non-VPN traffic:
    • https://gitlab.com/gitlab-com/gitlab-com-infrastructure/merge_requests/349
  2. Ensure traffic from a non-VPN IP is blocked
    1. SSH: gitlab-rake gitlab:tcp_check[staging.gitlab.com,22]
    2. HTTP: gitlab-rake gitlab:tcp_check[staging.gitlab.com,80]
    3. HTTPS: gitlab-rake gitlab:tcp_check[staging.gitlab.com,443]
    4. PRODUCTION ONLY UNTESTED AltSSH: gitlab-rake gitlab:tcp_check[altssh.gitlab.com,443]
  3. Running CI jobs will no longer be able to push updates. Jobs that complete now may be lost.
  4. Disable Sidekiq crons that may cause updates on the primary
    1. Navigate to https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
    2. Press Cron -> Disable all
    3. Enable geo_metrics_update_worker, geo_prune_event_log_worker and geo_repository_verification_primary_batch_worker
  5. Wait for all Sidekiq jobs to complete on the primary (a console equivalent of these UI checks is sketched after this list)
    1. Navigate to https://staging.gitlab.com/admin/background_jobs / https://gitlab.com/admin/background_jobs
    2. Press Queues -> Live Poll
    3. Wait for all queues not mentioned above to reach 0
    4. Wait for the number of Busy jobs to reach 0
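
A console equivalent of the queue checks above; a sketch to be run on a Rails node in the primary environment:

  # List any non-empty Sidekiq queues and the number of busy jobs; both
  # should reach zero before proceeding.
  sudo gitlab-rails runner '
    Sidekiq::Queue.all.select { |q| q.size > 0 }.each { |q| puts "#{q.name}: #{q.size}" }
    puts "busy: #{Sidekiq::Workers.new.size}"
  '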

Finish replicating and verifying all data

  1. Ensure any data not replicated by Geo is replicated manually. We know about these:
    1. Container Registry
      • Hopefully this is a shared object storage bucket, in which case this can be removed
    2. GitLab Pages
      • Check that lsyncd is up to date? Run rsync command?
    3. CI traces in Redis
      • Run ::Ci::BuildTraceChunk.redis.find_each(batch_size: 10, &:use_database!)
  2. Navigate to https://gstg.gitlab.com/admin/geo_nodes or https://gprd.gitlab.com/admin/geo_nodes
  3. Wait for all repositories and wikis to become synchronized
    1. Press "Sync Information"
    2. Wait for "repositories synced" and "wikis synced" to reach 100% with 0 failures
    3. If failures appear, see Rails console commands to resync repos/wikis: https://gitlab.com/snippets/1713152
  4. Wait for all repositories and wikis to become verified
    1. Press "Verification Information"
    2. Wait for "repositories verified" and "wikis verified" to reach 100% with 0 failures
    3. If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
  5. In "Sync Information", wait for "Data replication lag" to read 1m or less
  6. In "Sync Information", wait for "Last event ID seen from primary" to equal "Last event ID processed by cursor"
  7. Wait for all Sidekiq jobs to complete on the secondary
    1. Navigate to https://gstg.gitlab.com/admin/background_jobs / https://gprd.gitlab.com/admin/background_jobs
    2. Press Queues -> Live Poll
    3. Wait for all queues to reach 0
    4. Wait for the number of Busy jobs to reach 0
  8. Now disable all sidekiq-cron jobs on the secondary
    1. Navigate to https://gstg.gitlab.com/admin/background_jobs / https://gprd.gitlab.com/admin/background_jobs
    2. Press Cron
    3. Press Disable all

At this point all data on the primary should be present in exactly the same form on the secondary. There is no outstanding work in Sidekiq on the primary or secondary, and if we fail over now, no data will be lost.

Stopping all cronjobs on the secondary means it will no longer attempt to run background synchronization operations against the primary, reducing the chance of errors while it is being promoted.
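
As a final cross-check before promoting, the Geo status rake task on a secondary application node summarizes the same sync, verification and event-cursor information from the CLI (assuming omnibus paths):

  # Run on a Geo secondary application node; output should show everything
  # synced and verified and the event cursor caught up.
  sudo gitlab-rake geo:status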

Promote the secondary

  1. Gracefully turn off the Azure postgresql instances.
    • Keep everything, just ensure it’s turned off
    • gitlab-ctl stop postgresql
  2. Trigger the postgresql failover, making the read-only replica in gstg / gprd read-writeable
    • Staging:
      1. sudo /opt/gitlab/embedded/bin/gitlab-pg-ctl promote
    • Production:
      1. Repmgr process for production: #349 (closed)
      2. This should not turn the old primary postgresql servers into followers
  3. Check the database is now read-write
    1. SQL, expecting f (false) as the result: select * from pg_is_in_recovery(); (a gitlab-psql one-liner is sketched after this list)
  4. Run gitlab-rake geo:set_secondary_as_primary on one of the gstg / gprd nodes
  5. Update the chef configuration according to https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1989
  6. Run chef-client on every node to ensure Chef changes are applied and all Geo secondary services are stopped
    • STAGING knife ssh roles:gstg-base 'sudo chef-client'
    • PRODUCTION UNTESTED knife ssh roles:gprd-base 'sudo chef-client'
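
For step 3, a gitlab-psql one-liner on the newly promoted database node (database name assumes the omnibus default):

  # Expect "f" (not in recovery) once the promotion has completed.
  sudo gitlab-psql -d gitlabhq_production -c 'SELECT pg_is_in_recovery();'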

Pages

  1. DNS CHANGES: In Route 53, point *.githost.io to the new GCP Pages LB (an aws-cli sketch follows this list)
  2. STOP SERVING PAGES AT OLD IP AND START PROXY SERVICE TO NEW IP: Modify Azure Pages LB Nodes
    1. Complete the MR at https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/1987
    2. Complete a chef-client run on the gitlab-base-lb-pages role
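
A sketch of the Route 53 change for step 1, with the hosted zone ID and the load balancer address as placeholders:

  # Repoint *.githost.io at the GCP Pages load balancer.
  aws route53 change-resource-record-sets --hosted-zone-id <githost-io-zone-id> \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "*.githost.io.",
          "Type": "A",
          "TTL": 300,
          "ResourceRecords": [{"Value": "<gcp-pages-lb-ip>"}]
        }
      }]
    }'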

During-Blackout QA

  • All "during the blackout" QA automated tests have succeeded - @meks
  • All "during the blackout" QA manual tests have succeeded - @meks

Complete the Migration

  1. Update the staging.gitlab.com DNS entries to refer to the GCP load-balancer
  2. Remove the broadcast message (an API sketch follows this list)
  3. PRODUCTION ONLY UNTESTED Re-enable mailing queues on sidekiq-asap (revert chef-repo!1922)
    1. admin_emails queue
    2. emails_on_push queue
    3. mailers queue
  4. Ensure all "after the blackout" QA automated tests have succeeded - @meks
  5. Ensure all "after the blackout" QA manual tests have succeeded - @meks
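
An API sketch for removing the broadcast message in step 2 (token and message ID are placeholders):

  # List broadcast messages to find the ID of the maintenance banner, then delete it.
  curl --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
    https://staging.gitlab.com/api/v4/broadcast_messages
  curl --request DELETE --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
    https://staging.gitlab.com/api/v4/broadcast_messages/<id>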

STAGING ONLY Failback, discarding changes made to GCP

Since staging is multi-use and we want to run the failover multiple times anyway, we need these steps.

In the event of discovering a problem doing the failover on GitLab.com "for real" (i.e. before opening it up to the public), it will also be super-useful to have this documented and tested.

  1. Re-add geo_secondary_role['enable'] = true on every gstg node
  2. Run gitlab-ctl reconfigure on every changed gstg node
  3. Update the staging.gitlab.com DNS entries to refer to the Azure load-balancer
  4. If the failover went well:
    1. Revert the postgresql failover, so the data on the stopped primary staging nodes becomes canonical again and the secondary staging nodes replicate from it
    2. Delete the snapshots
  5. If the failover went badly:
    1. Restore the database nodes from snapshot
  6. Turn on the Azure environment
  7. Enable access to the Azure environment from the outside world
  8. Re-enable cronjobs on the primary
    • Navigate to https://staging.gitlab.com/admin/background_jobs, press "Cron"
    • Find the geo_sidekiq_cron_config_worker row and press "Enable" on it
    • All but the Geo-secondary-only queues will be re-enabled
  9. Ensure rake gitlab:geo:check passes on both primary and secondary (see the sketch below)
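
A sketch of the final verification for step 9, run on an application node in each environment (assuming omnibus paths):

  # Both primary and secondary should report no Geo errors.
  sudo gitlab-rake gitlab:geo:check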