2018-05-08 Staging failover preflight checks

Pre-flight checks

Object storage

  1. Ensure primary and secondary share the same object storage configuration

    1. Compare the uploads, lfs and artifacts keys in config/gitlab.yml. If the container registry were enabled, we’d check its config too.
      • Ran md5sum /etc/gitlab/gcs-creds.json on both machines
      • GSTG has google_storage_secret_access_key and google_storage_access_key_id: does it matter?
  2. Ensure all files are in object storage (see #394 (comment 72218060))

    1. [-] Upload.with_files_stored_locally.count # => 0
    2. [-] LfsObject.with_files_stored_locally.count # => 0
    3. [-] Ci::JobArtifact.with_files_stored_locally.count # => 0
    4. [-] If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage (FIX in template: this should be an informational point rather than an action)
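
The three counts above can be checked in one pass. The following is a minimal sketch, assuming it is pasted into a Rails console on the primary; the helper name locally_stored_leftovers is hypothetical, and only the with_files_stored_locally scopes come from the checklist itself.

```ruby
# Returns the labels of the checks whose local-file count is not zero.
# `counts` maps a label to the integer returned by the matching
# `with_files_stored_locally.count` call.
# (Hypothetical helper, not part of GitLab.)
def locally_stored_leftovers(counts)
  counts.reject { |_label, count| count.zero? }.keys
end

# Inside a Rails console the counts could be gathered like this. The guard
# keeps the snippet loadable outside Rails, where the models do not exist.
if defined?(Rails)
  counts = {
    "Upload"          => Upload.with_files_stored_locally.count,
    "LfsObject"       => LfsObject.with_files_stored_locally.count,
    "Ci::JobArtifact" => Ci::JobArtifact.with_files_stored_locally.count
  }
  leftovers = locally_stored_leftovers(counts)
  if leftovers.empty?
    puts "All files are in object storage"
  else
    puts "Still stored locally: #{leftovers.join(', ')}"
  end
end
```

If direct upload is disabled, a small non-zero count may be transient, so re-running the check after a short wait is probably worthwhile before treating it as a failure.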

Configuration checks

  1. Manually diff the gitlab.rb config between representative primary and secondary nodes (probably web worker nodes). Check for settings enabled on the primary but disabled on the secondary.
  2. Ensure gitlab-rake gitlab:check gitlab:geo:check reports no errors on the primary
  3. Ensure gitlab-rake gitlab:check gitlab:geo:check reports no errors on the secondary
  4. Ensure sudo sha256sum /etc/ssh/ssh_host* /etc/gitlab/gitlab-secrets.json returns the same values on representative primary and secondary nodes (again, likely use a web worker)
  5. Ensure repository and wiki verification feature flag is enabled
    • Feature.enabled?(:geo_repository_verification) => true
  6. Ensure the TTL for the staging.gitlab.com DNS records is low (300 seconds is fine)
  7. [-] Ensure the secondary can send emails (FIX in template: can't do on staging right now)
    1. [-] Run the following in a Rails console (changing you to yourself):
      • Notify.test_email("you+test@gitlab.com", "Test email", "test") => FAIL
    2. [-] Ensure you receive the email
  8. Ensure the SSL configuration on the secondary is valid for both the current domain name and staging.gitlab.com
    • openssl s_client -connect gstg.gitlab.com:443 | openssl x509 -noout -text => works (FIX in template)
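
Comparing the sha256sum output from step 4 by eye is error-prone, so the comparison can be scripted. This is a rough sketch, assuming the output of the sha256sum command has been captured as text from one primary and one secondary node; the helper names parse_checksums and checksum_mismatches are hypothetical, and paths are assumed to contain no spaces.

```ruby
# Parses `sha256sum` output ("<digest>  <path>" per line) into a
# { path => digest } hash. Blank lines are ignored.
def parse_checksums(output)
  output.each_line
        .map(&:split)
        .reject(&:empty?)
        .map { |digest, path| [path, digest] }
        .to_h
end

# Returns the sorted list of paths whose digest differs between the two
# nodes, including paths present on only one of them.
def checksum_mismatches(primary_output, secondary_output)
  primary   = parse_checksums(primary_output)
  secondary = parse_checksums(secondary_output)
  (primary.keys | secondary.keys)
    .select { |path| primary[path] != secondary[path] }
    .sort
end
```

An empty array from checksum_mismatches means the host keys and gitlab-secrets.json match on both nodes; any path returned needs investigating before the failover.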

Ensure Geo replication is up to date

  1. Ensure sidekiq is healthy: Navigate to https://staging.gitlab.com/admin/background_jobs
    1. Fewer than 10,000 jobs should be enqueued
  2. [-] Ensure repositories and wikis are at least 99% complete, 0 failed (that’s zero, not 0%) by navigating to https://staging.gitlab.com/admin/geo_nodes and reviewing the numbers there. See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync.
  3. [-] Local attachments, CI artifacts and LFS objects should have 0 in all columns
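
The thresholds in this section (fewer than 10,000 enqueued jobs, at least 99% complete, exactly 0 failed) can be written down as predicates. This is only a sketch: the numbers are assumed to be read manually from the admin pages above, and the method names sidekiq_healthy? and geo_sync_ready? are hypothetical, not part of GitLab.

```ruby
# True when the Sidekiq queue is below the enqueued-jobs limit
# ("fewer than 10,000" means strictly less than).
def sidekiq_healthy?(enqueued, limit: 10_000)
  enqueued < limit
end

# True when at least `min_percent` of items are synced and none failed.
# Integer arithmetic avoids floating-point rounding at the 99% boundary.
def geo_sync_ready?(synced:, total:, failed:, min_percent: 99)
  return false unless failed.zero? # zero failures, not 0%
  return true if total.zero?       # assumption: nothing to sync counts as complete
  synced * 100 >= total * min_percent
end
```

For example, geo_sync_ready?(synced: 990, total: 1000, failed: 0) passes, while a single failure fails the check even at 100% synced.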

Verify the integrity of replicated repositories and wikis

  1. [-] Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%) by navigating to https://staging.gitlab.com/admin/geo_nodes, going to the Advanced tab for the secondary and reviewing the numbers there
  2. [-] No need to verify the integrity of anything in object storage (FIX in template: this should be informational)

Resolve scheduling conflicts with release team (FIX: add to template)

  1. Pick a date and time for the failover itself that won't interfere with the release team's work.
  2. Verify with RMs ahead of time that the chosen date is OK
  3. Add a downtime notification to any open QA issues in https://gitlab.com/gitlab-org/release/tasks/issues
Edited May 09, 2018 by Nick Thomas