2018-05-17: staging failover preflight checks

Pre-flight checks

GitLab Version Checks

  1. Ensure that both sides are running the same minor version.
    • Versions can be confirmed using the Omnibus version tracker dashboards:
      • Staging
        • GCP gstg: https://performance.gstg.gitlab.net/d/TvELheimz/gitlab-omnibus-versions
        • Azure Staging: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=stg
      • Production
        • GCP gprd: https://performance.gprd.gitlab.net/d/TvELheimz/gitlab-omnibus-versions
        • Azure Production: https://performance.gitlab.net/dashboard/db/gitlab-omnibus-versions?var-environment=prd
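
The "same minor version" rule can be sketched as a small Ruby helper. This is only an illustration: the helper names and the version strings are hypothetical, and the real values still have to be read off the dashboards above.

```ruby
# Minimal sketch: two versions count as "the same minor version" when their
# major and minor components match; patch level and suffixes are ignored.
# The version strings used below are made-up examples.
def minor_version(version)
  match = version.match(/\A(\d+)\.(\d+)/)
  raise ArgumentError, "unrecognised version: #{version.inspect}" unless match

  [match[1].to_i, match[2].to_i]
end

def same_minor_version?(primary, secondary)
  minor_version(primary) == minor_version(secondary)
end

puts same_minor_version?("10.8.0-rc11.ee.0", "10.8.1-ee.0") # => true
puts same_minor_version?("10.7.3-ee.0", "10.8.0-ee.0")      # => false
```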

Object storage

  1. Ensure primary and secondary share the same object storage configuration. In config/gitlab.yml, compare the following keys:
    1. uploads
    2. lfs
    3. artifacts
  2. Ensure the container registry has the same object storage configuration on primary and secondary
  3. Ensure all files are in object storage
    • If direct upload isn’t enabled, the counts below may fluctuate slightly as files are uploaded to disk, then moved to object storage
    • On staging, the counts are expected to be non-zero; just mark this item as checked.
    1. Upload.with_files_stored_locally.count # => 0
    2. LfsObject.with_files_stored_locally.count # => 0
    3. Ci::JobArtifact.with_files_stored_locally.count # => 0
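
The configuration comparison in step 1 could be scripted roughly as follows. This is a sketch under assumptions: it assumes each of the three keys carries an object_store sub-section, it takes the already-selected environment section of each config/gitlab.yml as input, and the bucket names and the object_store_mismatches helper are invented for illustration.

```ruby
require "yaml"

OBJECT_STORAGE_KEYS = %w[uploads lfs artifacts].freeze

# Returns the keys whose "object_store" settings differ between two parsed
# configuration hashes. A key that is missing on both sides counts as equal,
# so check separately that the sections exist at all.
def object_store_mismatches(primary, secondary, keys = OBJECT_STORAGE_KEYS)
  keys.reject do |key|
    primary.dig(key, "object_store") == secondary.dig(key, "object_store")
  end
end

# Hypothetical configuration fragment, not copied from any real environment.
primary_yaml = <<~CONFIG
  uploads:
    object_store: { enabled: true, remote_directory: uploads-bucket }
  lfs:
    object_store: { enabled: true, remote_directory: lfs-bucket }
  artifacts:
    object_store: { enabled: true, remote_directory: artifacts-bucket }
CONFIG

primary   = YAML.safe_load(primary_yaml)
secondary = YAML.safe_load(primary_yaml.sub("lfs-bucket", "other-bucket"))

p object_store_mismatches(primary, primary)   # => []
p object_store_mismatches(primary, secondary) # => ["lfs"]
```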

Configuration checks

  1. Ensure gitlab-rake gitlab:geo:check reports no errors on the primary
  2. Ensure gitlab-rake gitlab:geo:check reports no errors on the secondary
  3. Compare some files on a representative node (a web worker) between primary and secondary:
    1. Manually compare the diff of the /etc/gitlab/gitlab.rb config file.
    2. Manually compare the diff of the /etc/gitlab/gitlab-secrets.json
  4. Check SSH host keys match by comparing the output of these commands:
    • ssh-keyscan staging.gitlab.com | ssh-keygen -lf -
    • ssh-keyscan gstg.gitlab.com | ssh-keygen -lf -
  5. Ensure the repository and wiki verification feature flag is enabled on both primary and secondary
    • Feature.enabled?(:geo_repository_verification)
  6. Ensure the TTL for the staging.gitlab.com DNS records is low (300 seconds is fine)
  7. Ensure the SSL configuration on the secondary is valid for both the current domain name and staging.gitlab.com
    1. openssl s_client -connect gstg.gitlab.com:443 | openssl x509 -noout -text
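
For step 4, the two ssh-keygen outputs list the same keys under different host names and possibly in a different order, so a plain text diff will not be clean. A rough Ruby sketch of the comparison follows; it assumes each output line has the shape "<bits> <fingerprint> <host> (<type>)", and the fingerprints and helper names are placeholders rather than real values.

```ruby
require "set"

# Extracts [fingerprint, key type] pairs from `ssh-keygen -lf -` output,
# ignoring the host name and the line order.
def fingerprints(keygen_output)
  keygen_output.each_line
               .map(&:split)
               .reject(&:empty?)
               .map { |fields| [fields[1], fields.last] }
               .to_set
end

def host_keys_match?(output_a, output_b)
  fingerprints(output_a) == fingerprints(output_b)
end

# Placeholder outputs with made-up fingerprints.
staging = <<~OUT
  2048 SHA256:AAAA staging.gitlab.com (RSA)
  256 SHA256:BBBB staging.gitlab.com (ED25519)
OUT
gstg = <<~OUT
  256 SHA256:BBBB gstg.gitlab.com (ED25519)
  2048 SHA256:AAAA gstg.gitlab.com (RSA)
OUT

puts host_keys_match?(staging, gstg) # => true
```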

Ensure Geo replication is up to date

  1. Ensure Sidekiq is healthy: fewer than 10,000 jobs should be enqueued
    • Navigate to https://staging.gitlab.com/admin/background_jobs
  2. Ensure repositories and wikis are at least 99% complete, 0 failed (that’s zero, not 0%):
    • Navigate to https://staging.gitlab.com/admin/geo_nodes and check the repository and wiki sync counts
    • See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
    • In staging, some failures and unsynced repositories are expected
  3. Local attachments, CI artifacts and LFS objects should have 0 in all columns
    • In staging, some failures and unsynced files are expected
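
The thresholds above can be expressed as a small predicate. This is a hedged sketch: the method name, constants, and counts are hypothetical, and the real numbers come from the admin pages linked above. It also shows why "0 failed" means a count of zero: three failures out of 10,000 would round to 0% but still fail the check.

```ruby
MAX_ENQUEUED_JOBS    = 10_000
MIN_COMPLETE_PERCENT = 99

# True when fewer than 10,000 jobs are enqueued, at least 99% of items are
# synced, and the failure count is exactly zero.
def replication_ready?(enqueued:, synced:, total:, failed:)
  return false unless enqueued < MAX_ENQUEUED_JOBS
  return false unless failed.zero?

  # Integer arithmetic avoids rounding a near-miss up to 99%.
  synced * 100 >= total * MIN_COMPLETE_PERCENT
end

puts replication_ready?(enqueued: 1_200, synced: 9_950, total: 10_000, failed: 0) # => true
puts replication_ready?(enqueued: 1_200, synced: 9_950, total: 10_000, failed: 3) # => false
```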

Verify the integrity of replicated repositories and wikis

  1. Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
    • Navigate to https://gstg.gitlab.com/admin/geo_nodes
    • Review the numbers under the Verification Information tab for the secondary node
    • If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
  2. No need to verify the integrity of anything in object storage

Pages

  1. Verify that the Pages Azure-to-GCP proxy is working correctly (see #159 (closed))
  2. Perform GitLab Pages data verification (see #388 (closed))

Schedule the failover

  1. Pick a date and time for the failover itself that won't interfere with the release team's work.
  2. Verify with the release managers (RMs) for the next release that the chosen date is OK
  3. Create a new issue in the tracker using the "failover" template
  4. Create a new issue in the tracker using the "test plan" template
  5. Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues
Edited May 17, 2018 by Nick Thomas