2018-05-17: staging failover preflight checks
Pre-flight checks
GitLab Version Checks
- Ensure that both sides are running the same minor version.
  - Versions can be confirmed using the Omnibus version tracker dashboards:
    - Staging
    - Production
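The "same minor version" rule can be sketched in Ruby as below. This is only an illustration: the helper names are hypothetical, and it assumes the version strings read from the dashboards start with `<major>.<minor>` (for example `10.8.0-ee`, an illustrative value rather than the actual staging version).

```ruby
# Extract [major, minor] from a version string such as "10.8.0-ee".
# Returns nil when the string does not start with "<major>.<minor>".
def major_minor(version)
  match = version.to_s.match(/\A(\d+)\.(\d+)/)
  match && [match[1].to_i, match[2].to_i]
end

# True only when both versions parse and share the same major and minor number.
# Patch level and edition suffix are deliberately ignored.
def same_minor_version?(primary_version, secondary_version)
  primary = major_minor(primary_version)
  secondary = major_minor(secondary_version)
  !primary.nil? && primary == secondary
end
```

Unparseable input is treated as a mismatch, so a blank dashboard value fails the check instead of passing silently.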
Object storage
- Ensure primary and secondary share the same object storage configuration. In config/gitlab.yml, compare the following keys:
  - uploads
  - lfs
  - artifacts
- Ensure the container registry has the same object storage configuration on primary and secondary
- Ensure all files are in object storage
  - If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage
  - On staging, these numbers are non-zero. Just mark as checked.
  - Upload.with_files_stored_locally.count # => 0
  - LfsObject.with_files_stored_locally.count # => 0
  - Ci::JobArtifact.with_files_stored_locally.count # => 0
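One way to compare the three config/gitlab.yml keys is to load both files and diff only the relevant sections. The sketch below makes some assumptions: the `production` -> key -> `object_store` nesting is a guess at the file layout, and the method name and file paths are hypothetical.

```ruby
require "yaml"

# Keys named in the checklist above.
OBJECT_STORAGE_KEYS = %w[uploads lfs artifacts].freeze

# Return the keys whose object_store settings differ between two parsed
# gitlab.yml documents. The env -> key -> "object_store" nesting is an
# assumption about the file layout; adjust the path if yours differs.
# A section missing on both sides compares as equal.
def object_store_mismatches(primary, secondary, env: "production")
  OBJECT_STORAGE_KEYS.reject do |key|
    primary.dig(env, key, "object_store") == secondary.dig(env, key, "object_store")
  end
end

# Usage sketch (hypothetical local copies of each side's config):
# primary   = YAML.safe_load(File.read("primary-gitlab.yml"), aliases: true)
# secondary = YAML.safe_load(File.read("secondary-gitlab.yml"), aliases: true)
# puts object_store_mismatches(primary, secondary).inspect
```

An empty result means the three sections match. It says nothing about the container registry, which is configured separately.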
Configuration checks
- Ensure gitlab-rake gitlab:geo:check reports no errors on the primary
- Ensure gitlab-rake gitlab:geo:check reports no errors on the secondary
- Compare some files on a representative node (a web worker) between primary and secondary:
  - Manually compare the diff of the /etc/gitlab/gitlab.rb config file.
  - Manually compare the diff of the /etc/gitlab/gitlab-secrets.json
- Check SSH host keys match by comparing the output of these commands:
  - ssh-keyscan staging.gitlab.com | ssh-keygen -lf -
  - ssh-keyscan gstg.gitlab.com | ssh-keygen -lf -
- Ensure repository and wiki verification feature flag shows as enabled on both primary and secondary
  - Feature.enabled?(:geo_repository_verification)
- Ensure the TTL for the staging.gitlab.com DNS records is low (300 seconds is fine)
- Ensure the SSL configuration on the secondary is valid for both the current domain name and staging.gitlab.com
  - openssl s_client -connect gstg.gitlab.com:443 | openssl x509 -noout -text
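The SSH host key comparison can also be done mechanically instead of by eye. The sketch below works on raw ssh-keyscan output (one `<host> <keytype> <base64-key>` entry per line) rather than on the ssh-keygen fingerprints used in the checklist; identical keys imply identical fingerprints. The method names are hypothetical.

```ruby
require "set"

# Parse ssh-keyscan output into a set of [keytype, key] pairs, ignoring blank
# lines and "#" comment lines. The hostname column is dropped so that output
# for two different hostnames can be compared directly.
def host_keys(keyscan_output)
  keyscan_output.each_line
                .map(&:strip)
                .reject { |line| line.empty? || line.start_with?("#") }
                .map { |line| line.split(/\s+/, 3) }
                .select { |fields| fields.length == 3 }
                .map { |_host, type, key| [type, key] }
                .to_set
end

# True when both hosts present exactly the same non-empty set of host keys.
def host_keys_match?(primary_output, secondary_output)
  primary = host_keys(primary_output)
  !primary.empty? && primary == host_keys(secondary_output)
end

# Usage sketch:
# host_keys_match?(`ssh-keyscan staging.gitlab.com`, `ssh-keyscan gstg.gitlab.com`)
```

Comparing sets makes the order in which ssh-keyscan lists key types irrelevant. An empty scan counts as a failure, so an unreachable host does not pass the check.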
Ensure Geo replication is up to date
- Ensure sidekiq is healthy: fewer than 10,000 jobs should be enqueued
  - Navigate to https://staging.gitlab.com/admin/background_jobs
- Ensure repositories and wikis are at least 99% complete, 0 failed (that’s zero, not 0%):
  - Navigate to https://staging.gitlab.com/admin/geo_nodes and check
  - See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync
  - In staging, some failures and unsynced repositories are expected
- Local attachments, CI artifacts and LFS objects should have 0 in all columns
  - In staging, some failures and unsynced files are expected
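The thresholds in this section can be written down as a small sketch, which makes the "zero, not 0%" distinction explicit. The method and constant names are hypothetical, and the counts are assumed to be read manually from the admin pages linked above.

```ruby
# Thresholds taken from the checklist above.
MAX_ENQUEUED_JOBS = 10_000
MIN_SYNC_PERCENT = 99

# Healthy means strictly fewer than 10,000 enqueued jobs.
def sidekiq_healthy?(enqueued)
  enqueued < MAX_ENQUEUED_JOBS
end

# failed must be exactly zero (a count, not a percentage), and at least 99%
# of the total must be synced. Integer arithmetic avoids rounding surprises.
# An empty total is treated as "not ready", which is an assumption.
def replication_ready?(synced:, total:, failed:)
  return false if total <= 0

  failed.zero? && synced * 100 >= total * MIN_SYNC_PERCENT
end
```

For example, 990 of 1,000 repositories synced with no failures passes, while 1,000 of 1,000 synced with a single failure does not.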
Verify the integrity of replicated repositories and wikis
- Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%):
  - Navigate to https://gstg.gitlab.com/admin/geo_nodes
  - Review the numbers under the Verification Information tab for the secondary node
  - If failures appear, see https://gitlab.com/snippets/1713152#verify-repos-after-successful-sync for how to manually verify after resync
  - No need to verify the integrity of anything in object storage
Pages
- Verify that Pages Azure-to-GCP Proxy is working correctly (see #159 (closed))
- Perform GitLab Pages data verification (see #388 (closed))
Schedule the failover
- Pick a date and time for the failover itself that won't interfere with the release team's work.
- Verify with RMs for the next release that the chosen date is OK
- Create a new issue in the tracker using the "failover" template
- Create a new issue in the tracker using the "test plan" template
- Add a downtime notification to any affected QA issues in https://gitlab.com/gitlab-org/release/tasks/issues
Edited by Nick Thomas