2018-05-08 Staging failover preflight checks
Pre-flight checks
Object storage
- Ensure primary and secondary share the same object storage configuration
  - The `uploads`, `lfs` and `artifacts` keys in `config/gitlab.yml`. If the container registry were enabled, we’d check the config for that too.
    - Ran `md5sum /etc/gitlab/gcs-creds.json` on both machines
    - GSTG has `google_storage_secret_access_key` and `google_storage_access_key_id`: does it matter?
- Ensure all files are in object storage (see #394 (comment 72218060))
  - [-] `Upload.with_files_stored_locally.count` # => 0
  - [-] `LfsObject.with_files_stored_locally.count` # => 0
  - [-] `Ci::JobArtifact.with_files_stored_locally.count` # => 0
  - [-] If direct upload isn’t enabled, these numbers may fluctuate slightly as files are uploaded to disk, then moved to object storage (FIX in template: this should be an informational point rather than an action)
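The credentials check above was done by eye. A minimal sketch of the same comparison, assuming the digests are collected over SSH; `digest_only`, `digests_match`, `PRIMARY_HOST` and `SECONDARY_HOST` are hypothetical names, not part of the original procedure:

```shell
#!/bin/sh
# Sketch: compare the digest of one file as reported on two hosts.
# digest_only and digests_match are hypothetical helper names.

# Read md5sum/sha256sum output on stdin and print only the digest column.
digest_only() {
  awk '{print $1}'
}

# Succeed only when both digests are non-empty and identical.
digests_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example wiring; PRIMARY_HOST and SECONDARY_HOST are placeholders:
#   p=$(ssh "$PRIMARY_HOST" md5sum /etc/gitlab/gcs-creds.json | digest_only)
#   s=$(ssh "$SECONDARY_HOST" md5sum /etc/gitlab/gcs-creds.json | digest_only)
#   digests_match "$p" "$s" && echo "match" || echo "MISMATCH"
```

Rejecting empty digests means a failed SSH call or an unreadable file is reported as a mismatch rather than a false match.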
Configuration checks
- Manually compare the diff of the `gitlab.rb` config between representative primary and secondary nodes. Probably on a web worker node. Check for settings enabled on the primary but disabled on the secondary.
- Ensure `gitlab-rake gitlab:check gitlab:geo:check` reports no errors on the primary
- Ensure `gitlab-rake gitlab:check gitlab:geo:check` reports no errors on the secondary
- Ensure `sudo sha256sum /etc/ssh/ssh_host* /etc/gitlab/gitlab-secrets.json` returns the same values on representative primary and secondary nodes (again, likely use a web worker)
- Ensure the repository and wiki verification feature flag is enabled
  - `Feature.enabled?(:geo_repository_verification)` => true
- Ensure the TTL for the `staging.gitlab.com` DNS records is low (300 seconds is fine)
- [-] Ensure the secondary can send emails (FIX in template: can't do on staging right now)
  - [-] Run the following in a Rails console (changing `you` to yourself):
    - `Notify.test_email("you+test@gitlab.com", "Test email", "test")` => FAIL
  - [-] Ensure you receive the email
- Ensure the SSL configuration on the secondary is valid for both the current domain name and `staging.gitlab.com`
  - => `openssl s_client -connect gstg.gitlab.com:443 | openssl x509 -noout -text` works (FIX in template)
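The TTL and SSL checks above can be partly scripted. The following is a sketch only: `max_ttl`, `san_names` and `cert_lists_name` are hypothetical helpers, and the `-servername` flag and `</dev/null` redirect in the usage comment are additions to the command recorded above. A caching resolver reports the remaining TTL, so query an authoritative nameserver to see the configured value.

```shell
#!/bin/sh
# Sketch of two configuration checks. Helper names are hypothetical.

# Read `dig +noall +answer` output on stdin and print the largest TTL (field 2).
max_ttl() {
  awk '$2 ~ /^[0-9]+$/ && $2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}

# Read `openssl x509 -noout -text` output on stdin and print
# one Subject Alternative Name DNS entry per line.
san_names() {
  grep -o 'DNS:[^,[:space:]]*' | sed 's/^DNS://'
}

# Succeed if the certificate text on stdin lists exactly the name in $1.
# Wildcard entries are not expanded here.
cert_lists_name() {
  san_names | grep -Fxq "$1"
}

# Example wiring:
#   dig +noall +answer staging.gitlab.com | max_ttl    # want a value <= 300
#   openssl s_client -connect gstg.gitlab.com:443 -servername staging.gitlab.com \
#     </dev/null 2>/dev/null | openssl x509 -noout -text \
#     | cert_lists_name staging.gitlab.com
```

Matching whole SAN entries rather than substrings avoids treating `staging.gitlab.com.example` as a hit; a certificate that covers the name only through a wildcard would need a separate check.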
Ensure Geo replication is up to date
- Ensure sidekiq is healthy: Navigate to https://staging.gitlab.com/admin/background_jobs
  - Fewer than 10,000 jobs should be enqueued
- [-] Ensure repositories and wikis are at least 99% complete, 0 failed (that’s zero, not 0%) by navigating to https://staging.gitlab.com/admin/geo_nodes and reviewing the numbers there. See https://gitlab.com/snippets/1713152 for how to reschedule failures for resync.
- [-] Local attachments, CI artifacts and LFS objects should have 0 in all columns
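The thresholds above are easy to misread ("zero, not 0%"), so here is a small sketch that encodes them as checks on numbers copied from the admin pages. The function names and argument order are hypothetical, and the values still have to be read off the pages by hand.

```shell
#!/bin/sh
# Sketch: readiness thresholds as checks on manually collected numbers.
# Function names and argument order are hypothetical.

# $1 = enqueued Sidekiq jobs; succeed when fewer than 10,000.
sidekiq_ok() {
  [ "$1" -lt 10000 ]
}

# $1 = synced, $2 = total, $3 = failed.
# Succeed when failed is exactly 0 and synced/total is at least 99%.
# Integer arithmetic avoids rounding: synced * 100 >= total * 99.
geo_ok() {
  [ "$3" -eq 0 ] && [ "$(( $1 * 100 ))" -ge "$(( $2 * 99 ))" ]
}

# Example wiring with made-up numbers:
#   sidekiq_ok 1234 && geo_ok 9950 10000 0 && echo "replication looks ready"
```

Comparing `synced * 100` against `total * 99` keeps the check in integers, so 98.99% is never rounded up to a pass.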
Verify the integrity of replicated repositories and wikis
- [-] Ensure that repository and wiki verification is at least 99% complete, 0 failed (that’s zero, not 0%) by navigating to https://staging.gitlab.com/admin/geo_nodes, going to the `Advanced` tab for the secondary and reviewing the numbers there
- [-] No need to verify the integrity of anything in object storage (FIX in template: this should be informational)
Resolve scheduling conflicts with release team (FIX: add to template)
- Pick a date and time for the failover itself that won't interfere with the release team's work.
- Verify with RMs ahead of time that the chosen date is OK
- Add a downtime notification to any open QA issues in https://gitlab.com/gitlab-org/release/tasks/issues
Edited by Nick Thomas