Sidekiq status / configuration helpers for planned failover

Follow-up to the initial planned failover MR: !4920 (merged)

Currently, planned failover requires us to perform complex actions against sidekiq's configuration on the primary, and to extract complex status data from both primary and secondary during the failover itself. Both tasks are high-stakes, and performing them manually is error-prone, with data loss the probable result of a mistake.

We should introduce either rake tasks or UI elements that can do these tasks for us.

I think it would be good to have a UI element for a "sidekiq maintenance mode" on the primary. In this state - which could be shared with gitlab-ce - only those cronjobs essential to GitLab system function are run.
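As a rough illustration of the filtering this mode implies, here is a minimal Python sketch. It is not GitLab code: the function name, the job names, and the allowlist are all hypothetical, and which cronjobs actually count as essential is not decided in this issue.

```python
# Hypothetical allowlist of essential cronjobs. The real set is not defined
# in this issue; these names are placeholders only.
ESSENTIAL_CRON_JOBS = frozenset({"essential_job_a", "essential_job_b"})


def runnable_cron_jobs(all_jobs, maintenance_mode, essential=ESSENTIAL_CRON_JOBS):
    """Return the cronjobs that may run, preserving their original order.

    Outside maintenance mode every job runs. In maintenance mode only jobs
    on the essential allowlist are kept.
    """
    if not maintenance_mode:
        return list(all_jobs)
    return [job for job in all_jobs if job in essential]


if __name__ == "__main__":
    jobs = ["essential_job_a", "nightly_report", "essential_job_b"]
    print(runnable_cron_jobs(jobs, maintenance_mode=True))
```

A single on/off flag with a fixed allowlist keeps the state easy to toggle from either a rake task or a UI element.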

We can also display a banner or big thumbs-up or other obvious status display in the Geo nodes status page when:

  • The primary's sidekiq is in maintenance mode and all queues on the primary are empty
  • All geo-related queues are empty on the secondary

When both these conditions are met, and all replication and verification counts are at 100%, we can show an element that advises the admin that a failover would be "safe", i.e., that there would be no data loss. When we don't meet these conditions, we can display an element that advises the admin that a failover is not safe and that "X repositories, Y files, ... will be lost if you fail over right now".
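The check described above could be sketched roughly as follows. This is a Python illustration of the decision logic only, not a proposed implementation: the `FailoverStatus` fields and the `failover_advice` function are hypothetical, and how the queue sizes and outstanding counts would be collected from the primary and secondary is assumed rather than shown.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverStatus:
    """Hypothetical snapshot of the data the status page would need."""
    primary_in_maintenance_mode: bool
    # Queue name -> number of pending jobs.
    primary_queue_sizes: dict = field(default_factory=dict)
    secondary_geo_queue_sizes: dict = field(default_factory=dict)
    # Data type -> items not yet fully replicated and verified.
    outstanding_counts: dict = field(default_factory=dict)


def failover_advice(status):
    """Return (safe, problems); problems lists each unmet condition."""
    problems = []
    if not status.primary_in_maintenance_mode:
        problems.append("primary sidekiq is not in maintenance mode")

    busy_primary = sorted(q for q, n in status.primary_queue_sizes.items() if n > 0)
    if busy_primary:
        problems.append("primary queues not empty: " + ", ".join(busy_primary))

    busy_secondary = sorted(q for q, n in status.secondary_geo_queue_sizes.items() if n > 0)
    if busy_secondary:
        problems.append("secondary geo queues not empty: " + ", ".join(busy_secondary))

    outstanding = {k: n for k, n in status.outstanding_counts.items() if n > 0}
    if outstanding:
        lost = ", ".join(f"{n} {kind}" for kind, n in sorted(outstanding.items()))
        problems.append(f"{lost} will be lost if you fail over right now")

    return (not problems, problems)


if __name__ == "__main__":
    status = FailoverStatus(
        primary_in_maintenance_mode=True,
        primary_queue_sizes={"default": 0},
        secondary_geo_queue_sizes={"geo": 0},
        outstanding_counts={"repositories": 3, "files": 0},
    )
    safe, problems = failover_advice(status)
    print("safe" if safe else "NOT safe: " + "; ".join(problems))
```

Returning the list of unmet conditions, rather than a bare boolean, lets the same result drive both the banner and a rake task that is polled in a loop.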

/cc @jramsay @stanhu


The following discussions from !4920 (merged) should be addressed:

  • @nick.thomas started a discussion:

    Even if we can't do full read-only mode, I think it'd be good to have a sidekiq maintenance mode that we could enable/disable via rake, to make these steps less handwavy. Mistakes will be really easy here.

  • @nick.thomas started a discussion:

    We could have sidekiq status rake tasks that we could run in a loop to make these instructions less ambiguous

  • @nick.thomas commented on a discussion: (+1 comment)

    The writing so far assumes very little, just that Geo is already set up and working.

    It doesn't matter how the databases are set up, or whether you're using HA or object storage. The steps are intended to be general enough to handle all that. I've added expanded wording to make the object storage cases explicit, and included a recommendation that planned failover include a separate migration to object storage to help matters.

    "How to replicate pages", etc, is explicitly out-of-scope, but with some hints in Not all data is replicated.

    Given this, I'm not sure how useful an assumed topology is. Feel free to unresolve if you think there's still value in it.
