Open Geo: Promoting a secondary should be simple
    Open Epic created by Fabian Zimmer

    Introduction

    It is currently possible to promote a secondary node to a primary node, either during a planned failover or in a genuine disaster recovery situation. Geo supports promotion for a single-node installation and for an HA configuration. The current promotion process consists of a large number of manual preflight checks, followed by the actual promotion. Promotion is only possible from the command line; there is no UI flow, and high-availability configurations require modifications to the gitlab.rb file on almost all nodes. Given the critical nature of this process, Geo should make it simple to promote a secondary, especially for more complex high-availability configurations.

    Problem to solve

    The current promotion process consists of two main phases:

    1. A number of pre-flight checks; this only applies in a planned failover
    2. The actual promotion process of a secondary node.

    The seven pre-flight checks are highly manual and involve different interactions with the UI and the command line. These pre-flight checks should be as automatic as possible because failing to perform them may expose users to additional risk during the actual promotion.

    After permanently disabling the primary (to avoid split brain), the promotion of a secondary can be performed in two different ways:

    • On a single node via the gitlab-ctl promote-to-primary-node command
    • In an HA configuration, by changing the gitlab.rb configuration on every node, running sudo gitlab-pg-ctl promote on the PostgreSQL node, and then running sudo gitlab-rake geo:set_secondary_as_primary on an app node.
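
    The two paths above can be written down as data to make the difference in effort visible. This is only a minimal sketch: the Step type, the promotion_steps function, and the node_count parameter are hypothetical names, and placing the gitlab.rb edits before the database promotion is an assumption. The commands themselves are the ones listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    node: str                  # kind of node the step runs on
    action: str                # command or manual action from the description above
    manual_edit: bool = False  # True when the step is a hand edit, not a command


def promotion_steps(is_ha: bool, node_count: int = 1) -> List[Step]:
    """Return the ordered promotion steps for a secondary (hypothetical helper)."""
    if not is_ha:
        # Single node: one command does the whole promotion.
        return [Step("single node", "gitlab-ctl promote-to-primary-node")]

    # HA: gitlab.rb has to be changed by hand on every node (ordering assumed).
    steps = [
        Step(f"node {i}", "edit gitlab.rb", manual_edit=True)
        for i in range(1, node_count + 1)
    ]
    steps.append(Step("PostgreSQL node", "sudo gitlab-pg-ctl promote"))
    steps.append(Step("app node", "sudo gitlab-rake geo:set_secondary_as_primary"))
    return steps


if __name__ == "__main__":
    for step in promotion_steps(is_ha=True, node_count=3):
        print(f"{step.node}: {step.action}")
```

    In this sketch the number of steps for HA grows with the number of nodes, while the single-node path stays at one command. The caller still has to pass is_ha, which mirrors the current situation where the administrator must know the architecture in advance.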

    This process is again highly manual and error-prone. Changing the gitlab.rb configuration on dozens of nodes in fully scaled architectures takes time, and it is easy to miss a node. Having two different commands for different configurations also relies on the systems administrator knowing which kind of architecture is deployed; the tool itself has no way to detect this.

    This is the current flow:

    Screenshot_2020-04-22_at_16.15.38

    Further details

    Proposal

    • Create a single command that is valid for a single node and for HA e.g. gitlab-ctl promote-to-primary-node
    • Orchestrate changes to configuration across a fleet of nodes
    • The tool performs most of the preflight checks and warns the user of the impact of proceeding
    • Should support reference architecture(s)
    • Should be able to determine whether it can be used (that is, whether everything is managed by Omnibus)
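
    One way the "perform checks, then warn" behaviour proposed above could be structured is sketched below. It is an illustration under stated assumptions, not a design for the actual tool: CheckResult, run_preflight, may_proceed, and both example checks are hypothetical, and the check bodies are stubs that return fixed values.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    impact: str = ""  # what the user risks by proceeding if the check failed


Check = Callable[[], CheckResult]


def run_preflight(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check and collect one warning per failed check.

    All checks run even after a failure, so the user sees the full impact at once.
    """
    warnings: List[str] = []
    for check in checks:
        result = check()
        if not result.passed:
            warnings.append(f"{result.name}: {result.impact}")
    return (not warnings, warnings)


def may_proceed(all_passed: bool, user_confirmed: bool) -> bool:
    """Promotion continues if all checks pass or the user accepts the warnings."""
    return all_passed or user_confirmed


def omnibus_managed() -> CheckResult:
    # Stub: a real check would inspect the nodes instead of returning True.
    return CheckResult("all nodes managed by Omnibus", True)


def placeholder_check() -> CheckResult:
    # Hypothetical failing check, used only to show how a warning is reported.
    return CheckResult("placeholder check", False, "hypothetical impact message")


if __name__ == "__main__":
    all_passed, warnings = run_preflight([omnibus_managed, placeholder_check])
    for warning in warnings:
        print("WARNING", warning)
    print("proceed:", may_proceed(all_passed, user_confirmed=False))
```

    Keeping the confirmation decision in a separate function is a design choice in this sketch: a failed check produces a warning rather than a hard stop, which matches the proposal that the tool warns the user of the impact of proceeding.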

    Idealised flow: Screenshot_2020-04-22_at_16.28.34

    Permissions and Security

    There are security implications when orchestrating changes across nodes. Some of the changes may require sudo access.

    Documentation

    The current DR documentation requires a major revamp and would have to be adjusted in conjunction with this change.

    Availability & Testing

    We would have to perform thorough testing of the tool on HA reference architectures. Given the criticality of this process, testing is imperative.

    What does success look like, and how can we measure that?

    • We can measure the number of manual steps it takes to promote an HA configuration and the reduction that was achieved
    • A single command line tool that allows for the promotion of a secondary.
    • Reduction of manual steps from 20-25 to < 10
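
    To put the step-count target in relative terms, the small calculation below works out the reduction it implies. It assumes that "< 10" means at most 9 manual steps; reduction_percent is a hypothetical helper name.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage reduction in manual steps when going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Target: from 20-25 manual steps down to fewer than 10 (at most 9 assumed).
    for before in (20, 25):
        print(before, "->", 9, ":", reduction_percent(before, 9), "% fewer steps")
```

    Under that assumption, the target corresponds to removing at least 55% of the manual steps when starting from 20 and at least 64% when starting from 25.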

    What is the type of buyer?

    Premium and Ultimate

    Is this a cross-stage feature?

    Potentially relevant to work in Distribution.

    Links

    Edited by Fabian Zimmer
