Geo: Promoting a secondary should be simple
Introduction
It is currently possible to promote a secondary node to a primary node, either
during a planned failover or in a genuine disaster recovery situation. Geo
supports promotion for a single node installation and for an HA configuration.
The current promotion process consists of a large number of manual preflight
checks, followed by the actual promotion. The promotion is only possible from the
command line; there is no UI flow, and high-availability configurations require
modifications to the gitlab.rb file on almost all nodes. Given the
critical nature of this process, Geo should make it simple to promote a secondary,
especially for more complex high-availability configurations.
Problem to solve
The current promotion process consists of two main phases:
- A number of pre-flight checks; this only applies in a planned failover
- The actual promotion process of a secondary node.
The seven pre-flight checks are highly manual and involve different interactions with the UI and the command line. These pre-flight checks should be as automatic as possible because failing to perform them may expose users to additional risk during the actual promotion.
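As a rough illustration of what "as automatic as possible" could mean, the sketch below runs a list of checks and reports every failure in one pass. The check names and the PreflightCheck and run_preflight helpers are hypothetical; the two entries are placeholders and do not correspond to the actual seven checks, which are not enumerated here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreflightCheck:
    name: str                # human-readable label shown to the operator
    run: Callable[[], bool]  # returns True when the check passes


def run_preflight(checks: List[PreflightCheck]) -> Tuple[bool, List[str]]:
    """Run every check and return (all_passed, names_of_failed_checks)."""
    failed = []
    for check in checks:
        try:
            ok = check.run()
        except Exception:
            # A check that cannot complete counts as a failure, not a skip.
            ok = False
        if not ok:
            failed.append(check.name)
    return (not failed, failed)


# Placeholder checks (hypothetical); real ones would query the primary
# and the secondary instead of returning constants.
example_checks = [
    PreflightCheck("placeholder check A", lambda: True),
    PreflightCheck("placeholder check B", lambda: False),
]

all_passed, failed = run_preflight(example_checks)
if not all_passed:
    print("Preflight failed:", ", ".join(failed))
```

Running all checks before reporting, rather than stopping at the first failure, lets the operator see the full impact before deciding whether to proceed.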
After permanently disabling the primary (to avoid split brain), the promotion of a secondary can be performed in two different ways:
- On a single node, via the gitlab-ctl promote-to-primary-node command
- In an HA configuration, by changing the gitlab.rb configuration on every node, then running sudo gitlab-pg-ctl promote on the postgres node, and then running sudo gitlab-rake geo:set_secondary_as_primary on an app node.
This process is again highly manual and error-prone. Changing the gitlab.rb
configuration
on dozens of nodes in fully scaled architectures takes time, and it is easy to
miss a node. Having two different procedures relies on the systems administrator
already knowing what kind of architecture is deployed; the tool itself
has no way to detect this.
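To make the gap concrete, here is a minimal sketch of a tool choosing the procedure itself rather than relying on the administrator. The step lists are taken from the two procedures described above; the promotion_steps function and the rule "more than one node means HA" are assumptions made for illustration, not existing behavior.

```python
from typing import List, Tuple

# (where to run, action) pairs, copied from the two procedures described above.
SINGLE_NODE_STEPS = [
    ("secondary node", "gitlab-ctl promote-to-primary-node"),
]
HA_STEPS = [
    ("every node", "edit gitlab.rb"),
    ("postgres node", "sudo gitlab-pg-ctl promote"),
    ("app node", "sudo gitlab-rake geo:set_secondary_as_primary"),
]


def promotion_steps(node_count: int) -> List[Tuple[str, str]]:
    """Pick the procedure from the size of the secondary site.

    Assumption: one node means a single-node install and anything larger
    is HA. A real tool would need a more reliable signal than a count.
    """
    if node_count < 1:
        raise ValueError("a secondary site needs at least one node")
    return list(SINGLE_NODE_STEPS) if node_count == 1 else list(HA_STEPS)


for where, action in promotion_steps(3):
    print(f"{where}: {action}")
```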
This is the current flow:
Further details
- Geo currently doesn't support HA PostgreSQL but may in the future.
- Mural board
- Consul
- GitLab Orchestrator may help here?
Proposal
- Create a single command that is valid for a single node and for HA, e.g. gitlab-ctl promote-to-primary-node
- Orchestrate changes to configuration across a fleet of nodes
- The tool performs most of the preflight checks and warns the user of the impact before proceeding
- Should support reference architecture(s)
- Should be able to determine if it can be used (everything managed by omnibus)
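One way to picture the orchestration and the "can it be used" guard is a planner that expands a fleet inventory into per-host actions and refuses to run if any node is not managed by omnibus. This is only a sketch: the Node type, the role names, UnsupportedFleet, and plan_promotion are all hypothetical, and the inventory would have to come from somewhere not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    hostname: str
    role: str              # "postgres" or "app"; role names are illustrative
    omnibus_managed: bool


class UnsupportedFleet(Exception):
    """Raised when the tool cannot safely be used on this fleet."""


def plan_promotion(nodes: List[Node]) -> List[str]:
    """Expand an HA inventory into an ordered list of per-host actions."""
    unmanaged = [n.hostname for n in nodes if not n.omnibus_managed]
    if unmanaged:
        raise UnsupportedFleet("not managed by omnibus: " + ", ".join(unmanaged))

    postgres = [n for n in nodes if n.role == "postgres"]
    apps = [n for n in nodes if n.role == "app"]
    if not postgres or not apps:
        raise UnsupportedFleet("need at least one postgres node and one app node")

    # Every node gets a config step, so none can be missed by hand.
    plan = [f"{n.hostname}: update gitlab.rb" for n in nodes]
    plan += [f"{n.hostname}: sudo gitlab-pg-ctl promote" for n in postgres]
    # The rake task runs once, on a single app node.
    plan.append(f"{apps[0].hostname}: sudo gitlab-rake geo:set_secondary_as_primary")
    return plan


fleet = [
    Node("pg-1", "postgres", True),
    Node("app-1", "app", True),
    Node("app-2", "app", True),
]
for action in plan_promotion(fleet):
    print(action)
```

Printing the plan before executing it would also give the operator a place to review the impact and confirm.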
Permissions and Security
There are security implications when orchestrating changes across nodes. Some of the changes may require sudo access.
Documentation
The current DR documentation requires a major revamp and would have to be adjusted in conjunction with this change.
Availability & Testing
We would have to perform thorough testing of the tool on HA reference architectures. Given the criticality of this process, testing is imperative.
What does success look like, and how can we measure that?
- We can measure the number of manual steps it takes to promote an HA configuration and the reduction achieved
- A single command line tool that allows for the promotion of a secondary.
- Reduction of manual steps from 20-25 to < 10
What is the type of buyer?
Premium and Ultimate
Is this a cross-stage feature?
Potentially relevant to work in Distribution.