Proposal for GitLab HA/Geo to use Consul
GitLab HA/Geo + Consul
This is a proposal for better Consul integration into our process, for after the GA milestone.
cc @stanhu @marin @twk3 @nick.thomas @dbalexandre @to1ne @jarv
Introduction
GitLab can scale out from a single Raspberry Pi machine to multiple geographical locations when using Geo. Each location can also be configured in different ways and scaled independently.
For example, we could theoretically run Geo on single Raspberry Pi machines, spread the load across multiple Raspberry Pi machines in each location, or decide that some locations deserve HA while others run on a single machine.
This makes bootstrapping somewhat complicated: it requires a lot of copying and pasting of credentials so that each machine can discover the others and their dependent services, and secondary locations need to replicate secrets and other configuration from the primary location.
This also makes promoting a location to primary a task that requires multiple steps.
All of this is due to the fact that we use a provisioning system that was not designed for a multi-machine environment. If we used Chef Server instead of Chef Omnibus, we would have the benefit of data bags and the search API.
For this very reason, Omnibus's successor, Chef's Habitat, has a service discovery / metadata system that is "cluster aware", built on top of a gossip protocol with leader election, etc. Their description sounds very similar to Consul, and after looking at their issue list it really does appear to be very similar to Consul (if you exclude the supervisor part).
So, to achieve similar results without moving away from Omnibus Chef, we can build a custom solution using Consul, plus some smart recipes that query Consul and optionally trigger `gitlab-ctl reconfigure` in specific situations.
Existing Consul use at GitLab
Consul is already available in Omnibus GitLab; it was introduced as part of the HA solution for PostgreSQL + repmgr. As the feature is restricted to EEP, Consul is only available in EE packages (I'm not sure if/how license checking is being handled).
Because the PostgreSQL HA solution requires PG Bouncer, most of the credentials logic revolves around PG Bouncer, and Consul is today only used partially, e.g. to reconfigure PG Bouncer, but not to handle credentials across multiple nodes.
Potential use for GitLab Geo (and single location's HA)
The most time-consuming and error-prone task in enabling Geo is making sure every node has all the required credentials and shares the same `secrets.json` file.
In an ideal situation, we should be able to spin up new nodes, either with the full set of GitLab services (for a single-node Geo location) or with specific services (for a multi-node Geo location), and the only configuration required, aside from specifying the services and machine-specific tunings, would be related to Consul / node awareness.
Single machine Geo setup
Some pseudo `/etc/gitlab/gitlab.rb` file for a single-machine Geo install, primary node:
external_url 'http://gitlab.example.com'
roles ['geo_primary_role']
# You either have GitLab on a single machine or
# a GitLab cluster of multiple machines in a single geographical location.
# By enabling this, we will use Consul and every machine will be aware of each other
gitlab_cluster['enable'] = true
# Geo location name; this is used to "shard" resources for each location where you have Geo enabled
gitlab_cluster['geo'] = 'us-east'
Some pseudo `/etc/gitlab/gitlab.rb` file for a single-machine Geo install, secondary node (EU location):
external_url 'http://ams.gitlab.example.com'
roles ['geo_secondary_role']
# You either have GitLab on a single machine or
# a GitLab cluster of multiple machines in a single geographical location.
# By enabling this, we will use Consul and every machine will be aware of each other
gitlab_cluster['enable'] = true
# Geo location name; this is used to "shard" resources for each location where you have Geo enabled
gitlab_cluster['geo'] = 'eu-ams'
Some pseudo `/etc/gitlab/gitlab.rb` file for a single-machine Geo install, secondary node (CN location):
external_url 'http://cn.gitlab.example.com'
roles ['geo_secondary_role']
# custom SSH port only at this location
gitlab_rails['gitlab_shell_ssh_port'] = 2200
# You either have GitLab on a single machine or
# a GitLab cluster of multiple machines in a single geographical location.
# By enabling this, we will use Consul and every machine will be aware of each other
gitlab_cluster['enable'] = true
# Geo location name; this is used to "shard" resources for each location where you have Geo enabled
gitlab_cluster['geo'] = 'china'
By using Consul we get a distributed KV store available in all locations, which we can populate with data to be shared among all instances. This means that for each location (`gitlab_cluster['geo']`) we prefix keys in the KV store with the location name to make them location-specific, or omit the prefix when a value is intended for all locations (like the content of `secrets.json`).
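A minimal sketch of what that key layout could look like, written against the local Consul agent's KV HTTP API; the key names and values here are illustrative assumptions, not a final schema:

```ruby
# Sketch only: publish one location-scoped key and one global key via the
# local Consul agent's KV HTTP API. Key names and values are assumptions.
require 'net/http'

CONSUL = URI('http://127.0.0.1:8500')

def consul_put(key, value)
  Net::HTTP.start(CONSUL.host, CONSUL.port) do |http|
    http.send_request('PUT', "/v1/kv/#{key}", value)
  end
end

# Location-scoped key: prefixed with gitlab_cluster['geo']
consul_put('gitlab/us-east/postgresql/listen_address', '10.0.0.5')

# Global key (no location prefix): shared by every location, e.g. secrets
consul_put('gitlab/secrets/db_key_base', 'REDACTED')
```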
So a run of `gitlab-ctl reconfigure` on the first node will generate `secrets.json` and publish its values to Consul.
Execution on the other nodes will read values from Consul to populate the local instance. Because every machine now has all the keys, they can even make API calls and register themselves as a Geo node, requiring no additional manual step in the Admin UI.
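A minimal sketch of what a secondary node could do during reconfigure under these assumptions (the KV key name and the target path are hypothetical):

```ruby
# Sketch only: read the shared secrets published by the first node and write
# them locally; if nothing is published yet, this node is the first one and
# should generate and publish them instead. Key name and path are assumptions.
require 'net/http'

CONSUL = URI('http://127.0.0.1:8500')

def consul_get(key)
  Net::HTTP.start(CONSUL.host, CONSUL.port) do |http|
    res = http.get("/v1/kv/#{key}?raw")
    res.is_a?(Net::HTTPSuccess) ? res.body : nil
  end
end

secrets = consul_get('gitlab/secrets.json') # global key: no location prefix
if secrets
  File.write('/etc/gitlab/gitlab-secrets.json', secrets)
  # With all keys in place, the node could also call the GitLab API here to
  # register itself as a Geo node, removing the manual Admin UI step.
else
  # First node in the cluster: generate the secrets locally (as reconfigure
  # already does today) and publish them to Consul with a KV PUT.
end
```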
Single-machine with remote resources setup
This is a slightly simpler setup that makes sense even for small installations (which means we should consider supporting it in something like EES). It does not include HA, but still allows for better resource management:
- GitLab Unicorn machine
- Omnibus PostgreSQL machine
- Omnibus Redis machine
In this type of setup `gitlab_cluster['geo']` is not available, as it's a single location.
The reason to allow this in EES is to give EES users a sneak peek at how easily they can scale out in EEP.
GitLab Unicorn machine
Some pseudo `/etc/gitlab/gitlab.rb` file for a single machine running Unicorn and GitLab custom configs:
external_url 'http://gitlab.example.com'
# custom settings:
gitlab_rails['registry_enabled'] = true
gitlab_rails['registry_host'] = "registry.gitlab.example.com"
gitlab_pages['enable'] = true
roles ['application']
# ...
# You either have GitLab on a single machine or
# a GitLab cluster of multiple machines in a single geographical location.
# By enabling this, we will use Consul and every machine will be aware of each other
gitlab_cluster['enable'] = true
GitLab Redis machine
Some pseudo `/etc/gitlab/gitlab.rb` file for a single machine running Redis:
# custom settings:
roles ['redis'] # disable all services except for redis
# ...
# You either have GitLab on a single machine or
# a GitLab cluster of multiple machines in a single geographical location.
# By enabling this, we will use Consul and every machine will be aware of each other
gitlab_cluster['enable'] = true
By advertising itself and its credentials in Consul, this node allows the first node to populate `resque.yml` correctly after a `gitlab-ctl reconfigure`, which should be executed after this one (or once all resource nodes are already configured).
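A minimal sketch of that flow, assuming a hypothetical `gitlab/redis/url` key and a simplified `resque.yml` schema (the real template and path may differ):

```ruby
# Sketch only: the Redis node advertises its connection details during its own
# reconfigure, and the application node later renders resque.yml from them.
# Key name, password handling and file path are assumptions.
require 'net/http'
require 'yaml'

CONSUL = URI('http://127.0.0.1:8500')

def consul_put(key, value)
  Net::HTTP.start(CONSUL.host, CONSUL.port) { |h| h.send_request('PUT', "/v1/kv/#{key}", value) }
end

def consul_get(key)
  Net::HTTP.start(CONSUL.host, CONSUL.port) { |h| h.get("/v1/kv/#{key}?raw").body }
end

# On the Redis node, during its reconfigure run:
consul_put('gitlab/redis/url', 'redis://:PASSWORD@10.0.0.6:6379')

# On the application node, during its reconfigure run:
File.write('/var/opt/gitlab/gitlab-rails/etc/resque.yml',
           { 'production' => consul_get('gitlab/redis/url') }.to_yaml)
```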
Because this is a single-machine setup, we should lock the Redis resource to this node in Consul, so that if another node boots up with the same configuration we raise an exception saying Redis is already provisioned at node xxx.xxx.xxx.xxx, and that to unregister it they should run a specific command like `gitlab-cluster unregister redis`.
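One possible way to implement that lock is a check-and-set write against the KV store, so the second node fails loudly; the key name below is an assumption:

```ruby
# Sketch only: "claim" the redis role for this node with a check-and-set KV
# write (cas=0 means "only write the key if it does not exist yet").
require 'net/http'

CONSUL = URI('http://127.0.0.1:8500')

def register_role!(role, node_ip)
  Net::HTTP.start(CONSUL.host, CONSUL.port) do |http|
    created = http.send_request('PUT', "/v1/kv/gitlab/roles/#{role}?cas=0", node_ip)
    return if created.body.strip == 'true'

    owner = http.get("/v1/kv/gitlab/roles/#{role}?raw").body
    raise "#{role} is already provisioned at node #{owner}. " \
          "Run `gitlab-cluster unregister #{role}` to release it first."
  end
end

register_role!('redis', '10.0.0.6')
```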
GitLab PostgreSQL machine
This is not an HA setup; it is just a single PostgreSQL machine managed by Omnibus. Enabling an HA setup should look similar to this, perhaps with an extra flag.
Some pseudo `/etc/gitlab/gitlab.rb` file for a single machine running PostgreSQL:
# custom settings:
roles ['postgresql'] # disable all services except for postgresql
postgresql['listen_address'] = "10.0.0.5" # listen on a custom internal IP from cloud provider X
# We will populate the lists below by querying Consul, unless you uncomment them:
# postgresql['trust_auth_cidr_addresses'] = ['10.0.0.5/32', '10.0.0.4/32']
# postgresql['md5_auth_cidr_addresses'] = ['10.0.0.5/32', '10.0.0.4/32']
# ...
# You either have GitLab on a single machine or
# a GitLab cluster of multiple machines in a single geographical location.
# By enabling this, we will use Consul and every machine will be aware of each other
gitlab_cluster['enable'] = true
What this configuration does is make PostgreSQL available in the Consul KV store, with the required credentials. It will also query the existing nodes with the "application" role available in the cluster and add their IPs to the allowed lists (`trust_auth_cidr_addresses` and `md5_auth_cidr_addresses`).
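A minimal sketch of that lookup, assuming application nodes register a hypothetical `gitlab-application` service with their Consul agents:

```ruby
# Sketch only: collect the addresses of every node that registered the
# (assumed) "gitlab-application" service and turn them into /32 CIDR entries.
require 'net/http'
require 'json'

CONSUL = URI('http://127.0.0.1:8500')

def application_cidrs
  Net::HTTP.start(CONSUL.host, CONSUL.port) do |http|
    entries = JSON.parse(http.get('/v1/catalog/service/gitlab-application').body)
    entries.map { |entry| "#{entry['Address']}/32" }
  end
end

# In the recipe these would feed postgresql['trust_auth_cidr_addresses'] and
# postgresql['md5_auth_cidr_addresses'] unless the admin set them explicitly.
puts application_cidrs.inspect # => ["10.0.0.4/32", "10.0.0.5/32"]
```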
By advertising itself and its credentials in Consul, this node allows the first node to populate `database.yml` correctly after a `gitlab-ctl reconfigure`, which should be executed after this one (or once all resource nodes are already configured).
Because this is a single-machine setup, we should lock the PostgreSQL resource to this node in Consul, similarly to the Redis setup.
The GitLab cluster command
This should give administrators visibility of all the nodes that are part of the GitLab cluster. A few examples of its functionality:
gitlab-cluster help
Available commands:
list machines - List all machines that are part of the cluster
list geo - List all available Geo locations
unregister <service> - Remove service from cluster
unregister <service> <machine> - Remove service availability from a specific machine in an HA setup
reconfigure - Triggers gitlab-ctl reconfigure in the whole cluster
...
Available options:
--geo=<location-string> (ex: us-east)
--force - don't ask for confirmation
It should query/write to GitLab's Consul to provide this functionality. Reconfigure could be implemented by triggering an event in Consul and configuring it to execute a script which invokes `gitlab-ctl reconfigure`.
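A minimal sketch of such a handler, assuming the agent is configured with an event watch along the lines of `{"type": "event", "name": "gitlab-reconfigure", "handler": "/opt/gitlab/bin/reconfigure-handler"}`; the event name, paths and lock file are assumptions:

```ruby
#!/opt/gitlab/embedded/bin/ruby
# Sketch only: handler invoked by a Consul watch of type "event". Consul pipes
# the event payload to STDIN as JSON; we shell out to `gitlab-ctl reconfigure`
# with a local lock so concurrent or re-entrant events are ignored.
require 'json'

payload = $stdin.read
events  = JSON.parse(payload) rescue nil
exit 0 if events.nil? || events.empty?

File.open('/var/opt/gitlab/reconfigure-handler.lock', File::RDWR | File::CREAT) do |lock|
  # Non-blocking: if a reconfigure is already running on this node, do nothing.
  exit 0 unless lock.flock(File::LOCK_EX | File::LOCK_NB)
  system('gitlab-ctl reconfigure')
end
```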
We can't lean too heavily on Consul events to make things 100% automatic, as things can have circular dependencies; we need to be aware of that and prevent a notification loop (machine A notifies B, and because B got reconfigured it notifies A again, which causes B to be notified...). Consul events can be handy in HA setups; we just need to be extra careful. Perhaps we can leverage Consul's lock mechanism to prevent loops.
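For example, whoever fires the reconfigure event could first take a cluster-wide lock through a Consul session, so only one reconfigure wave can be in flight at a time; the key name, session name and TTL below are assumptions:

```ruby
# Sketch only: acquire a session-backed lock in the KV store before firing the
# cluster-wide reconfigure event; skip if someone else already holds it.
require 'net/http'
require 'json'

CONSUL = URI('http://127.0.0.1:8500')

Net::HTTP.start(CONSUL.host, CONSUL.port) do |http|
  # Create a session with a TTL so the lock can never be held forever.
  session = JSON.parse(
    http.send_request('PUT', '/v1/session/create',
                      { 'Name' => 'gitlab-reconfigure', 'TTL' => '120s' }.to_json).body
  )['ID']

  # acquire=<session> returns "true" only if nobody else holds the lock.
  acquired = http.send_request('PUT', "/v1/kv/gitlab/locks/reconfigure?acquire=#{session}", '').body

  if acquired.strip == 'true'
    http.send_request('PUT', '/v1/event/fire/gitlab-reconfigure', '')
  else
    puts 'A cluster-wide reconfigure is already in progress, skipping.'
  end
end
```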
GitLab Cluster UI
Because every node will be registered in Consul, we can have read access to parts of it to provide a visual representation of the machines and their specific configurations in the Admin UI. Consul allows for access control, so we can safely expose the Consul connection to Rails through a read-only, hardened account.
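A minimal sketch of creating such a read-only token with Consul's (pre-1.4, legacy) ACL API; the token name, the `gitlab/` prefix, and the management token are assumptions:

```ruby
# Sketch only: create a client token that can only read the gitlab/ KV prefix,
# to be handed to the Rails side for the cluster UI.
require 'net/http'
require 'json'

CONSUL = URI('http://127.0.0.1:8500')

rules = <<~HCL
  key "gitlab/" {
    policy = "read"
  }
HCL

Net::HTTP.start(CONSUL.host, CONSUL.port) do |http|
  res = http.send_request(
    'PUT',
    '/v1/acl/create?token=MANAGEMENT_TOKEN', # requires a management token
    { 'Name' => 'gitlab-rails-readonly', 'Type' => 'client', 'Rules' => rules }.to_json
  )
  puts JSON.parse(res.body)['ID'] # the read-only token to configure in Rails
end
```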