
Add service discovery for the DB load balancer

Yorick Peterse requested to merge db-service-discovery into master

What does this MR do?

This adds a service discovery mechanism to the DB load balancer. With service discovery, a Unicorn process can automatically refresh the list of DB secondaries it uses, without requiring a chef-client run.

How it works

When configuring the load balancer, you provide a DNS record to look up instead of a list of hosts. When this DNS record is given, service discovery is used. If config/database.yml also includes explicitly configured hosts, these are overwritten by the service discovery mechanism.

The following configuration options are provided:

  1. nameserver: the nameserver to query, defaults to localhost
  2. port: the port of the nameserver, defaults to 8600 (Consul's default port for its DNS interface)
  3. interval: the time between checks
  4. record (required): the name of the DNS record to look up (e.g. secondary.postgresql.service.consul)
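
As a rough illustration of the options above, the following sketch merges user-supplied settings with the documented defaults and rejects a configuration without a record. The helper name `discovery_options` and the `DEFAULTS` constant are hypothetical and not taken from this MR. No default is assumed for interval, since none is stated here.

```ruby
# Hypothetical helper: the option names come from the list above,
# everything else is an assumption made for illustration.
DEFAULTS = {
  nameserver: 'localhost',
  port: 8600 # Consul's default port for its DNS interface
}.freeze

# Returns the options with defaults applied. Raises if the required
# `record` option is missing or empty.
def discovery_options(config)
  options = DEFAULTS.merge(config.transform_keys(&:to_sym))

  if options[:record].to_s.empty?
    raise ArgumentError, 'the record option is required'
  end

  options
end

options = discovery_options('record' => 'secondary.postgresql.service.consul')
```

Here `options[:nameserver]` and `options[:port]` fall back to the defaults, while an explicitly configured value would take precedence.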

The service discovery mechanism does not use SRV records for port numbers; instead, it reuses the port configured for the primary (mostly because it reuses existing code that has this limitation).
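
The lookup described above might look roughly like the sketch below: resolve the A records of the configured name and pair each address with the primary's port. This is only an illustration using Ruby's standard `resolv` library. The method name `discover_hosts` is hypothetical, and the actual code in this MR may use a different DNS client.

```ruby
require 'resolv'

# Hypothetical sketch: looks up the A records for `record` and pairs every
# address with the port of the primary, since SRV records are not used.
# `resolver` is any object responding to `getresources`, such as a
# Resolv::DNS instance.
def discover_hosts(resolver, record, primary_port)
  resolver
    .getresources(record, Resolv::DNS::Resource::IN::A)
    .map { |resource| [resource.address.to_s, primary_port] }
end

# Example usage against a local Consul agent (assumed address and ports):
#
#   resolver = Resolv::DNS.new(nameserver_port: [['127.0.0.1', 8600]])
#   discover_hosts(resolver, 'secondary.postgresql.service.consul', 5432)
```

Passing the resolver in as an argument keeps the sketch easy to exercise without a running nameserver.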

Hosts are replaced while holding a mutex, which prevents other requests from picking one of the existing hosts until the replacement is done. Since this is just a simple assignment of a few instance variables, it should happen very quickly. Requests that are already running may continue to use the old hosts until they finish, as this greatly simplifies the code.
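
A minimal sketch of this idea, assuming a simple round-robin host list, is shown below. The `HostList` class and its methods are hypothetical names used for illustration and are not the classes from this MR.

```ruby
# Hypothetical sketch of a host list whose contents can be swapped safely.
class HostList
  def initialize(hosts)
    @mutex = Mutex.new
    @hosts = hosts.dup.freeze
    @index = 0
  end

  # Returns the next host in round-robin order, or nil if there are none.
  # A caller keeps the returned host for the rest of its request, even if
  # the list is replaced in the meantime.
  def next_host
    @mutex.synchronize do
      next nil if @hosts.empty?

      host = @hosts[@index % @hosts.length]
      @index += 1
      host
    end
  end

  # Swaps in a new set of hosts. Only a few instance variables are
  # assigned while the mutex is held, so other threads wait very briefly.
  def replace(new_hosts)
    @mutex.synchronize do
      @hosts = new_hosts.dup.freeze
      @index = 0
    end
  end
end

list = HostList.new(%w[db1 db2])
current = list.next_host # a running request holds on to this host
list.replace(%w[db3])    # later picks only see the new hosts
```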

Why was this MR needed?

When adding secondaries, or during a failover, we need to refresh the list of database secondaries to use. Previously this required the following steps:

  1. Update the hosts in chef-repo
  2. Run sudo chef-client on all affected hosts
  3. Run sudo gitlab-ctl reconfigure on all affected hosts
  4. Run sudo gitlab-ctl hup unicorn on all affected hosts

With service discovery this is reduced to the following:

  1. Make sure Consul (or any other service that provides a DNS interface) is up-to-date
  2. Wait a little while for the system to take care of things automatically

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

https://gitlab.com/gitlab-org/gitlab-ee/issues/2042

TODO

  • Write tests
  • Write documentation
  • Double check if everything works as expected when a process is handling requests
  • Think about this for a day or two, to see if there's anything I have overlooked
  • Talk with production (e.g. @northrup) to see if there are any additional requirements
Edited by Nick Thomas
