Load balancing of database queries
This adds support for load balancing of database queries when PostgreSQL is used (MySQL is not supported). The commit messages contain the most detail, so please refer to those for an in-depth description of this feature. Some key features worth mentioning:
- Prepared statements are disabled automatically since these don't work well with load balancing
- Failovers/database restarts are handled gracefully. For example, an offline secondary is ignored, while the primary uses a retry mechanism with exponential backoff
- Load is balanced using a simple round-robin algorithm, without any external dependencies such as Redis
- If no hosts are available, a dedicated error (`Gitlab::Database::LoadBalancing::NoHostsAvailable`) is raised to make monitoring easier
- After a write, a user's requests will use the primary until the secondaries are in sync or until a timeout expires (30 seconds at the moment), whichever comes first
- Load balancing is not enabled for Sidekiq as this would lead to consistency problems, and Sidekiq mostly performs writes anyway
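To make the behaviour in the list above more concrete, here is a minimal, self-contained sketch of three of the ideas: round-robin selection that skips offline secondaries and raises a dedicated error when none are left, sticking a user to the primary after a write, and retrying with exponential backoff. This is not the code from the commits. Every class, method, and parameter name below (`Host`, `RoundRobinHostList`, `PrimarySticking`, `retry_with_backoff`, the `online` flag) is hypothetical, and only the 30 second timeout is taken from the description.

```ruby
# Sketch only: all names here are hypothetical, not the merge request's API.
class NoHostsAvailable < StandardError; end

# A secondary; the `online` flag stands in for a real health check.
Host = Struct.new(:name, :online) do
  def online?
    online
  end
end

class RoundRobinHostList
  def initialize(hosts)
    @hosts = hosts
    @index = 0
  end

  # Returns the next online host in rotation, skipping offline ones.
  # Raises NoHostsAvailable when no host is online (or the list is empty).
  def next_host
    @hosts.length.times do
      host = @hosts[@index]
      @index = (@index + 1) % @hosts.length
      return host if host.online?
    end

    raise NoHostsAvailable, 'no online secondaries'
  end
end

# Tracks whether a user's reads should still go to the primary after a write.
class PrimarySticking
  TIMEOUT = 30 # seconds, the value quoted in the description

  def initialize
    @written_at = {}
  end

  def record_write(user_id, now: Time.now)
    @written_at[user_id] = now
  end

  # True while the user wrote recently and the secondaries have not caught up.
  def use_primary?(user_id, secondaries_in_sync:, now: Time.now)
    written_at = @written_at[user_id]
    return false unless written_at

    if secondaries_in_sync || now - written_at >= TIMEOUT
      @written_at.delete(user_id)
      return false
    end

    true
  end
end

# Runs the block, doubling the wait after each failure. Real code would
# rescue only connection errors; StandardError keeps the sketch short.
def retry_with_backoff(max_attempts: 3, base_delay: 0.1, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts >= max_attempts

    sleeper.call(base_delay * (2**(attempts - 1)))
    retry
  end
end
```

With hosts `db2` (online), `db3` (offline) and `db4` (online), successive calls to `next_host` in this sketch return `db2`, `db4`, `db2`, and so on. Keeping the rotation index in process memory is what lets a scheme like this avoid an external dependency such as Redis, at the cost of each process balancing independently.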
A hard requirement for load balancing is that all the database hosts point to the right type of database. The host in `config/database.yml` must always point to a primary (even after a failover), and the additional hosts must always point to a secondary. This means you'll need to place a load balancer in front of every database, and connect to those load balancers. During a failover the user must take care of re-routing traffic to the right hosts using these load balancers.
Configuring the hosts is currently done using an environment variable (`LOAD_BALANCE_DATABASE_HOSTS`). This removes the need for making any Omnibus changes, or any extra tables/columns in the database.
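As a rough illustration of reading the host list from the environment, the sketch below parses the variable into an array of host names. The description does not state the variable's format, so the comma-separated layout is an assumption, and both helper names (`load_balancing_hosts`, `load_balancing_enabled?`) are hypothetical.

```ruby
# Assumption: the variable holds a comma-separated list of secondary hosts.
# The actual format is defined by the commits, not by this sketch.
def load_balancing_hosts(env = ENV)
  env.fetch('LOAD_BALANCE_DATABASE_HOSTS', '')
     .split(',')
     .map(&:strip)
     .reject(&:empty?)
end

# Hypothetical helper: treat load balancing as enabled only when at least
# one secondary host is configured.
def load_balancing_enabled?(env = ENV)
  load_balancing_hosts(env).any?
end
```

Under that assumption, a value such as `db2.example.com, db3.example.com` would yield two hosts, and an unset or empty variable would leave load balancing off.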
Related issue: gitlab-com/infrastructure#259
- Test load distribution on staging
- Test failovers. Code-wise this is taken care of, but we lack the right infrastructure on GitLab.com to reliably fail over (even before load balancing). This is taken care of separately.
- Disable load balancing when running Rake tasks (e.g.
- `ActiveRecord::Base.connection` for the primary in HostList, instead of connecting separately. The latter would lead to a 2x increase in the number of connections on the primary. An alternative is to not send read-only queries to the primary, instead only using it for/after writes