Validate the behaviour of custom `LoadBalancing` when using Puma

We had an outage (unrelated to Puma) that surfaced some Sentry errors: gitlab-com/gl-infra/production#1327 (comment 240852524).

could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use

We do not know the exact cause of these errors.

However, I noticed that GitLab EE has a significant amount of custom `LoadBalancing` code for the DB that might simply not be prepared to run on Puma. The only multi-threaded server we have run so far is Sidekiq, but Sidekiq seems to be exempt from `LoadBalancing`.

This is potentially quite a big problem that has to be resolved before we can move on with Puma.

The related code is in: https://gitlab.com/gitlab-org/gitlab/blob/master/ee%2Flib%2Fgitlab%2Fdatabase%2Fload_balancing.rb

What are we looking for?

  1. How are the read-only connections handled?
  2. Are the read-only connections returned to the pool properly?
  3. How are the read-write connections handled?
  4. Are the read-write connections returned to the pool properly?
  5. Can an already open connection used by thread A be used by thread B?
  6. Does it affect Puma with T1 as well, or only with T>1?
  7. Why didn't we see this before?
  8. Do we know how many database connections are open from a single host to the read-only replicas and to the read-write database?
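
To make questions 2 and 4 concrete, here is a minimal sketch of how a connection that is never returned to the pool can produce a timeout like the one quoted above once several threads share a pool. It is a toy model built on the Ruby standard library only: `ToyPool` and `run_workers` are hypothetical names, and nothing here is taken from ActiveRecord or from the GitLab `LoadBalancing` code, so it only illustrates the suspected failure mode and does not confirm it.

```ruby
# Toy fixed-size pool (hypothetical; NOT ActiveRecord's pool or GitLab's LoadBalancing).
class ToyPool
  class TimeoutError < StandardError; end

  def initialize(size:, checkout_timeout:)
    @checkout_timeout = checkout_timeout
    @available = Array.new(size) { |i| "conn-#{i}" }
    @mutex = Mutex.new
    @released = ConditionVariable.new
  end

  # Blocks until a connection is free or the timeout elapses.
  def checkout
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @checkout_timeout
    @mutex.synchronize do
      while @available.empty?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        if remaining <= 0
          raise TimeoutError, "could not obtain a connection within #{@checkout_timeout}s"
        end
        @released.wait(@mutex, remaining)
      end
      @available.pop
    end
  end

  # Returns a connection to the pool and wakes up one waiting thread.
  def checkin(conn)
    @mutex.synchronize do
      @available.push(conn)
      @released.signal
    end
  end

  # Safe pattern: the connection is always returned, even if the block raises.
  def with_connection
    conn = checkout
    yield conn
  ensure
    checkin(conn) if conn
  end
end

# Starts `threads` workers; with leak: true each worker keeps its connection forever.
def run_workers(pool, threads:, leak:)
  workers = Array.new(threads) do
    Thread.new do
      begin
        if leak
          pool.checkout # never checked in again
        else
          pool.with_connection { |_conn| sleep 0.01 }
        end
        :ok
      rescue ToyPool::TimeoutError
        :timeout
      end
    end
  end
  workers.map(&:value)
end

# Four threads sharing two connections: fine as long as connections are returned.
p run_workers(ToyPool.new(size: 2, checkout_timeout: 0.5), threads: 4, leak: false)

# Same load, but connections leak: two workers succeed, the other two time out.
p run_workers(ToyPool.new(size: 2, checkout_timeout: 0.2), threads: 4, leak: true)
```

In this sketch the pool only runs dry when a thread fails to check its connection back in. With more threads than connections but proper check-in, workers merely wait for each other. Whether the custom load balancer behaves like the leaking case under Puma is exactly what needs to be validated.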
Edited Nov 06, 2019 by Kamil Trzciński