Skip to content

Do not checkout connection from pool in DatabaseSampler

What does this MR do and why?

As part of debugging #423945 (comment 1544793516) we were looking into why Rails is constantly timing out when trying to close connections to the database. It keeps trying to close a whole pool of connections until no connections are in use in the pool. We had an arbitrary timeout waiting for the pool to empty but it would always timeout (even in local testing).

The reason we sometimes need to disconnect to the datbase is that we have periodic service discovery looking for the current list of replica databases. Every time that changes in Rails we need to disconnect any previous list of replicas.

In the end we tracked down this DatabaseSampler which is started in a Thread when the application boots up. This thread at some point asks for a connection which checks it out of the connection pool. But it never checks it back in so after startup this connection will forever be wasted by this Thread. It turns out that this DatabaseSampler never really needed a connection from the pool for the specific things it was doing so I refactored the code to just avoid calling .connection and it seems to fix this offence.

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

Before After

How to set up and validate locally

  1. Follow setup instructions in https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/database_load_balancing_with_service_discovery.md
  2. You can spot that this connection is being checked out in many ways. I edited the code in ActiveRecord ::ActiveRecord::ConnectionAdapters::ConnectionPool#connection to write the backtrace when it was happening so I could see where it's called from:
       def connection
         if Thread.current.inspect.include?('database_sampler') || Thread.current.inspect.include?('background_task')
           File.open('log/broken.log', 'a') do |f|
             f.puts("______________________________________")
             f.puts(caller.join("\n"))
             f.puts("======================================")
           end
         end
         @thread_cached_conns[connection_cache_key(current_thread)] ||= checkout
       end
  3. More debugging stuff is in #423945 (comment 1544792096) where I confirmed that this was causing the disconnect timeout

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #423945

Edited by Dylan Griffith

Merge request reports