
Avoid using discarded connection pools in load balancer

What does this MR do and why?

For reasons we don't fully understand yet, Puma can apparently get into a state at startup where it repeatedly fails to boot with the error:

ERROR: The hook worker_start_hooks failed with exception (NoMethodError) "undefined method `delete' for nil:NilClass".

This appears to happen when a child Puma process attempts to use the master's discarded connection pool at startup. Here is the apparent sequence:

  1. The master process starts and initializes a load balancer.
  2. The master process forks.
  3. As a result of the fork, Rails' ForkTracker calls discard! on all connection pools, including the one used by the load balancer.
  4. A child process sets up its own load balancer and initializes a Gitlab::Database::LoadBalancing::Host for each of the configured replicas.
  5. Host attempts to create a Prometheus gauge but first checks whether the application settings have Prometheus metrics enabled.
  6. To perform this check, the load balancer created in the master process is used, because the child process is still setting up its own load balancer.
  7. The load balancer calls Host#refresh_status to check the status of the replicas.
  8. Because ForkTracker has already discarded the connection pools, the load balancer hits the NoMethodError above after releasing the connection (a minimal sketch of this failure mode follows the list).

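The sequence above can be illustrated with a short standalone script. This is a simplified sketch, not GitLab code and not the exact call path in the load balancer; it assumes the pg gem, a reachable PostgreSQL database, and Rails 7.x ConnectionPool internals:

# discarded_pool_sketch.rb -- simplified illustration only, not GitLab code.
require "active_record"

ActiveRecord::Base.establish_connection(
  adapter:  "postgresql",
  database: "gitlabhq_development" # any reachable database will do
)

pool = ActiveRecord::Base.connection_pool
ActiveRecord::Base.connection # check out and cache a connection for this thread

# This is roughly what Rails' ForkTracker does to the parent's pools inside
# a forked child: discard! nils out the pool's internal connection caches.
pool.discard!

# Releasing the previously cached connection now fails with
#   NoMethodError: undefined method `delete' for nil:NilClass
# (the exact internals depend on the Rails version).
pool.release_connection
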
For this to happen, a number of things must have occurred:

  1. The memoization for prometheus_metrics_enabled? had to be cleared for the child process to make another database query. The master process normally sets up this memoization in the 7_prometheus_metrics.rb initializer. UPDATE: This might occur if the checkbox value is disabled in the database. !203888 (merged) fixes this issue.

  2. Normally the load balancer trusts that the host objects are online for the first replica_check_interval (default 60 s) after initialization, so for the check in step 7 to run, more than 60 seconds must have elapsed between the initialization of HostList and the read attempt.

  3. The process memory store that caches ApplicationSettings had expired.

While this behavior appears difficult to reproduce organically, we can create those conditions artificially by modifying the code and settings. The fix is to ensure that discarded pools are never used, which allows the application to start properly.
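
As an illustration only (not necessarily the shape of the actual diff in this MR), the guard amounts to checking whether ForkTracker has already discarded a pool before the load balancer reuses it; ActiveRecord::ConnectionAdapters::ConnectionPool exposes a discarded? predicate (an internal :nodoc: API) for this:

# Hypothetical sketch of such a guard -- not the actual change in this MR.
# `pool` stands for whatever connection pool the load balancer is about to reuse.
def usable_pool?(pool)
  # ConnectionPool#discarded? returns true once discard! has nilled out the
  # pool's internal state after a fork (internal :nodoc: Rails API).
  !pool.nil? && !pool.discarded?
end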

References

Relates to https://gitlab.com/gitlab-com/request-for-help/-/issues/3317

How to set up and validate locally

  1. Enable PostgreSQL to listen on TCP. I hacked my GDK accordingly:
diff --git a/lib/gdk/config.rb b/lib/gdk/config.rb
index 7c4315d69..d2eb7a616 100644
--- a/lib/gdk/config.rb
+++ b/lib/gdk/config.rb
@@ -848,7 +848,7 @@ module GDK
       path(:data_dir) { config.postgresql.dir.join('data') }
       string(:host) { config.postgresql.dir.to_s }
       string(:active_version) { GDK::Postgresql.target_version.to_s }
-      string(:__active_host) { GDK::Postgresql.new.use_tcp? ? config.postgresql.host : '' }
+      string(:__active_host) { GDK::Postgresql.new.use_tcp? ? config.postgresql.host : 'localhost' }
       integer(:max_connections) { 100 }
 
       # Kept for backward compatibility. Use replica:root_directory instead
  2. gdk restart postgresql

  3. Configure your config/database.yml with load balancing. In the main section, insert a load_balancing section. You can just add replicas that use 127.0.0.1:

development:
  main:
    adapter: postgresql
    encoding: unicode
    database: gitlabhq_development
    host: /Users/stanhu/gitlab/gdk-ee/postgresql
    load_balancing: {"hosts":["127.0.0.1","127.0.0.1","127.0.0.1"]}
  4. Set application_settings_cache_seconds: 0 in config/gitlab.yml.

  5. Apply this diff:

diff --git a/config/initializers/7_prometheus_metrics.rb b/config/initializers/7_prometheus_metrics.rb
index 413b574af971..030983f6c0ed 100644
--- a/config/initializers/7_prometheus_metrics.rb
+++ b/config/initializers/7_prometheus_metrics.rb
@@ -80,6 +80,7 @@ def puma_dedicated_metrics_server?
 
 Gitlab::Cluster::LifecycleEvents.on_worker_start do
   defined?(Gitlab::Metrics.client.reinitialize_on_pid_change) && Gitlab::Metrics.client.reinitialize_on_pid_change
+  raise IOError, "This is to simulate an error"
   logger = Gitlab::AppLogger
   # Since we also run these samplers in the Puma primary, we need to re-create them each time we fork.
   # For Sidekiq, this does not make any difference, since there is no primary.
diff --git a/lib/gitlab/database/load_balancing/host.rb b/lib/gitlab/database/load_balancing/host.rb
index f77eedf265d8..978300c400db 100644
--- a/lib/gitlab/database/load_balancing/host.rb
+++ b/lib/gitlab/database/load_balancing/host.rb
@@ -70,7 +70,7 @@ def initialize(host, load_balancer, port: nil)
             port
           )
           @online = true
-          @last_checked_at = Time.zone.now
+          @last_checked_at = 0
           @lag_time = nil
           @lag_size = nil
 
diff --git a/lib/gitlab/metrics/labkit.rb b/lib/gitlab/metrics/labkit.rb
index 892d48fdeb61..2f26cef001b2 100644
--- a/lib/gitlab/metrics/labkit.rb
+++ b/lib/gitlab/metrics/labkit.rb
@@ -45,7 +45,7 @@ def histogram(name, docstring, base_labels = {}, buckets = ::Prometheus::Client:
 
         def error_detected!
           @prometheus_metrics_enabled = nil
-          client.disable!
+          # client.disable!
         end
 
         def prometheus_metrics_enabled?
  6. Run gdk restart rails-web.
  7. Watch gdk tail rails-web for failures.

Without this MR, the application repeatedly fails to boot with the NoMethodError above; with the fix applied, it boots successfully.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

