Avoid using discarded connection pools in load balancer
## What does this MR do and why?
For reasons we don't fully understand yet, it appears that at startup Puma can get into a state where it repeatedly fails to boot with this error:

```
ERROR: The hook worker_start_hooks failed with exception (NoMethodError) "undefined method `delete' for nil:NilClass".
```
It appears that this happens when a child Puma process attempts to use the master's discarded connection pool at startup. Here is the likely sequence:

- The master process starts and initializes a load balancer.
- The master process forks.
- As a result of the fork, Rails' `ForkTracker` calls `discard!` on all connection pools, including the one used by the load balancer.
- A child process sets up its own load balancer and initializes a `Gitlab::Database::LoadBalancing::Host` for each of the configured replicas.
- `Host` attempts to create a Prometheus gauge, but first checks whether the application settings have Prometheus metrics enabled.
- To perform this check, the load balancer created in the master process is used, because the child process is still in the middle of setting up its own load balancer.
- The load balancer calls `Host#refresh_status` to check the status of the replicas.
- Because `ForkTracker` already discarded the connection pools, the load balancer hits the `undefined method` error after releasing the connection.
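The sequence above can be sketched in plain Ruby. This is a hypothetical `Pool` class, not the actual ActiveRecord implementation: the point is that once `discard!` clears the pool's internal connection list, releasing a connection calls `delete` on `nil`, which matches the `NoMethodError` seen in the boot failure.

```ruby
# Hypothetical sketch of the failure mode (not the real ActiveRecord
# ConnectionPool): discard! clears internal state, so releasing a
# connection afterwards calls `delete` on nil.
class Pool
  def initialize
    @connections = []
  end

  def checkout
    conn = Object.new
    @connections << conn
    conn
  end

  def discard!
    @connections = nil # what ForkTracker effectively does on fork
  end

  def release(conn)
    @connections.delete(conn) # NoMethodError when @connections is nil
  end
end

pool = Pool.new
conn = pool.checkout
pool.discard!

begin
  pool.release(conn)
rescue NoMethodError => e
  puts e.message # e.g. undefined method `delete' for nil
end
```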
For this to happen, a number of things must have occurred:

- The memoization for `prometheus_metrics_enabled?` had to be cleared for the child process to make another database query. The master process normally sets up this memoization in the `7_prometheus_metrics.rb` initializer. UPDATE: This might occur if the checkbox value is disabled in the database. !203888 (merged) fixes this issue.
- Normally the load balancer trusts that the host objects have been online for the first `replica_check_interval` (default 60 seconds), but this suggests it took longer than 60 seconds between the initialization of `HostList` and the read attempt.
- The process memory store that caches `ApplicationSettings` has expired.
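The UPDATE above hints at a classic pitfall: if the memoization uses `||=` (an assumption here, not a quote of the real initializer), a setting stored as `false` is never cached, so every call falls through to a database query.

```ruby
# Sketch of how `||=` memoization fails for false values (assumed
# pattern; the real prometheus_metrics_enabled? check may differ).
class Settings
  attr_reader :query_count

  def initialize(value)
    @value = value
    @query_count = 0
  end

  def enabled?
    # `||=` re-evaluates whenever the cached value is false or nil,
    # so a disabled (false) setting triggers a query on every call.
    @enabled ||= begin
      @query_count += 1 # stands in for the database read
      @value
    end
  end
end

disabled = Settings.new(false)
3.times { disabled.enabled? }
disabled.query_count # => 3: memoization never took effect
```

A fix for that pitfall is to memoize with an explicit `nil` check (`return @enabled unless @enabled.nil?`) rather than `||=`.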
While this behavior appears difficult to reproduce, we can create those conditions artificially by modifying the code and settings. The fix is to ensure that discarded pools are never used, which allows the application to start properly.
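The idea behind the fix can be sketched as follows. The class and method names here are illustrative, not the actual MR code, and the `discarded?` predicate on the pool stand-in is an assumption: before handing out a pool, check whether it has been discarded and build a fresh one instead of reusing it.

```ruby
# Minimal stand-in for a connection pool that knows it was discarded.
class FakePool
  def initialize
    @discarded = false
  end

  def discard!
    @discarded = true
  end

  def discarded?
    @discarded
  end
end

# Illustrative guard (hypothetical names, not the actual MR change):
# never hand out a discarded pool; replace it with a fresh one.
class LoadBalancer
  def initialize(&pool_factory)
    @pool_factory = pool_factory
    @pool = pool_factory.call
  end

  def pool
    # Replace the pool if the fork hooks discarded it.
    @pool = @pool_factory.call if @pool.discarded?
    @pool
  end
end

lb = LoadBalancer.new { FakePool.new }
first = lb.pool
first.discard!      # simulates ForkTracker discarding pools on fork
fresh = lb.pool
fresh.equal?(first) # => false: the discarded pool is never reused
```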
## References
Relates to https://gitlab.com/gitlab-com/request-for-help/-/issues/3317
## How to set up and validate locally
- Enable PostgreSQL to listen on TCP. I hacked my GDK accordingly:

  ```diff
  diff --git a/lib/gdk/config.rb b/lib/gdk/config.rb
  index 7c4315d69..d2eb7a616 100644
  --- a/lib/gdk/config.rb
  +++ b/lib/gdk/config.rb
  @@ -848,7 +848,7 @@ module GDK
   path(:data_dir) { config.postgresql.dir.join('data') }
   string(:host) { config.postgresql.dir.to_s }
   string(:active_version) { GDK::Postgresql.target_version.to_s }
  -string(:__active_host) { GDK::Postgresql.new.use_tcp? ? config.postgresql.host : '' }
  +string(:__active_host) { GDK::Postgresql.new.use_tcp? ? config.postgresql.host : 'localhost' }
   integer(:max_connections) { 100 }
   # Kept for backward compatibility. Use replica:root_directory instead
  ```

- Run `gdk restart postgresql`.
- Configure your `config/database.yml` with load balancing. In the `main` section, insert a `load_balancing` section. You can just add replicas that use `127.0.0.1`:
  ```yaml
  development:
    main:
      adapter: postgresql
      encoding: unicode
      database: gitlabhq_development
      host: /Users/stanhu/gitlab/gdk-ee/postgresql
      load_balancing: {"hosts":["127.0.0.1","127.0.0.1","127.0.0.1"]}
  ```
- Set `application_settings_cache_seconds: 0` in `gitlab/config.yml`.
- Apply this diff:
  ```diff
  diff --git a/config/initializers/7_prometheus_metrics.rb b/config/initializers/7_prometheus_metrics.rb
  index 413b574af971..030983f6c0ed 100644
  --- a/config/initializers/7_prometheus_metrics.rb
  +++ b/config/initializers/7_prometheus_metrics.rb
  @@ -80,6 +80,7 @@ def puma_dedicated_metrics_server?
   Gitlab::Cluster::LifecycleEvents.on_worker_start do
   defined?(Gitlab::Metrics.client.reinitialize_on_pid_change) && Gitlab::Metrics.client.reinitialize_on_pid_change
  +  raise IOError, "This is to simulate an error"
   logger = Gitlab::AppLogger
   # Since we also run these samplers in the Puma primary, we need to re-create them each time we fork.
   # For Sidekiq, this does not make any difference, since there is no primary.
  diff --git a/lib/gitlab/database/load_balancing/host.rb b/lib/gitlab/database/load_balancing/host.rb
  index f77eedf265d8..978300c400db 100644
  --- a/lib/gitlab/database/load_balancing/host.rb
  +++ b/lib/gitlab/database/load_balancing/host.rb
  @@ -70,7 +70,7 @@ def initialize(host, load_balancer, port: nil)
   port
   )
   @online = true
  -  @last_checked_at = Time.zone.now
  +  @last_checked_at = 0
   @lag_time = nil
   @lag_size = nil
  diff --git a/lib/gitlab/metrics/labkit.rb b/lib/gitlab/metrics/labkit.rb
  index 892d48fdeb61..2f26cef001b2 100644
  --- a/lib/gitlab/metrics/labkit.rb
  +++ b/lib/gitlab/metrics/labkit.rb
  @@ -45,7 +45,7 @@ def histogram(name, docstring, base_labels = {}, buckets = ::Prometheus::Client:
   def error_detected!
   @prometheus_metrics_enabled = nil
  -  client.disable!
  +  # client.disable!
   end
   def prometheus_metrics_enabled?
  ```
- Run `gdk restart rails-web`.
- Watch `gdk tail rails-web` for failures and verify that the application starts properly.
## MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.