Sidekiq delayed strategy: Handle condition when all replica hosts are lagging behind / unavailable
Summary
When DB load balancing has been set up, the Sidekiq middleware may try to use the replica hosts for workers that specify `data_consistency :delayed` (see https://gitlab.com/gitlab-org/gitlab/blob/9599f8669a9395d5193be1b226b4c136f28f8e90/lib/gitlab/database/load_balancing/sidekiq_client_middleware.rb#L32-51).
However, when all replicas appear unusable, the `load_balancer.host` invocation can return `nil` (see https://gitlab.com/gitlab-org/gitlab/blob/9bb48c9b87e82324ad36fd6854739fade0c6d045/lib/gitlab/database/load_balancing/load_balancer.rb#L116-122 and https://gitlab.com/gitlab-org/gitlab/blob/4773f35ab32af759d09ecd9215133c0b13e4ca39/lib/gitlab/database/load_balancing/host_list.rb#L45-77).
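The failure mode can be illustrated with a minimal sketch. This is not the actual GitLab code; `FakeLoadBalancer` and its hash-based hosts are stand-ins for `Gitlab::Database::LoadBalancing::LoadBalancer` and its host objects, but the shape of the bug is the same: an unguarded method call on whatever `#host` returns.

```ruby
# Stand-in load balancer whose #host returns nil when every replica is
# lagging, mirroring the behavior described above (illustrative only).
class FakeLoadBalancer
  def initialize(hosts)
    @hosts = hosts
  end

  # Returns the first caught-up replica, or nil when all are lagging.
  def host
    @hosts.find { |h| h[:caught_up] }
  end
end

# Mirrors the unguarded call in the Sidekiq client middleware: when #host
# is nil, dereferencing it raises NoMethodError on NilClass.
def wal_location_for(load_balancer)
  load_balancer.host[:replica_location]
end

lb = FakeLoadBalancer.new([{ caught_up: false }, { caught_up: false }])
begin
  wal_location_for(lb)
rescue NoMethodError => e
  puts "reproduced: #{e.class}"
end
```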
This leads to such workers continually failing with the following error/trace:
"error_message": "undefined method `database_replica_location' for nil:NilClass",
"error_class": "NoMethodError",
"error_backtrace": [
"lib/gitlab/database/load_balancing/sidekiq_client_middleware.rb:51:in `wal_location_for'",
"lib/gitlab/database/load_balancing/sidekiq_client_middleware.rb:36:in `block in set_data_consistency_locations!'",
"lib/gitlab/database/load_balancing.rb:29:in `block in each_load_balancer'",
"lib/gitlab/database/load_balancing.rb:28:in `each'",
"lib/gitlab/database/load_balancing.rb:28:in `each_load_balancer'",
"lib/gitlab/database/load_balancing/sidekiq_client_middleware.rb:35:in `set_data_consistency_locations!'",
"lib/gitlab/database/load_balancing/sidekiq_client_middleware.rb:15:in `call'",
"lib/gitlab/application_context.rb:74:in `block in use'",
"lib/gitlab/application_context.rb:74:in `use'",
"lib/gitlab/application_context.rb:27:in `with_context'",
"lib/gitlab/sidekiq_middleware/worker_context/client.rb:28:in `block in call'",
"lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'",
"lib/gitlab/sidekiq_middleware/worker_context/client.rb:18:in `call'",
"config/initializers/forbid_sidekiq_in_transactions.rb:38:in `block (2 levels) in <module:NoEnqueueingFromTransactions>'",
"app/workers/concerns/application_worker.rb:92:in `perform_async'",
"app/workers/expire_job_cache_worker.rb:26:in `perform'",
"lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb:24:in `call'",
"lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executing.rb:16:in `perform'",
"lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:57:in `perform'",
"lib/gitlab/sidekiq_middleware/duplicate_jobs/server.rb:8:in `call'",
"lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'",
"lib/gitlab/sidekiq_middleware/worker_context/server.rb:17:in `block in call'",
"lib/gitlab/application_context.rb:74:in `block in use'",
"lib/gitlab/application_context.rb:74:in `use'",
"lib/gitlab/application_context.rb:27:in `with_context'",
"lib/gitlab/sidekiq_middleware/worker_context/server.rb:15:in `call'",
"lib/gitlab/sidekiq_status/server_middleware.rb:7:in `call'",
"lib/gitlab/sidekiq_versioning/middleware.rb:9:in `call'",
"lib/gitlab/sidekiq_middleware/admin_mode/server.rb:14:in `call'",
"lib/gitlab/sidekiq_middleware/instrumentation_logger.rb:9:in `call'",
"lib/gitlab/sidekiq_middleware/batch_loader.rb:7:in `call'",
"lib/gitlab/sidekiq_middleware/extra_done_log_metadata.rb:7:in `call'",
"lib/gitlab/sidekiq_middleware/request_store_middleware.rb:10:in `block in call'",
"lib/gitlab/with_request_store.rb:17:in `enabling_request_store'",
"lib/gitlab/with_request_store.rb:10:in `with_request_store'",
"lib/gitlab/sidekiq_middleware/request_store_middleware.rb:9:in `call'",
"lib/gitlab/sidekiq_middleware/server_metrics.rb:66:in `block in call'",
"lib/gitlab/sidekiq_middleware/server_metrics.rb:89:in `block in instrument'",
"lib/gitlab/metrics/background_transaction.rb:30:in `run'",
"lib/gitlab/sidekiq_middleware/server_metrics.rb:89:in `instrument'",
"lib/gitlab/sidekiq_middleware/server_metrics.rb:65:in `call'",
"lib/gitlab/sidekiq_middleware/monitor.rb:8:in `block in call'",
"lib/gitlab/sidekiq_daemon/monitor.rb:49:in `within_job'",
"lib/gitlab/sidekiq_middleware/monitor.rb:7:in `call'",
"lib/gitlab/sidekiq_middleware/size_limiter/server.rb:13:in `call'",
"lib/gitlab/sidekiq_logging/structured_logger.rb:19:in `call'"
Steps to reproduce
- Enable database load balancing on a self-managed instance
- Force a lag larger than the thresholds to occur on all the replica hosts
- Check the Sidekiq logs for failures matching the above
Example Project
None; the problem is not specific to any project.
What is the current bug behavior?
When all replicas are excluded due to replication lag, workers with `data_consistency :delayed` fail instead of falling back to the primary.
What is the expected correct behavior?
When all replicas are excluded due to replication lag, workers with `data_consistency :delayed` should fall back to the primary instead of failing.
Relevant logs and/or screenshots
Provided in description
Output of checks
This was observed by a customer on a self-managed GitLab v14.4.4-ee instance.
Related slack thread behind the investigation: https://gitlab.slack.com/archives/CNZ8E900G/p1639535728363100
Possible fixes
From the Slack discussion:
The Sidekiq middleware should handle the case where all replicas are down. See the analysis in #348510 (comment 780978695): if no replicas are available, we should fall back to the primary.
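The fallback could look something like the sketch below. This is a hedged illustration, not the actual fix: `FakeLoadBalancer`, its hash-based hosts, and the exact method names are assumptions for demonstration, although the real `LoadBalancer` does expose a `primary_write_location`.

```ruby
# Illustrative stand-in for the load balancer (not the real GitLab class).
class FakeLoadBalancer
  def initialize(replicas, primary_location)
    @replicas = replicas
    @primary_location = primary_location
  end

  # Returns the first caught-up replica, or nil when all are lagging.
  def host
    @replicas.find { |h| h[:caught_up] }
  end

  # WAL write location of the primary (value here is made up).
  def primary_write_location
    @primary_location
  end
end

# Guarded version: when every replica is lagging/unavailable, use the
# primary's WAL location instead of dereferencing nil.
def wal_location_for(load_balancer)
  replica = load_balancer.host
  replica ? replica[:replica_location] : load_balancer.primary_write_location
end

lb = FakeLoadBalancer.new([{ caught_up: false }], "0/AAAAAAA")
puts wal_location_for(lb)
```

With this guard, the worker still records a usable WAL location when no replica qualifies, so the job proceeds against the primary rather than raising `NoMethodError`.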
/cc @tmike