Skip to content

Ensure service discovery runs before secondaries are used

What does this MR do?

We use service discovery to dynamically find the ip addresses to use for postgres secondaries.

Before this change, when Gitlab::Database::LoadBalancing.start_service_discovery ran in config/initializers/load_balancing.rb, it started the first round of service discovery in a background thread. This created an opportunity for initialization to continue, including starting database requests that should use a secondary, before the list of secondaries was populated.

As a result, we see many instances of the no_secondaries_available warning here.

Here is an example log search showing a burst of these queries during application startup for a specific instance: https://log.gprd.gitlab.net/goto/4a4e74e81ed277cbe567da400f284952.

To stop this race condition, this MR simply runs the initial round of service discovery before spawning the worker thread, ensuring that the hosts list is populated before application startup continues.

Relates to #323726 (closed)

Screenshots or Screencasts (strongly suggested)

Before making this change, I ran rails with RAILS_ENV=test and added a log line where the hosts are loaded, along with a 0.2 second sleep before the first service discovery call (because my local dns is extremely fast).

I saw the following (note that the set of queries that run before service discovery starts have always run against the primary and won't cause any errors, since service discovery hasn't been set up yet):

test.log file showing race condition
        (0.2ms)  SELECT VERSION() /*application:test*/
  ↳ lib/gitlab/database/connection.rb:104:in `database_version'
  License Load (0.5ms)  SELECT "licenses".* FROM "licenses" ORDER BY "licenses"."id" DESC LIMIT 100 /*application:test*/
  ↳ lib/gitlab/json_cache.rb:51:in `fetch'
  ApplicationSetting Load (2.7ms)  SELECT "application_settings".* FROM "application_settings" ORDER BY "application_settings"."id" DESC LIMIT 1 /*application:test*/
  ↳ app/models/concerns/cacheable_attributes.rb:19:in `current_without_cache'
Creating scope :group_view_details. Overwriting existing method User.group_view_details.
   (1.3ms)  SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application:test*/
  ↳ config/initializers/fill_shards.rb:6:in `'
  GeoNode Exists? (0.8ms)  SELECT 1 AS one FROM "geo_nodes" LIMIT 1 /*application:test*/
  ↳ ee/lib/gitlab/geo.rb:49:in `block in enabled?'
   (0.7ms)  SELECT "shards"."name" FROM "shards" /*application:test*/
  ↳ app/models/shard.rb:12:in `populate!'
  ******* Here is where service discovery starts ******
  Gitlab::Database::PostgresPartition Load (4.6ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'audit_events')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Feature::FlipperGate Load (0.7ms)  SELECT "feature_gates".* FROM "feature_gates" WHERE "feature_gates"."feature_key" = 'partition_pruning_dry_run' /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Gitlab::Database::PostgresPartition Load (2.4ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'audit_events')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Gitlab::Database::PostgresPartition Load (3.0ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'web_hook_logs')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Gitlab::Database::PostgresPartition Load (2.4ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'web_hook_logs')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Gitlab::Database::PostgresPartition Load (2.7ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'incident_management_pending_alert_escalations')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Gitlab::Database::PostgresPartition Load (2.2ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'incident_management_pending_alert_escalations')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
Raven 3.1.2 configured not to capture errors: DSN not set
********* Here is where service discovery completes ********
REPLACING HOSTS
Creating scope :without_statuses. Overwriting existing method CommitStatus.without_statuses.
Creating scope :in_pipelines. Overwriting existing method Ci::Build.in_pipelines.

After making this change, I repeated the same procedure, to show that the race condition is fixed:

test.log file showing race condition fixed
   (0.2ms)  SELECT VERSION() /*application:test*/
  ↳ lib/gitlab/database/connection.rb:104:in `database_version'
  License Load (0.4ms)  SELECT "licenses".* FROM "licenses" ORDER BY "licenses"."id" DESC LIMIT 100 /*application:test*/
  ↳ lib/gitlab/json_cache.rb:51:in `fetch'
  ApplicationSetting Load (3.1ms)  SELECT "application_settings".* FROM "application_settings" ORDER BY "application_settings"."id" DESC LIMIT 1 /*application:test*/
  ↳ app/models/concerns/cacheable_attributes.rb:19:in `current_without_cache'
Creating scope :group_view_details. Overwriting existing method User.group_view_details.
   (2.0ms)  SELECT "schema_migrations"."version" FROM "schema_migrations" ORDER BY "schema_migrations"."version" ASC /*application:test*/
  ↳ config/initializers/fill_shards.rb:6:in `'
  GeoNode Exists? (0.8ms)  SELECT 1 AS one FROM "geo_nodes" LIMIT 1 /*application:test*/
  ↳ ee/lib/gitlab/geo.rb:49:in `block in enabled?'
   (0.7ms)  SELECT "shards"."name" FROM "shards" /*application:test*/
  ↳ app/models/shard.rb:12:in `populate!'
  *********** Here service discovery both starts and completes (since it runs in the main thread) **************
REPLACING HOSTS
  Gitlab::Database::PostgresPartition Load (4.9ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'audit_events')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Feature::FlipperGate Load (0.8ms)  SELECT "feature_gates".* FROM "feature_gates" WHERE "feature_gates"."feature_key" = 'partition_pruning_dry_run' /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  CACHE Gitlab::Database::PostgresPartition Load (0.0ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'audit_events')) ORDER BY "postgres_partitions"."name" ASC
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Gitlab::Database::PostgresPartition Load (2.9ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'web_hook_logs')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  CACHE Gitlab::Database::PostgresPartition Load (0.0ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'web_hook_logs')) ORDER BY "postgres_partitions"."name" ASC
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  Gitlab::Database::PostgresPartition Load (2.6ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'incident_management_pending_alert_escalations')) ORDER BY "postgres_partitions"."name" ASC /*application:test*/
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
  CACHE Gitlab::Database::PostgresPartition Load (0.0ms)  SELECT "postgres_partitions".* FROM "postgres_partitions" WHERE (parent_identifier = concat(current_schema(), '.', 'incident_management_pending_alert_escalations')) ORDER BY "postgres_partitions"."name" ASC
  ↳ lib/gitlab/database/load_balancing/connection_proxy.rb:95:in `block in read_using_load_balancer'
Raven 3.1.2 configured not to capture errors: DSN not set
Creating scope :without_statuses. Overwriting existing method CommitStatus.without_statuses.
Creating scope :in_pipelines. Overwriting existing method Ci::Build.in_pipelines.

How to setup and validate locally (strongly suggested)

Enabling service discovery locally is a bit complicated. I took the following steps:

  • brew install dnsmasq (dnsmasq is a local dns nameserver)

  • sudo brew services start dnsmasq to start dnsmasq on localhost:53

  • verify with dig @localhost localhost +short, which should return 127.0.0.1

  • Stop postgresql with gdk stop postgresql

  • restart postgresql so it exposes a tcp port: cd <gdk-root>/postgresql/data && pg_ctl start -D .

  • Change the development entry in config/database.yml to the following (comment out what was there before):

config/database.yml
development:
  main:
    adapter: postgresql
    encoding: unicode
    database: gitlabhq_development
    host: localhost
    port: 5432
    pool: 10
    prepared_statements: false
    variables:
      statement_timeout: 120s
    load_balancing:
      discover:
        nameserver: localhost
        record: localhost # We tell gitlab to use the same database as a secondary, but it doesn't know that.
        record_type: A
        port: 53
        interval: 60
        disconnect_timeout: 120
  • Do any testing you want to do

To undo these changes and get your system back to the way it started, you can do the following:

  • sudo brew services stop dnsmasq && brew uninstall dnsmasq

  • cd <gdk-root>/postgresql/data && pg_ctl stop -D .

  • gdk start postgresql

  • revert the changes to your config/database.yml

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Since this change only moves a single round of service discovery to a different thread, I believe it is relatively safe.

Edited by Simon Tomlinson

Merge request reports