Phase 3: [STG] Serve CI reads from CI standby cluster

Phase 3: Serve CI reads from CI standby cluster

Phase duration: Days to weeks

Summary: In this phase, only read traffic for CI data is served from the CI standby cluster (the CI replicas); writes still go to the shared primary. This requires a way to share a single primary write connection while using separate read replicas.

Infrastructure requirements:

Enable reads from CI replicas on staging

Application requirements:

Ability to share primary write connection while separate read replicas configured #341451 (closed)
Connection pools of all configured databases are properly sized #333411 (closed)
Safe rollout and rollback/feature flag option for enabling reads from CI read replicas #342487 (closed)
- One bug outstanding: The use_model_load_balancing results in a wrong sticking context used: !73949 (merged)

Optional additions:

Add QueryAnalyzers::GitlabSchemasMetrics as a way to observe used schemas: !73839 (merged)

Configuration expectations

GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main is configured so that main and ci re-use the same primary connection
GITLAB_MULTIPLE_DATABASE_METRICS=true is configured so that Prometheus metrics include db_config_name
Both main: and ci: are configured in the staging GitLab
main: and ci: share all configuration (pointing to the same primary database) except load_balancing: (pointing to different replica databases)
At this point the feature is configured, but not yet explicitly enabled. The application will open additional connections to the CI replica hosts, and a small amount of traffic related to reading replication lag will be observed. This is visible in SQL logs as db_config_name: ci_replica, and in Prometheus metrics through db_config_name
Before enabling the feature flag use_model_load_balancing, the configuration needs to be rolled out to all hosts. This is required because of limitations in resolving the sticking context: if ci: is not present on a host, it might result in the problem described in the MR fixing this bug: !73949 (merged)
Enable a percentage rollout of use_model_load_balancing so that a small percentage of requests use ci_replica

1. Configured environment variables:

GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main to make main: and ci: share the same primary connection
GITLAB_MULTIPLE_DATABASE_METRICS=true to add db_config_name to Prometheus metrics, indicating which database was used
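
A quick way to confirm that a node carries both variables is a small check like the one below. This is only a sketch in plain Ruby: `EXPECTED_ENV` and `env_mismatches` are hypothetical names made up for this illustration, not part of GitLab.

```ruby
# Expected values, taken from the two variables listed above.
# `EXPECTED_ENV` and `env_mismatches` are hypothetical names.
EXPECTED_ENV = {
  "GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci" => "main",
  "GITLAB_MULTIPLE_DATABASE_METRICS" => "true"
}.freeze

# Returns a hash of variable name => actual value (nil when unset) for every
# variable that does not match the expected value.
def env_mismatches(env = ENV)
  EXPECTED_ENV.each_with_object({}) do |(name, expected), mismatches|
    actual = env[name]
    mismatches[name] = actual unless actual == expected
  end
end

mismatches = env_mismatches
if mismatches.empty?
  puts "environment OK"
else
  mismatches.each do |name, actual|
    puts "#{name}: expected #{EXPECTED_ENV[name].inspect}, got #{actual.inspect}"
  end
end
```

Run it in the same environment as the Rails process, so that it sees the variables the application sees.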

2. Configured multiple databases yml

# Expected `config/database.yml`. 
# The `main/ci:` share all parameters except `load_balancing:`.
# `load_balancing:` is unique per `main/ci:` and they point to different consul/hosts of replicas.

production:
  main:
    adapter: postgresql
    database: gitlabhq_production
    host: postgres-main
    load_balancing:
      hosts: [postgres-main-replica]
  ci:
    adapter: postgresql
    database: gitlabhq_production
    host: postgres-main
    load_balancing:
      hosts: [postgres-ci-replica]

This is generated by CNG in the GitLab Rails container from these values: https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/database/values-decomposition.yaml
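
The rule that `main:` and `ci:` share everything except `load_balancing:` can be checked mechanically. The sketch below uses only Ruby's standard library and an inline copy of the expected configuration above. `decomposition_problems` is a hypothetical helper written for this illustration, not a GitLab API.

```ruby
require "yaml"

# Inline copy of the expected `config/database.yml` shown above.
DATABASE_YML = <<~YAML
  production:
    main:
      adapter: postgresql
      database: gitlabhq_production
      host: postgres-main
      load_balancing:
        hosts: [postgres-main-replica]
    ci:
      adapter: postgresql
      database: gitlabhq_production
      host: postgres-main
      load_balancing:
        hosts: [postgres-ci-replica]
YAML

# Returns a list of human-readable problems. An empty list means `main:` and
# `ci:` share the primary settings and point to different replica hosts.
# `decomposition_problems` is a hypothetical helper, not a GitLab API.
def decomposition_problems(config, env = "production")
  main = config.dig(env, "main") || {}
  ci = config.dig(env, "ci") || {}
  problems = []

  problems << "ci: section is missing" if ci.empty?

  # Every key except `load_balancing` must be identical in both sections.
  shared_keys = (main.keys | ci.keys) - ["load_balancing"]
  shared_keys.each do |key|
    problems << "#{key} differs between main: and ci:" unless main[key] == ci[key]
  end

  main_hosts = main.dig("load_balancing", "hosts")
  ci_hosts = ci.dig("load_balancing", "hosts")
  if main_hosts.nil? || ci_hosts.nil?
    problems << "load_balancing: hosts missing for main: or ci:"
  elsif main_hosts == ci_hosts
    problems << "main: and ci: use the same replica hosts"
  end

  problems
end

config = YAML.safe_load(DATABASE_YML)
puts decomposition_problems(config).inspect
```

On a real node, the same function could be given the parsed contents of the generated `config/database.yml` instead of the inline sample.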

3. Rollout plan:

3.1. Console node rollout

The purpose of the console node rollout is to validate that the application is correctly configured and can talk to multiple databases.

Configure GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main
Configure GITLAB_MULTIPLE_DATABASE_METRICS=true
Configure CNG for console node to enable multiple databases
Start a Rails console and run the validation commands to confirm that the application can talk to multiple databases

3.1.1. Validation Commands

Simple checks that the application sees the proper configuration. Expected: a ci load balancer, with ci_replica used for the read connection

[1] pry(main)> ApplicationRecord.load_balancer.name
=> :main
[2] pry(main)> Ci::ApplicationRecord.load_balancer.name
=> :ci
[3] pry(main)> ApplicationRecord.connection.pool.db_config.name
=> "main"
[4] pry(main)> Ci::ApplicationRecord.connection.pool.db_config.name
=> "main"
[5] pry(main)> Ci::ApplicationRecord.load_balancer.read { |connection| connection.pool.db_config.name }
=> "ci_replica"
[6] pry(main)> Ci::ApplicationRecord.load_balancer.read_write { |connection| connection.pool.db_config.name }
=> "main"

Simple checks that the application can talk to the additional ci_replica database. Expected: db_config_name:ci_replica

[10] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
[11] pry(main)> Ci::ApplicationRecord.load_balancer.read { |connection| connection.select_all("SELECT COUNT(*) FROM ci_instance_variables") }
  (20.3ms)  SELECT COUNT(*) FROM ci_instance_variables /*application:console,db_config_name:ci_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
=> #<ActiveRecord::Result:0x00007fcfc79ccdb0 @column_types={}, @columns=["count"], @hash_rows=nil, @rows=[[1]]>

Checks that the application uses main_replica when use_model_load_balancing is disabled. Expected: db_config_name:main_replica

[14] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
[15] pry(main)> Feature.remove(:use_model_load_balancing)
[16] pry(main)> RequestStore.begin!
[17] pry(main)> RequestStore.clear!
=> true
[18] pry(main)> Ci::ApplicationRecord.connection.select_all("SELECT 1")
  # expected is to see `db_config_name:main_replica`
  (0.6ms)  SELECT 1 /*application:console,db_config_name:main_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
=> #<ActiveRecord::Result:0x00007fcfc7261580 @column_types={}, @columns=["?column?"], @hash_rows=nil, @rows=[[1]]>

Checks that the application uses ci_replica when use_model_load_balancing is enabled. Expected: db_config_name:ci_replica

[19] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
[20] pry(main)> Feature.enable(:use_model_load_balancing)
[21] pry(main)> RequestStore.begin!
[22] pry(main)> RequestStore.clear!
=> true
[23] pry(main)> Ci::ApplicationRecord.connection.select_all("SELECT 1")
  # expected is to see `db_config_name:ci_replica`
  (0.4ms)  SELECT 1 /*application:console,db_config_name:ci_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
=> #<ActiveRecord::Result:0x00007fcfc67a97c8 @column_types={}, @columns=["?column?"], @hash_rows=nil, @rows=[[1]]>

Clean up the state of the feature flag.

[25] pry(main)> Feature.remove(:use_model_load_balancing)

3.2. All nodes configured

The purpose of this step is to roll out all configuration changes (multiple databases and environment variables) to all nodes without changing the feature flag yet. We expect a very small number of requests to use ci_replica at this point: these requests check WAL replication lag.

Configure all nodes with GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main
Configure all nodes with GITLAB_MULTIPLE_DATABASE_METRICS=true
Configure all nodes to enable multiple databases
Observe logs and Prometheus metrics for errors
Update Grafana metrics, update ELK to index new columns

3.2.1. Observable logs

All logs split the db_*_count metrics into separate buckets, one per connection used:

db_primary_main_*
db_replica_main_*
db_replica_ci_*: we expect a small number of requests with values > 0, equal to db_replica_ci_wal_count
db_primary_ci_*: entries are still present, but all values should be == 0

Puma logs:

{"method":"GET","path":"/","format":"html","controller":"RootController","action":"index","status":302,"location":"http://gitlab-mbp.home:3000/users/sign_in","time":"2021-11-08T16:31:08.483Z","params":[],"correlation_id":"01FM0673X6885B1RA9FHS89YWZ","meta.caller_id":"RootController#index","meta.remote_ip":"10.0.2.2","meta.feature_category":"projects","meta.client_id":"ip/10.0.2.2","remote_ip":"10.0.2.2","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36","queue_duration_s":0.316837,"request_urgency":"default","target_duration_s":1,"redis_calls":16,"redis_duration_s":0.005204,"redis_read_bytes":261,"redis_write_bytes":25691,"redis_cache_calls":16,"redis_cache_duration_s":0.005204,"redis_cache_read_bytes":261,"redis_cache_write_bytes":25691,"db_count":11,"db_write_count":0,"db_cached_count":0,"db_replica_count":11,"db_replica_main_count":0,"db_replica_ci_count":0,"db_replica_cached_count":0,"db_replica_main_cached_count":0,"db_replica_ci_cached_count":0,"db_replica_wal_count":0,"db_replica_main_wal_count":0,"db_replica_ci_wal_count":0,"db_replica_wal_cached_count":0,"db_replica_main_wal_cached_count":0,"db_replica_ci_wal_cached_count":0,"db_primary_count":0,"db_primary_main_count":0,"db_primary_ci_count":0,"db_primary_cached_count":0,"db_primary_main_cached_count":0,"db_primary_ci_cached_count":0,"db_primary_wal_count":0,"db_primary_main_wal_count":0,"db_primary_ci_wal_count":0,"db_primary_wal_cached_count":0,"db_primary_main_wal_cached_count":0,"db_primary_ci_wal_cached_count":0,"db_replica_duration_s":0.024,"db_replica_main_duration_s":0.0,"db_replica_ci_duration_s":0.0,"db_primary_duration_s":0.0,"db_primary_main_duration_s":0.0,"db_primary_ci_duration_s":0.0,"cpu_s":3.590795,"mem_objects":5086044,"mem_bytes":240661067,"mem_mallocs":1066189,"mem_total_bytes":444102827,"pid":84,"db_duration_s":0.00779,"view_duration_s":0.0,"duration_s":0.37309}

Sidekiq logs:

sidekiq_1           | {"severity":"INFO","time":"2021-11-08T16:25:16.300Z","retry":0,"queue":"cronjob:elastic_index_bulk_cron","backtrace":true,"version":0,"queue_namespace":"cronjob","args":[],"class":"ElasticIndexBulkCronWorker","jid":"8cc8ec5365d0a57106d81ab1","created_at":"2021-11-08T16:25:14.591Z","meta.caller_id":"Cronjob","meta.feature_category":"global_search","correlation_id":"3dfe27e116d60abef2c444b80d6f3909","worker_data_consistency":"sticky","wal_locations":{"ci":"1/9CC53A98"},"idempotency_key":"resque:gitlab:duplicate:cronjob:elastic_index_bulk_cron:f252f68f3cc1cae1877f9a0e1f5b889102a68a9e335a3a5a9e683c7bdf0507f5","size_limiter":"validated","enqueued_at":"2021-11-08T16:25:14.615Z","job_size_bytes":2,"pid":74,"message":"ElasticIndexBulkCronWorker JID-8cc8ec5365d0a57106d81ab1: done: 1.683235 sec","job_status":"done","scheduling_latency_s":0.001509,"redis_calls":19,"redis_duration_s":0.00569,"redis_read_bytes":12,"redis_write_bytes":1862,"redis_queues_calls":1,"redis_queues_duration_s":0.000138,"redis_queues_read_bytes":10,"redis_queues_write_bytes":360,"redis_shared_state_calls":18,"redis_shared_state_duration_s":0.005552,"redis_shared_state_read_bytes":2,"redis_shared_state_write_bytes":1502,"db_count":1,"db_write_count":0,"db_cached_count":0,"db_replica_count":1,"db_replica_main_count":0,"db_replica_ci_count":0,"db_replica_cached_count":0,"db_replica_main_cached_count":0,"db_replica_ci_cached_count":0,"db_replica_wal_count":0,"db_replica_main_wal_count":0,"db_replica_ci_wal_count":0,"db_replica_wal_cached_count":0,"db_replica_main_wal_cached_count":0,"db_replica_ci_wal_cached_count":0,"db_primary_count":0,"db_primary_main_count":0,"db_primary_ci_count":0,"db_primary_cached_count":0,"db_primary_main_cached_count":0,"db_primary_ci_cached_count":0,"db_primary_wal_count":0,"db_primary_main_wal_count":0,"db_primary_ci_wal_count":0,"db_primary_wal_cached_count":0,"db_primary_main_wal_cached_count":0,"db_primary_ci_wal_cached_count":0,"db_replica_duration_s":0.002,"db_replica_main_duration_s":0.0,"db_replica_ci_duration_s":0.0,"db_primary_duration_s":0.0,"db_primary_main_duration_s":0.0,"db_primary_ci_duration_s":0.0,"cpu_s":0.011537,"mem_objects":6405,"mem_bytes":1782888,"mem_mallocs":3317,"mem_total_bytes":2039088,"extra.elastic_index_bulk_cron_worker.records_count":0,"duration_s":1.683235,"completed_at":"2021-11-08T16:25:16.300Z","load_balancing_strategy":"replica","db_duration_s":0.001186}
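
When scanning many such log lines, it can help to reduce each one to its per-connection counters and test the expectation for this step. The sketch below uses plain Ruby and a shortened, made-up log line (`SAMPLE_LOG`); both helper names are hypothetical. It assumes that the expectation "CI primary counters stay at 0" refers to the `db_primary_ci_*` fields seen in the sample logs.

```ruby
require "json"

# Shortened, hypothetical log line: only the counters relevant here are kept
# and the values are invented for illustration.
SAMPLE_LOG = '{"db_count":12,"db_replica_count":12,"db_replica_main_count":11,' \
             '"db_replica_ci_count":1,"db_replica_ci_wal_count":1,' \
             '"db_primary_count":0,"db_primary_main_count":0,"db_primary_ci_count":0}'

# Groups the plain `*_count` fields of one structured log line by connection,
# for example "replica_ci" => 1. Hypothetical helper for this illustration.
def db_counts_by_connection(line)
  entry = JSON.parse(line)
  counts = Hash.new(0)
  entry.each do |key, value|
    match = key.match(/\Adb_(primary|replica)_(main|ci)_count\z/)
    next unless match

    counts["#{match[1]}_#{match[2]}"] += value
  end
  counts
end

# True when the line matches the expectation for this step: no CI primary
# queries, and every CI replica query is a WAL (replication lag) check.
def ci_traffic_as_expected?(line)
  entry = JSON.parse(line)
  entry.fetch("db_primary_ci_count", 0).zero? &&
    entry.fetch("db_replica_ci_count", 0) == entry.fetch("db_replica_ci_wal_count", 0)
end

puts db_counts_by_connection(SAMPLE_LOG).inspect
puts ci_traffic_as_expected?(SAMPLE_LOG)
```

Once `use_model_load_balancing` is enabled, `db_replica_ci_count` is expected to exceed the WAL count, so the second check only applies before the flag rollout.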

3.2.2. Observable prometheus metrics

A number of metrics will receive db_config_name. It indicates which DB connection was used and can be one of: main, main_replica, ci_replica.

gitlab_transaction_db_primary_count_total{db_config_name="main"}
gitlab_transaction_db_primary_cached_count_total{db_config_name="main"}
gitlab_transaction_db_replica_count_total{db_config_name=~"main_replica|ci_replica"}
gitlab_transaction_db_replica_cached_count_total{db_config_name=~"main_replica|ci_replica"}

3.3. Rollout `use_model_load_balancing`

The purpose of this step is to actually route CI read traffic to the dedicated CI replicas, using a dedicated feature flag that allows a percentage rollout.

Enable 0.01% for use_model_load_balancing FF using ChatOps for staging
Monitor all metrics from 3.2.
Enable 1% for use_model_load_balancing
Enable 50% for use_model_load_balancing
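
The staged rollout above, together with the rollback option, can be pictured as a simple gated loop. This is only a sketch in plain Ruby: `staged_rollout`, `set_percentage` and `healthy` are hypothetical names. In practice the percentage is set through ChatOps and the health check is a person reading the metrics from 3.2.

```ruby
# Rollout percentages taken from the plan above.
ROLLOUT_STAGES = [0.01, 1, 50].freeze

# Walks through the stages in order. `set_percentage` and `healthy` are
# caller-supplied callables (hypothetical): the first applies a percentage to
# the feature flag, the second reports whether the metrics still look fine at
# that percentage. On the first unhealthy stage the flag is set back to 0 and
# the walk stops.
# Returns the percentage left in effect: the last stage, or 0 after rollback.
def staged_rollout(set_percentage:, healthy:, stages: ROLLOUT_STAGES)
  in_effect = 0
  stages.each do |percentage|
    set_percentage.call(percentage)
    unless healthy.call(percentage)
      set_percentage.call(0)
      return 0
    end
    in_effect = percentage
  end
  in_effect
end

# Example run with stand-in callables: pretend the 50% stage shows errors.
applied = []
result = staged_rollout(
  set_percentage: ->(pct) { applied << pct },
  healthy: ->(pct) { pct < 50 }
)
puts "applied: #{applied.inspect}, in effect: #{result}"
```

The point of the sketch is the ordering: each increase is followed by a check, and any failed check returns the flag to 0 rather than leaving it at the failing percentage.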

4. Schematic overview:

Edited Feb 04, 2022 by Dylan Griffith
