Phase 3: [STG] Serve CI reads from CI standby cluster
Phase duration: Days to weeks
Summary: In this phase, only read traffic for CI data is served from the CI database replicas. This requires a way to share a primary write connection while using separate read replicas.
Infrastructure requirements:
- Enable reads from CI replicas on staging
Application requirements:
- Ability to share primary write connection while separate read replicas configured #341451 (closed)
- Connection pools of all configured databases are properly sized #333411 (closed)
- Safe rollout and rollback/feature flag option for enabling reads from CI read replicas #342487 (closed)
  - One bug outstanding: The `use_model_load_balancing` results in a wrong sticking context used: !73949 (merged)
Optional additions:
- Add `QueryAnalyzers::GitlabSchemasMetrics` as a way to observe used schemas: !73839 (merged)
Configuration expectations
- `GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main` is configured to enable re-use of the primary connection when accessing `main` or `ci`
- `GITLAB_MULTIPLE_DATABASE_METRICS=true` is configured so that Prometheus metrics include `db_config_name`
- `main:` and `ci:` are configured in staging GitLab
- `main:` and `ci:` share all configuration (pointing to the same primary database) except `load_balancing:` (pointing to different replica databases)
- At this point the feature is configured, but not yet explicitly enabled. The application will therefore open additional connections to CI replica hosts, and a small amount of traffic related to reading replication lag will be observed. This is visible in SQL logs as `db_config_name: ci_replica`, and in Prometheus metrics through the `db_config_name` label
- Before enabling the feature flag `use_model_load_balancing`, the configuration needs to be rolled out to all hosts. This is required due to limitations in resolving the sticking context. If no `ci:` is present, it might result in the problem described in the MR fixing this bug: !73949 (merged)
- Enable percentage rollout of `use_model_load_balancing` so that a small percentage of requests use `ci_replica`
1. Configured environment variables:
- `GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main` to make `main/ci:` share the same primary connection
- `GITLAB_MULTIPLE_DATABASE_METRICS=true` to enable `db_config_name` in Prometheus metrics to indicate the database used
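Before moving on, it can be useful to confirm that a node actually has both variables set. The following is a minimal sketch, not part of GitLab: `EXPECTED_ENV` and `misconfigured_env` are hypothetical names, and the helper only compares a Hash-like object against the two values listed above.

```ruby
# Hypothetical helper, not part of GitLab: it compares a set of environment
# variables against the two settings this phase expects.
EXPECTED_ENV = {
  "GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci" => "main",
  "GITLAB_MULTIPLE_DATABASE_METRICS" => "true"
}.freeze

# Returns the names of expected variables that are missing or have a
# different value. Accepts ENV or any Hash responding to #[].
def misconfigured_env(env = ENV)
  EXPECTED_ENV.reject { |name, value| env[name] == value }.keys
end
```

Calling `misconfigured_env` without arguments checks the current process environment; an empty array means both variables match.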
2. Configured multiple databases yml
    # Expected `config/database.yml`.
    # The `main/ci:` share all parameters except `load_balancing:`.
    # `load_balancing:` is unique per `main/ci:` and they point to different consul/hosts of replicas.
    production:
      main:
        adapter: postgresql
        database: gitlabhq_production
        host: postgres-main
        load_balancing:
          hosts: [postgres-main-replica]
      ci:
        adapter: postgresql
        database: gitlabhq_production
        host: postgres-main
        load_balancing:
          hosts: [postgres-ci-replica]
This is generated by CNG in the GitLab Rails container from these values: https://gitlab.com/gitlab-org/charts/gitlab/-/blob/master/examples/database/values-decomposition.yaml
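The rule that `main:` and `ci:` share everything except `load_balancing:` can be checked mechanically on a generated file. The sketch below is an illustration only: `decomposition_problems` and `SAMPLE_DATABASE_YML` are hypothetical names, and the sample mirrors the expected configuration shown above rather than a real staging file.

```ruby
require "yaml"

# Sample input mirroring the expected config/database.yml from this document.
SAMPLE_DATABASE_YML = <<~CONFIG
  production:
    main:
      adapter: postgresql
      database: gitlabhq_production
      host: postgres-main
      load_balancing:
        hosts: [postgres-main-replica]
    ci:
      adapter: postgresql
      database: gitlabhq_production
      host: postgres-main
      load_balancing:
        hosts: [postgres-ci-replica]
CONFIG

# Hypothetical check: returns human-readable problems, or an empty array when
# main and ci differ only in their load_balancing section.
def decomposition_problems(database_yml, env: "production")
  config = YAML.safe_load(database_yml).fetch(env)
  main = config.fetch("main")
  ci = config.fetch("ci")

  shared_keys = (main.keys | ci.keys) - ["load_balancing"]
  problems = shared_keys
    .select { |key| main[key] != ci[key] }
    .map { |key| "#{key} differs between main and ci" }

  if main["load_balancing"] == ci["load_balancing"]
    problems << "load_balancing is identical for main and ci"
  end
  problems
end

p decomposition_problems(SAMPLE_DATABASE_YML) # => []
```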
3. Rollout plan:
3.1. Console node rollout
The purpose of the console node rollout is to validate that the application is correctly configured and can talk to multiple databases.
- Configure `GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main`
- Configure `GITLAB_MULTIPLE_DATABASE_METRICS=true`
- Configure CNG for the console node to enable multiple databases
- Start Rails Console and run a set of validation commands to confirm that the application can talk to multiple databases
3.1.1. Validation Commands
- Simple checks if the application sees a proper configuration. Expected: `ci` load balancer and `ci_replica` for the read connection
    [1] pry(main)> ApplicationRecord.load_balancer.name
    => :main
    [2] pry(main)> Ci::ApplicationRecord.load_balancer.name
    => :ci
    [3] pry(main)> ApplicationRecord.connection.pool.db_config.name
    => "main"
    [4] pry(main)> Ci::ApplicationRecord.connection.pool.db_config.name
    => "main"
    [5] pry(main)> Ci::ApplicationRecord.load_balancer.read { |connection| connection.pool.db_config.name }
    => "ci_replica"
    [6] pry(main)> Ci::ApplicationRecord.load_balancer.read_write { |connection| connection.pool.db_config.name }
    => "main"
- Simple checks to see if the application can talk to the additional `ci_replica` database. Expected: `db_config_name:ci_replica`
    [10] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
    [11] pry(main)> Ci::ApplicationRecord.load_balancer.read { |connection| connection.select_all("SELECT COUNT(*) FROM ci_instance_variables") }
       (20.3ms)  SELECT COUNT(*) FROM ci_instance_variables /*application:console,db_config_name:ci_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
    => #<ActiveRecord::Result:0x00007fcfc79ccdb0 @column_types={}, @columns=["count"], @hash_rows=nil, @rows=[[1]]>
- Checks that the application uses `main_replica` when `use_model_load_balancing` is disabled. Expected: `db_config_name:main_replica`
    [14] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
    [15] pry(main)> Feature.remove(:use_model_load_balancing)
    [16] pry(main)> RequestStore.begin!
    [17] pry(main)> RequestStore.clear!
    => true
    [18] pry(main)> Ci::ApplicationRecord.connection.select_all("SELECT 1") # expected is to see `db_config_name:main_replica`
       (0.6ms)  SELECT 1 /*application:console,db_config_name:main_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
    => #<ActiveRecord::Result:0x00007fcfc7261580 @column_types={}, @columns=["?column?"], @hash_rows=nil, @rows=[[1]]>
- Checks that the application uses `ci_replica` when `use_model_load_balancing` is enabled. Expected: `db_config_name:ci_replica`
    [19] pry(main)> ActiveRecord::Base.logger = Logger.new(STDOUT)
    [20] pry(main)> Feature.enable(:use_model_load_balancing)
    [21] pry(main)> RequestStore.begin!
    [22] pry(main)> RequestStore.clear!
    => true
    [23] pry(main)> Ci::ApplicationRecord.connection.select_all("SELECT 1") # expected is to see `db_config_name:ci_replica`
       (0.4ms)  SELECT 1 /*application:console,db_config_name:ci_replica,line:/data/cache/bundle-2.7.4/ruby/2.7.0/gems/marginalia-1.10.0/lib/marginalia/comment.rb:25:in `block in construct_comment'*/
    => #<ActiveRecord::Result:0x00007fcfc67a97c8 @column_types={}, @columns=["?column?"], @hash_rows=nil, @rows=[[1]]>
- Cleanup state of feature flag.
    [25] pry(main)> Feature.remove(:use_model_load_balancing)
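Each check above comes down to reading the `db_config_name` value out of the SQL comment appended to the logged query. When scanning many log lines, a small helper can pull that value out. This is a sketch under the assumption that the comment format matches the console output shown above; `db_config_name_from` is a hypothetical name, not a GitLab or Marginalia API.

```ruby
# Hypothetical helper: extracts the db_config_name value from the SQL comment
# in a logged query line, or returns nil when the annotation is absent.
def db_config_name_from(log_line)
  log_line[/db_config_name:(\w+)/, 1]
end

line = "(0.4ms) SELECT 1 /*application:console,db_config_name:ci_replica,line:...*/"
puts db_config_name_from(line) # prints "ci_replica"
```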
3.2. All nodes configured
The purpose of this step is to roll out all configuration changes (multiple databases and environment variables) to all nodes without changing the feature flag yet. We expect a very small number of requests to use `ci_replica` at this point. These requests check WAL replication lag.
- Configure all nodes with `GITLAB_LOAD_BALANCING_REUSE_PRIMARY_ci=main`
- Configure all nodes with `GITLAB_MULTIPLE_DATABASE_METRICS=true`
- Configure all nodes to enable multiple databases
- Observe logs and Prometheus metrics for errors
- Update Grafana metrics, update ELK to index new columns
3.2.1. Observable logs
All logs will split `db_*_count` metrics into separate buckets describing each used connection:
- `db_main_*`
- `db_replica_main_*`
- `db_replica_ci_*`: we expect a small number of requests to be `> 0`, equal to `db_replica_ci_wal_count`
- we would still see entries for `db_ci_*`, but all values should be `== 0`
Puma logs:
{"method":"GET","path":"/","format":"html","controller":"RootController","action":"index","status":302,"location":"http://gitlab-mbp.home:3000/users/sign_in","time":"2021-11-08T16:31:08.483Z","params":[],"correlation_id":"01FM0673X6885B1RA9FHS89YWZ","meta.caller_id":"RootController#index","meta.remote_ip":"10.0.2.2","meta.feature_category":"projects","meta.client_id":"ip/10.0.2.2","remote_ip":"10.0.2.2","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36","queue_duration_s":0.316837,"request_urgency":"default","target_duration_s":1,"redis_calls":16,"redis_duration_s":0.005204,"redis_read_bytes":261,"redis_write_bytes":25691,"redis_cache_calls":16,"redis_cache_duration_s":0.005204,"redis_cache_read_bytes":261,"redis_cache_write_bytes":25691,"db_count":11,"db_write_count":0,"db_cached_count":0,"db_replica_count":11,"db_replica_main_count":0,"db_replica_ci_count":0,"db_replica_cached_count":0,"db_replica_main_cached_count":0,"db_replica_ci_cached_count":0,"db_replica_wal_count":0,"db_replica_main_wal_count":0,"db_replica_ci_wal_count":0,"db_replica_wal_cached_count":0,"db_replica_main_wal_cached_count":0,"db_replica_ci_wal_cached_count":0,"db_primary_count":0,"db_primary_main_count":0,"db_primary_ci_count":0,"db_primary_cached_count":0,"db_primary_main_cached_count":0,"db_primary_ci_cached_count":0,"db_primary_wal_count":0,"db_primary_main_wal_count":0,"db_primary_ci_wal_count":0,"db_primary_wal_cached_count":0,"db_primary_main_wal_cached_count":0,"db_primary_ci_wal_cached_count":0,"db_replica_duration_s":0.024,"db_replica_main_duration_s":0.0,"db_replica_ci_duration_s":0.0,"db_primary_duration_s":0.0,"db_primary_main_duration_s":0.0,"db_primary_ci_duration_s":0.0,"cpu_s":3.590795,"mem_objects":5086044,"mem_bytes":240661067,"mem_mallocs":1066189,"mem_total_bytes":444102827,"pid":84,"db_duration_s":0.00779,"view_duration_s":0.0,"duration_s":0.37309}
Sidekiq logs:
sidekiq_1 | {"severity":"INFO","time":"2021-11-08T16:25:16.300Z","retry":0,"queue":"cronjob:elastic_index_bulk_cron","backtrace":true,"version":0,"queue_namespace":"cronjob","args":[],"class":"ElasticIndexBulkCronWorker","jid":"8cc8ec5365d0a57106d81ab1","created_at":"2021-11-08T16:25:14.591Z","meta.caller_id":"Cronjob","meta.feature_category":"global_search","correlation_id":"3dfe27e116d60abef2c444b80d6f3909","worker_data_consistency":"sticky","wal_locations":{"ci":"1/9CC53A98"},"idempotency_key":"resque:gitlab:duplicate:cronjob:elastic_index_bulk_cron:f252f68f3cc1cae1877f9a0e1f5b889102a68a9e335a3a5a9e683c7bdf0507f5","size_limiter":"validated","enqueued_at":"2021-11-08T16:25:14.615Z","job_size_bytes":2,"pid":74,"message":"ElasticIndexBulkCronWorker JID-8cc8ec5365d0a57106d81ab1: done: 1.683235 sec","job_status":"done","scheduling_latency_s":0.001509,"redis_calls":19,"redis_duration_s":0.00569,"redis_read_bytes":12,"redis_write_bytes":1862,"redis_queues_calls":1,"redis_queues_duration_s":0.000138,"redis_queues_read_bytes":10,"redis_queues_write_bytes":360,"redis_shared_state_calls":18,"redis_shared_state_duration_s":0.005552,"redis_shared_state_read_bytes":2,"redis_shared_state_write_bytes":1502,"db_count":1,"db_write_count":0,"db_cached_count":0,"db_replica_count":1,"db_replica_main_count":0,"db_replica_ci_count":0,"db_replica_cached_count":0,"db_replica_main_cached_count":0,"db_replica_ci_cached_count":0,"db_replica_wal_count":0,"db_replica_main_wal_count":0,"db_replica_ci_wal_count":0,"db_replica_wal_cached_count":0,"db_replica_main_wal_cached_count":0,"db_replica_ci_wal_cached_count":0,"db_primary_count":0,"db_primary_main_count":0,"db_primary_ci_count":0,"db_primary_cached_count":0,"db_primary_main_cached_count":0,"db_primary_ci_cached_count":0,"db_primary_wal_count":0,"db_primary_main_wal_count":0,"db_primary_ci_wal_count":0,"db_primary_wal_cached_count":0,"db_primary_main_wal_cached_count":0,"db_primary_ci_wal_cached_count":0,"db_replica_duration_s":0.002,"db_replica_main_duration_s":0.0,"db_replica_ci_duration_s":0.0,"db_primary_duration_s":0.0,"db_primary_main_duration_s":0.0,"db_primary_ci_duration_s":0.0,"cpu_s":0.011537,"mem_objects":6405,"mem_bytes":1782888,"mem_mallocs":3317,"mem_total_bytes":2039088,"extra.elastic_index_bulk_cron_worker.records_count":0,"duration_s":1.683235,"completed_at":"2021-11-08T16:25:16.300Z","load_balancing_strategy":"replica","db_duration_s":0.001186}
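The expectations for this step can be checked per log entry. The sketch below is illustrative only: `ci_log_problems` is a hypothetical helper, and it assumes that the `db_ci_*` bucket mentioned above corresponds to the `db_primary_ci_*` fields that appear in the example log lines.

```ruby
require "json"

# Hypothetical check for one structured log line (Puma or Sidekiq).
# Assumption: "db_ci_*" in the text refers to the db_primary_ci_* fields.
def ci_log_problems(log_line)
  # Drop any container prefix such as "sidekiq_1 | " before the JSON object.
  entry = JSON.parse(log_line.sub(/\A[^{]*/, ""))
  problems = []

  ci_reads = entry.fetch("db_replica_ci_count", 0)
  ci_wal = entry.fetch("db_replica_ci_wal_count", 0)
  if ci_reads != ci_wal
    problems << "db_replica_ci_count (#{ci_reads}) != db_replica_ci_wal_count (#{ci_wal})"
  end

  entry.each do |key, value|
    next unless key.start_with?("db_primary_ci_")
    problems << "#{key} is #{value}, expected 0" if value.is_a?(Numeric) && !value.zero?
  end
  problems
end
```

An empty array means the entry only used the CI replica for WAL lag checks, which is what this step expects before the feature flag is enabled.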
3.2.2. Observable Prometheus metrics
A number of metrics will receive a `db_config_name` label. It indicates which DB connection was used, which can be `main`, `main_replica`, or `ci_replica`.
gitlab_transaction_db_primary_count_total{db_config_name="main"}
gitlab_transaction_db_primary_cached_count_total{db_config_name="main"}
gitlab_transaction_db_replica_count_total{db_config_name=~"main_replica|ci_replica"}
gitlab_transaction_db_replica_cached_count_total{db_config_name=~"main_replica|ci_replica"}
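To see how traffic splits across connections, the samples of one metric can be summed per `db_config_name`. The following is a rough sketch that parses Prometheus text exposition output. `totals_by_db_config_name` is a hypothetical helper, the sample values are made up, and the label values follow the `db_config_name` strings seen in the SQL logs.

```ruby
# Hypothetical helper: sums the sample values of one metric, grouped by the
# db_config_name label, from Prometheus text exposition lines.
def totals_by_db_config_name(exposition_text, metric)
  pattern = /\A#{Regexp.escape(metric)}\{[^}]*db_config_name="([^"]+)"[^}]*\}\s+(\S+)/
  totals = Hash.new(0.0)
  exposition_text.each_line do |line|
    match = line.match(pattern)
    next unless match
    totals[match[1]] += match[2].to_f
  end
  totals
end

# Made-up sample values, for illustration only.
sample = <<~METRICS
  gitlab_transaction_db_replica_count_total{controller="RootController",db_config_name="main_replica"} 11
  gitlab_transaction_db_replica_count_total{db_config_name="ci_replica"} 2
  gitlab_transaction_db_replica_count_total{db_config_name="ci_replica",action="index"} 3
  gitlab_transaction_db_primary_count_total{db_config_name="main"} 7
METRICS

p totals_by_db_config_name(sample, "gitlab_transaction_db_replica_count_total")
# => {"main_replica"=>11.0, "ci_replica"=>5.0}
```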
3.3. Rollout `use_model_load_balancing`
The purpose of this step is to actually roll out CI traffic to the dedicated CI replicas, using a dedicated feature flag that supports percentage rollout.
- Enable `0.01%` for `use_model_load_balancing` FF using ChatOps for staging
- Monitor all metrics from 3.2.
- Enable `1%` for `use_model_load_balancing`
- Enable `50%` for `use_model_load_balancing`
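The steps above follow a simple pattern: raise the percentage, monitor, and only then continue. A minimal sketch of that loop is shown below. Everything in it is hypothetical: `staged_rollout`, the `set_percentage` and `healthy` callables, and the choice to fall back to the last good percentage on failure are illustration-only assumptions, not how ChatOps or the feature flag works.

```ruby
# Percentages taken from the rollout steps above.
ROLLOUT_STEPS = [0.01, 1, 50].freeze

# Hypothetical driver: applies each percentage in order and stops at the first
# step whose health check fails, falling back to the last good percentage
# (or 0 when no step passed). Returns the last percentage that passed, or nil.
def staged_rollout(steps, set_percentage:, healthy:)
  last_good = nil
  steps.each do |percentage|
    set_percentage.call(percentage)
    unless healthy.call(percentage)
      set_percentage.call(last_good || 0) # assumption: roll back on failure
      return last_good
    end
    last_good = percentage
  end
  last_good
end
```

In practice `set_percentage` would wrap whatever mechanism changes the flag, and `healthy` would wrap the log and metric checks from 3.2.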