
The db-specific replica metrics are not properly stored in logs

Found in: #345118 (comment 749348860).

db_replica_main_count and db_replica_ci_count are always 0, regardless of the value of db_replica_count.

Troubleshooting

  1. Metrics scraping is implemented in lib/gitlab/metrics/subscribers/active_record.rb
  2. We increment metrics by calling def increment(counter, db_config_name:, db_role:)
  3. For a CI replica connection, the arguments passed are increment(counter, db_config_name: 'ci_replica', db_role: 'replica')
  4. This increments the key log_key = compose_metric_key(counter, db_role, db_config_name), e.g. db_replica_ci_replica_count
  5. Invoking load_balancing_metric_keys then returns the key as db_replica_ci_replica_count
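
The steps above can be reproduced with a minimal sketch. This is not the GitLab source: the body of compose_metric_key and the predefined_keys list are assumptions, chosen so that the payload is pre-filled with keys such as db_replica_ci_count, which is what the log example in this issue shows.

```ruby
# Hypothetical, simplified stand-in for the subscriber; not the actual GitLab code.
def compose_metric_key(metric, db_role = nil, db_config_name = nil)
  [:db, db_role, db_config_name, metric].compact.join('_')
end

# Assumption: the log payload is pre-filled with one key per role and database.
predefined_keys = %w[main ci].map { |db| compose_metric_key(:count, :replica, db) }

counters = Hash.new(0)
predefined_keys.each { |key| counters[key] = 0 }

# One query on the CI replica connection, with the arguments from step 3.
counters[compose_metric_key(:count, :replica)] += 1
counters[compose_metric_key(:count, :replica, 'ci_replica')] += 1

# Only the predefined keys (plus the per-role total) end up in the log.
logged = counters.slice('db_replica_count', *predefined_keys)
puts logged
```

Under these assumptions the increment lands on db_replica_ci_replica_count, while the logged db_replica_ci_count stays at 0 even though db_replica_count is 1.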

Proposal

db_role is largely redundant throughout this class, because db_config_name already describes all possible configurations. db_role is a leftover from the time when we had a single database and had to determine whether a connection was a primary or a replica.

Maybe we could change the metrics to the following form:

  • db_count = SUM of all metrics for all databases and all roles: uses metrics of the form db_#{db_role}_#{metric} (as is today)
  • db_(primary|replica)_count = SUM of all metrics for all databases, split by role: uses metrics of the form db_#{db_role}_#{metric} (as is today). I would also consider deprecating these.
  • db_(main|main_replica|ci|ci_replica)_count = per-database metrics: uses metrics of the form db_#{db_config_name}_#{metric}. This is the change.

Only the last one would change:

  • From today's db_(primary|replica)_(main|ci|main_replica|ci_replica)_count into db_(main|main_replica|ci|ci_replica)_count
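
A minimal sketch of the proposed naming, assuming a hypothetical helper proposed_metric_keys (it does not exist in the codebase and is only meant to illustrate which keys a single increment would touch):

```ruby
# Hypothetical helper illustrating the proposed key names; not existing GitLab code.
def proposed_metric_keys(metric, db_config_name:, db_role:)
  [
    "db_#{metric}",                   # total across all databases and roles
    "db_#{db_role}_#{metric}",        # per-role total (as is today)
    "db_#{db_config_name}_#{metric}"  # per-database key, without the role (the change)
  ]
end

keys = proposed_metric_keys(:count, db_config_name: 'ci_replica', db_role: 'replica')
puts keys
```

With db_config_name: 'ci_replica' the per-database key becomes db_ci_replica_count instead of today's db_replica_ci_replica_count.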

Example

I also see it in the development environment:

{"method":"GET","path":"/-/peek/results","format":"json","controller":"Peek::ResultsController","action":"show","status":200,"time":"2021-12-01T14:15:11.893Z","params":[{"key":"request_id","value":"01FNV5JN5G9S79KD3BSEZE0M5Z"}],"correlation_id":"01FNV5JV4EN7Q50AB3HVRRBJDM","meta.user":"root","meta.caller_id":"Peek::ResultsController#show","meta.remote_ip":"10.0.2.2","meta.client_id":"user/1","remote_ip":"10.0.2.2","user_id":1,"username":"root","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36","queue_duration_s":0.118461,"request_urgency":"default","target_duration_s":1,"redis_calls":5,"redis_duration_s":0.002277,"redis_read_bytes":399728,"redis_write_bytes":3107,"redis_cache_calls":2,"redis_cache_duration_s":0.001478,"redis_cache_read_bytes":399547,"redis_cache_write_bytes":2264,"redis_shared_state_calls":2,"redis_shared_state_duration_s":0.000488,"redis_shared_state_write_bytes":104,"redis_sessions_calls":1,"redis_sessions_duration_s":0.000311,"redis_sessions_read_bytes":181,"redis_sessions_write_bytes":739,"db_count":1,"db_write_count":0,"db_cached_count":0,"db_replica_count":1,"db_replica_main_count":0,"db_replica_ci_count":0,"db_replica_cached_count":0,"db_replica_main_cached_count":0,"db_replica_ci_cached_count":0,"db_replica_wal_count":0,"db_replica_main_wal_count":0,"db_replica_ci_wal_count":0,"db_replica_wal_cached_count":0,"db_replica_main_wal_cached_count":0,"db_replica_ci_wal_cached_count":0,"db_primary_count":0,"db_primary_main_count":0,"db_primary_ci_count":0,"db_primary_cached_count":0,"db_primary_main_cached_count":0,"db_primary_ci_cached_count":0,"db_primary_wal_count":0,"db_primary_main_wal_count":0,"db_primary_ci_wal_count":0,"db_primary_wal_cached_count":0,"db_primary_main_wal_cached_count":0,"db_primary_ci_wal_cached_count":0,"db_replica_duration_s":0.004,"db_replica_main_duration_s":0.0,"db_replica_ci_duration_s":0.0,"db_primary_duration_s":0.0,"db_primary_main_duration_s":0.0,"db_primary_ci_duration_s":0.0,"cpu_s":0.106572,"mem_objects":53950,"mem_bytes":5883824,"mem_mallocs":29254,"mem_total_bytes":8041824,"pid":83,"db_duration_s":0.0,"view_duration_s":0.00011,"duration_s":0.00721}

The snippet:

"db_replica_count":1,"db_replica_main_count":0,"db_replica_ci_count":0 


Edited by Kamil Trzciński