Skip to content

Expose load balancing DB metrics in Sidekiq logs

Matthias Käppler requested to merge 330643-primary-replica-db-duration into master

What does this MR do?

With this change we start logging db_primary_duration_s and db_replica_duration_s in Sidekiq logs, which are related to Sidekiq database load balancing, i.e. the time we spend in queries on the primary and replicas respectively.

We were already collecting these data, and are already publishing them for Puma, but due to allow-list logic in the Sidekiq instrumentation logger we weren't exposing these for SK.

We ended up deciding to not maintain this allow-list anymore; we generally want to make instrumentation metrics available in all loggers if we have them. The risk of leaking anything unintentionally from an instrumentation module seems low, since its sole purpose is to collect metrics meant to be published. As long as any metrics that do not apply and which might come back empty are filtered out gracefully (such as rack-attack, which wouldn't produce anything meaningful for sidekiq).

Screenshots (strongly suggested)

Structured JSON log output as emitted by a DB chaos job triggered via

[1] pry(main)> Chaos::DbSpinWorker.perform_async 10, 1

Before (or with Sidekiq LB disabled)

{
  "severity": "INFO",
  "time": "2021-05-18T11:55:49.259Z",
  "class": "Chaos::DbSpinWorker",
  "args": [
    "10",
    "1"
  ],
  "retry": 3,
  "queue": "chaos:chaos_db_spin",
  "backtrace": true,
  "version": 0,
  "queue_namespace": "chaos",
  "jid": "cdcabbda14a3aaefe6f5ca24",
  "created_at": "2021-05-18T11:55:39.043Z",
  "correlation_id": "f55c180ece2931568135967fd23453d5",
  "worker_data_consistency": "always",
  "enqueued_at": "2021-05-18T11:55:39.046Z",
  "pid": 6,
  "message": "Chaos::DbSpinWorker JID-cdcabbda14a3aaefe6f5ca24: done: 10.211164 sec",
  "job_status": "done",
  "scheduling_latency_s": 0.002807,
  "job_size_bytes": 325,
  "db_count": 10,
  "db_write_count": 0,
  "db_cached_count": 0,
  "cpu_s": 9.933235,
  "mem_objects": 7320507,
  "mem_bytes": 352012464,
  "mem_mallocs": 7320242,
  "duration_s": 10.211164,
  "completed_at": "2021-05-18T11:55:49.259Z",
  "db_duration_s": 0.008025
}

After (Sidekiq load-balancing enabled)

{
  "severity": "INFO",
  "time": "2021-05-18T11:53:43.480Z",
  "class": "Chaos::DbSpinWorker",
  "args": [
    "10",
    "1"
  ],
  "retry": 3,
  "queue": "chaos:chaos_db_spin",
  "backtrace": true,
  "version": 0,
  "queue_namespace": "chaos",
  "jid": "46f7afb8d56189379d7da8e0",
  "created_at": "2021-05-18T11:53:33.091Z",
  "correlation_id": "f1158bc47eb6c471706744175dbac610",
  "worker_data_consistency": "always",
  "enqueued_at": "2021-05-18T11:53:33.112Z",
  "pid": 6,
  "message": "Chaos::DbSpinWorker JID-46f7afb8d56189379d7da8e0: done: 10.366137 sec",
  "job_status": "done",
  "scheduling_latency_s": 0.001477,
  "job_size_bytes": 325,
  "db_count": 10,
  "db_write_count": 0,
  "db_cached_count": 0,
  "db_replica_count": 0,
  "db_replica_cached_count": 0,
  "db_replica_wal_count": 0,
  "db_primary_count": 10,
  "db_primary_cached_count": 0,
  "db_primary_wal_count": 0,
  "db_primary_duration_s": 0.018,
  "db_replica_duration_s": 0.0,
  "cpu_s": 9.915658,
  "mem_objects": 6881359,
  "mem_bytes": 330766472,
  "mem_mallocs": 6877749,
  "duration_s": 10.366137,
  "completed_at": "2021-05-18T11:53:43.480Z",
  "db_duration_s": 0.009456
}

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Related to #330643 (closed)

Edited by Bob Van Landuyt

Merge request reports