
Refine load_balancing_strategy metric label

What does this MR do?

Related to #333670 (closed)

We recently introduced database load-balancing support for Sidekiq workers. We currently observe its behavior via two metric labels:

  • worker_data_consistency: what the worker class defines in code (see docs and the sketch below)
  • load_balancing_strategy: what the server middleware decides to do

These labels are only available when load balancing is enabled.
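For reference, the worker_data_consistency label mirrors the class-level data_consistency declaration in the GitLab codebase. A minimal sketch follows; the worker name and body are made up, only the DSL call matters here:

```ruby
# Hypothetical worker; only the data_consistency declaration is relevant.
class SomeCleanupWorker
  include ApplicationWorker

  # Declares that this worker tolerates replication lag, so the load-balancing
  # middleware may route its reads to a replica.
  data_consistency :delayed

  def perform
    # ...
  end
end
```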

This MR changes two things:

  1. Break down the LB strategy into more cases. See the table below for the new value mapping. This will help us better measure LB adoption across all our jobs.
  2. Always set these values. Previously, some or all of these labels could be missing, for instance when a worker did not explicitly declare a data consistency setting. As long as load balancing is enabled in the application, we now always inject these values into the job hash so that they are picked up in application logs and Prometheus metrics (see the sketch after this list).
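A minimal sketch of the injection idea, not the actual middleware in this MR; the class name and the two helper calls are assumptions standing in for the real GitLab code:

```ruby
# Illustrative Sidekiq client middleware. load_balancing_enabled? and
# data_consistency_for are hypothetical helpers, not real GitLab methods.
class InjectLoadBalancingLabels
  def call(worker_class, job, _queue, _redis_pool)
    if load_balancing_enabled?
      # Persist the declared consistency (falling back to :always) in the job
      # hash, so logs and metrics always carry the label, even for workers
      # that never call the data_consistency DSL.
      job['worker_data_consistency'] ||= data_consistency_for(worker_class).to_s
    end

    yield
  end
end
```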

New values for load_balancing_strategy

Lifting this directly from #333670 (closed)

| data_consistency | load_balancing_strategy | description |
|---|---|---|
| :always | primary | LB N/A; data consistency not set or :always, FF disabled, or not an ApplicationWorker |
| :sticky | replica | At least one replica was ready |
| :sticky | primary | No replica was ready |
| :sticky | primary-no-wal | WAL location was not provided |
| :delayed | replica | At least one replica was ready on 1st attempt |
| :delayed | retry | No replica was ready on 1st attempt; retry the job |
| :delayed | replica-retried | At least one replica was ready on 2nd attempt |
| :delayed | primary | No replica ready on 2nd attempt |
| :delayed | primary-no-wal | WAL location was not provided |
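Put differently, the table boils down to a small decision function, roughly like the sketch below. This is not the MR's actual logic and the names are illustrative; also note that in the log output further down the values appear with underscores (e.g. primary_no_wal) rather than hyphens.

```ruby
# Rough translation of the table above into code.
def load_balancing_strategy(data_consistency, wal_location, replica_ready, first_attempt)
  return 'primary' if data_consistency.nil? || data_consistency == :always
  return 'primary-no-wal' if wal_location.nil?

  case data_consistency
  when :sticky
    replica_ready ? 'replica' : 'primary'
  when :delayed
    if replica_ready
      first_attempt ? 'replica' : 'replica-retried'
    else
      # No replica caught up yet: retry the job once, then fall back to the primary.
      first_attempt ? 'retry' : 'primary'
    end
  end
end
```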

Screenshots (strongly suggested)

I grepped job logs for some of these cases, though not all of them, since some depend on tricky scenarios such as encountering replication lag on the first attempt but not on subsequent ones. But as a smoke test, I ran:

docker logs -f gl-gck_sidekiq_1 | egrep --line-buffered '^{' | jq 'select(.job_status == "done" or .job_status == "fail") | { cls: .class, wdc: .worker_data_consistency, lbs: .load_balancing_strategy }'
JSON output

{
  "cls": "ExpireBuildArtifactsWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "ElasticIndexBulkCronWorker",
  "wdc": "delayed",
  "lbs": "replica"
}
{
  "cls": "Geo::SidekiqCronConfigWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "UpdateAllMirrorsWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "ElasticIndexBulkCronWorker",
  "wdc": "delayed",
  "lbs": "replica"
}
{
  "cls": "BuildHooksWorker",
  "wdc": "delayed",
  "lbs": "replica"
}
{
  "cls": "BuildHooksWorker",
  "wdc": "delayed",
  "lbs": "replica"
}
{
  "cls": "Chaos::CpuSpinWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "Geo::SidekiqCronConfigWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "UpdateAllMirrorsWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "ElasticIndexBulkCronWorker",
  "wdc": "delayed",
  "lbs": "replica"
}
{
  "cls": "ScheduleMergeRequestCleanupRefsWorker",
  "wdc": null,
  "lbs": "primary"
}
{
  "cls": "Geo::SidekiqCronConfigWorker",
  "wdc": null,
  "lbs": "primary"
}
{
  "cls": "UpdateAllMirrorsWorker",
  "wdc": null,
  "lbs": "primary"
}
{
  "cls": "UserStatusCleanup::BatchWorker",
  "wdc": null,
  "lbs": "primary"
}
{
  "cls": "ElasticIndexBulkCronWorker",
  "wdc": null,
  "lbs": "primary_no_wal"
}
{
  "cls": "IncidentManagement::IncidentSlaExceededCheckWorker",
  "wdc": null,
  "lbs": "primary"
}
{
  "cls": "Geo::SidekiqCronConfigWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "IncidentManagement::IncidentSlaExceededCheckWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "UpdateAllMirrorsWorker",
  "wdc": "always",
  "lbs": "primary"
}
{
  "cls": "ElasticIndexInitialBulkCronWorker",
  "wdc": "delayed",
  "lbs": "replica"
}

The most important change here is that jobs that do not declare data consistency, and therefore default to always, now always have these labels attached as well. The cases where worker_data_consistency came out as null were those where the client middleware did not execute, for example because a job was de-duplicated or pushed back onto the queue for another reason. I think that's fine and we can ignore those cases.

A Prometheus excerpt:

Screenshot_from_2021-06-30_15-26-52

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

