Fix double counting Sidekiq queue events [RUN ALL RSPEC] [RUN AS-IF-FOSS] (!64322) · Merge requests · GitLab.org / GitLab

Matthias Käppler requested to merge 333671-fix-double-count-sidekiq-queue-events into master Jun 17, 2021

What does this MR do?

In those circumstances where a Sidekiq job is enqueued in the future e.g. via perform_in, the Sidekiq middlware counting job enqueue events is being called twice: first for the initial call to perform_in (client-side), and again when Sidekiq actually queues it for execution after the elapsed delay (server-side).

This means that in Prometheus, the queuing counter for those jobs is effectively doubled.

We now add a new label scheduling (delayed|immediate) when the at field is encountered so that we can break up the metrics (and dashboards) based on it.

This will allow us to discern between jobs that are merely scheduled to run vs those that are about to run in dahboards. For instance, our query behind this graph can change to:

sum by (environment,tier,type,stage,shard,queue,feature_category,urgency,worker) (
        rate(sidekiq_enqueued_jobs_total{scheduling="immediate"}[5m])
      )

to filter out those jobs.

Screenshots (strongly suggested)

To test this I modified the Chaos::CpuSpinWorker to use data_consistency :delayed.

I verified that both queue events via perform_async and perform_in apply the correct label values. You can see this in the screenshot below in the instance label, since queue events from perform_in that execute in the future will be triggered server-side and hence be scraped from the sidekiq target, whereas the perform_async job I triggered through the /chaos/cpu_spin endpoint i.e. via the web app.

Note that I changed the label name from delayed=true|false to scheduling=delayed|immediate but that has no bearing on the functionality.

Does this MR meet the acceptance criteria?

Conformity

I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
I have added/updated documentation, or it's not needed. (Is documentation required?)
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
I have self-reviewed this MR per code review guidelines.
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have followed the style guides.
This change is backwards compatible across updates, or this does not apply.

Availability and Testing

I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
[-] I have tested this MR in all supported browsers, or it's not needed.
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.

Related to #333671 (closed)

Edited Jun 29, 2021 by Matthias Käppler

Fix double counting Sidekiq queue events [RUN ALL RSPEC] [RUN AS-IF-FOSS]

What does this MR do?

Screenshots (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Merge request reports