Fix double counting Sidekiq queue events [RUN ALL RSPEC] [RUN AS-IF-FOSS]
What does this MR do?
In those circumstances where a Sidekiq job is enqueued
in the future e.g. via perform_in
, the Sidekiq middlware counting
job enqueue events is being called twice: first for
the initial call to perform_in
(client-side), and again when Sidekiq
actually queues it for execution after the elapsed delay (server-side).
This means that in Prometheus, the queuing counter for those jobs is effectively doubled.
We now add a new label scheduling
(delayed|immediate
) when the at
field
is encountered so that we can break up the metrics
(and dashboards) based on it.
This will allow us to discern between jobs that are merely scheduled to run vs those that are about to run in dahboards. For instance, our query behind this graph can change to:
sum by (environment,tier,type,stage,shard,queue,feature_category,urgency,worker) (
rate(sidekiq_enqueued_jobs_total{scheduling="immediate"}[5m])
)
to filter out those jobs.
Screenshots (strongly suggested)
To test this I modified the Chaos::CpuSpinWorker to use data_consistency :delayed
.
I verified that both queue events via perform_async
and perform_in
apply the correct label values. You can see this in the screenshot below in the instance
label, since queue events from perform_in
that execute in the future will be triggered server-side and hence be scraped from the sidekiq
target, whereas the perform_async
job I triggered through the /chaos/cpu_spin
endpoint i.e. via the web app.
Note that I changed the label name from delayed=true|false to scheduling=delayed|immediate but that has no bearing on the functionality.
Does this MR meet the acceptance criteria?
Conformity
-
I have included changelog trailers, or none are needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides. -
This change is backwards compatible across updates, or this does not apply.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) - [-] I have tested this MR in all supported browsers, or it's not needed.
-
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
Related to #333671 (closed)