
Fix double counting Sidekiq queue events [RUN ALL RSPEC] [RUN AS-IF-FOSS]

What does this MR do?

When a Sidekiq job is enqueued in the future, e.g. via perform_in, the Sidekiq middleware that counts job enqueue events is called twice: first for the initial call to perform_in (client-side), and again when Sidekiq actually queues the job for execution after the delay has elapsed (server-side).

This means that in Prometheus, the queuing counter for those jobs is effectively doubled.
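A minimal sketch of the double count, assuming a simplified stand-in for the counting middleware. `CountingMiddleware` and `ExampleWorker` are hypothetical names, and a plain hash takes the place of the Prometheus counter; only the `call(worker_class, job, queue, redis_pool)` shape follows Sidekiq's client middleware convention.

```ruby
# Hypothetical stand-in for the middleware that counts enqueue events.
# A plain Hash replaces the real Prometheus counter.
class CountingMiddleware
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Same argument shape as a Sidekiq client middleware.
  def call(_worker_class, job, _queue, _redis_pool = nil)
    @counts[job['class']] += 1
    yield
  end
end

middleware = CountingMiddleware.new
job = { 'class' => 'ExampleWorker', 'args' => [] }

# First invocation: the initial perform_in call (client-side).
middleware.call('ExampleWorker', job, 'default') { :scheduled }

# Second invocation: Sidekiq queues the job once the delay has elapsed (server-side).
middleware.call('ExampleWorker', job, 'default') { :queued }

puts middleware.counts['ExampleWorker']
```

One logical job has passed through the middleware twice, so the counter for that worker ends up at twice the number of jobs actually scheduled.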

We now add a new label, scheduling (delayed|immediate), which is set to delayed when the at field is encountered, so that we can break up the metrics (and dashboards) based on it.
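The labelling rule could be sketched roughly as follows. This is an illustration rather than the code in this MR: `SchedulingLabel`, `LabelledEnqueueCounter` and `ExampleWorker` are hypothetical names, a hash keyed by label set replaces the Prometheus counter, and it assumes the at field is present only on the job payload seen during the initial perform_in call.

```ruby
# Hypothetical helper that derives the scheduling label from a job payload.
module SchedulingLabel
  DELAYED = 'delayed'
  IMMEDIATE = 'immediate'

  # Assumption: a payload carrying 'at' was enqueued to run in the future.
  def self.scheduling_label(job)
    job.key?('at') ? DELAYED : IMMEDIATE
  end
end

# Hypothetical counter keyed by label set, standing in for the real metric.
class LabelledEnqueueCounter
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  def call(_worker_class, job, queue, _redis_pool = nil)
    labels = {
      queue: queue,
      worker: job['class'],
      scheduling: SchedulingLabel.scheduling_label(job)
    }
    @counts[labels] += 1
    yield
  end
end

counter = LabelledEnqueueCounter.new

# perform_in: the payload carries an 'at' timestamp.
counter.call('ExampleWorker', { 'class' => 'ExampleWorker', 'at' => Time.now.to_f + 60 }, 'default') { nil }

# Later enqueue for execution: no 'at' field on the payload (assumed).
counter.call('ExampleWorker', { 'class' => 'ExampleWorker' }, 'default') { nil }

counter.counts.each { |labels, count| puts "#{labels[:scheduling]}: #{count}" }
```

Both events are still recorded, but they land in separate series, so a query can select only the immediate one.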

This will allow us to discern in dashboards between jobs that are merely scheduled to run and those that are about to run. For instance, our query behind this graph can change to:

sum by (environment,tier,type,stage,shard,queue,feature_category,urgency,worker) (
  rate(sidekiq_enqueued_jobs_total{scheduling="immediate"}[5m])
)

to filter out jobs that are merely scheduled.

Screenshots (strongly suggested)

To test this I modified the Chaos::CpuSpinWorker to use data_consistency :delayed.

I verified that queue events from both perform_async and perform_in apply the correct label values. You can see this in the instance label in the screenshot below: queue events from perform_in that execute in the future are triggered server-side and hence scraped from the sidekiq target, whereas I triggered the perform_async job through the /chaos/cpu_spin endpoint, i.e. via the web app.

Note that I changed the label name from delayed=true|false to scheduling=delayed|immediate but that has no bearing on the functionality.

Screenshot_from_2021-06-24_15-25-45

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Related to #333671 (closed)

Edited by Matthias Käppler
