Improve observability around background job scheduling latency
When a background sidekiq job is scheduled to execute, it is logged in structured logging with a job_status=start
event.
This log line does not include the amount of time it took for the job to be scheduled.
Having this field would make diagnosing certain issues much simpler.
Currently, I am using the following workaround using scripted fields in the Kibana Dev Tools Console:
GET /pubsub-sidekiq-inf-gprd*/_search?pretty
{
"query": {
"bool": {
"must": [
{
"match_all": {}
},
{
"match_phrase": {
"json.job_status": "start"
}
},
{
"range": {
"@timestamp": {
"gte": "now-24h",
"lte": "now"
}
}
}
],
"must_not": []
}
},
"script_fields": {
"queuing_time": {
"script": {
"lang": "painless",
"source": "(doc['json.time'].value - doc['json.enqueued_at'].value) / 1000"
}
}
},
"_source": [ "json.*" ],
"sort": [
{
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": "(doc['json.time'].value - doc['json.enqueued_at'].value) / 1000"
},
"order": "desc"
}
}
]
}
I have also been unable to find a Prometheus metric for this value. If we do not have one, we should. It should be a histogram with a much wider set of values than the default histogram bucket sizes we tend to use.
cc @bjk-gitlab for any knowledge on existing sidekiq latency metrics.