
Distinguish job failure in worker processing failures metric

What does this MR do?

Distinguish job failure in worker processing failures metric

With !4001 (merged) we added a few new metrics that show details of what's happening with the runner worker and worker slots.

One of these metrics is gitlab_runner_worker_processing_failures_total, which counts failures that occur while processing the worker.

Currently, that metric distinguishes only one specific failure type: no_free_executor. Some Runner executors can report, before a job is requested, whether they have the capacity to handle it. When no_free_executor is reported, requests to GitLab are abandoned until capacity is restored.

Everything else is lumped together under the other failure type.

With this commit we're adding the job_failure failure type. It makes it possible to distinguish processing errors that are job failures (which in many cases are an EXPECTED result) from anything else, which may suggest that something is going wrong in Runner's internal concurrency handling mechanism.

Why was this MR needed?

What's the best way to test this MR?

What are the relevant issue numbers?
