
The ProcessRequirementsReportsWorker retries on an unrecoverable error

I came here from looking at the error budgets for Certify.

The biggest cause of spend is the RequirementsManagement::ProcessRequirementsReportsWorker failing and being retried 3 times. This worker fails on about 30% of its runs with a Gitlab::Access::AccessDeniedError. Over the past 7 days we've spent about 8 hours of compute running jobs that cannot succeed.

| json.class | json.job_status | json.error_class | Count | Count % | Sum of json.duration_s |
| --- | --- | --- | --- | --- | --- |
| RequirementsManagement::ProcessRequirementsReportsWorker | done | Missing | 523,534 | 69.45% | 87,031.19 |
| RequirementsManagement::ProcessRequirementsReportsWorker | fail | Gitlab::Access::AccessDeniedError | 230,299 | 30.55% | 30,303.60 |

(https://log.gprd.gitlab.net/goto/44537aca483b44655c01d987e84e9fa9)

This error is raised from the ProcessTestReportsService. This kind of error does not warrant a Sidekiq retry, since it is unlikely to succeed on a subsequent run.
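
A minimal sketch of that change, assuming the worker body still roughly matches the current one (the rescue clause and the logging call are illustrative, not an agreed implementation): rescue the access error inside perform so the job is logged and dropped instead of retried.

```ruby
# Sketch of the worker, not the actual file contents.
module RequirementsManagement
  class ProcessRequirementsReportsWorker
    include ApplicationWorker

    def perform(build_id)
      Ci::Build.find_by_id(build_id).try do |build|
        RequirementsManagement::ProcessTestReportsService.new(build).execute
      end
    rescue Gitlab::Access::AccessDeniedError => error
      # The permission check is deterministic, so a retry cannot change the
      # outcome: log the failure and drop the job instead of re-raising.
      Gitlab::ErrorTracking.log_exception(error, build_id: build_id)
    end
  end
end
```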

Perhaps we could also skip scheduling this worker from the BuildFinishedWorker if it is going to be a no-op anyway?

Acceptance Criteria

There are two changes to make here:

  1. Don't retry this job if it's unlikely to succeed on subsequent attempts.
  2. Don't schedule this job in the first place if it won't succeed. A strategy for this is detailed in #338717 (comment 653146872):

So in this particular example, I think we could consider wrapping the 4 checks in a method that we can check before scheduling the job.

In our codebase, we have this kind of pattern using a nested Async class; perhaps we could do something similar here?
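
A rough sketch of that approach (class and method names such as should_run?, and the individual checks, are hypothetical placeholders rather than the current service code): the service exposes its preconditions as a single predicate, and a nested Async class only enqueues the worker when they pass.

```ruby
module RequirementsManagement
  class ProcessTestReportsService
    def initialize(build)
      @build = build
    end

    # Gather the checks that currently make the job fail or no-op into one
    # predicate that callers can consult before scheduling. The individual
    # checks here are illustrative stand-ins for the real ones in the service.
    def should_run?
      project = @build.project

      project.licensed_feature_available?(:requirements) &&
        @build.has_test_reports? &&
        Ability.allowed?(@build.user, :create_requirement_test_report, project)
    end

    def execute
      return unless should_run?

      # ... existing report processing ...
    end

    # Nested Async wrapper following the pattern mentioned in the comment:
    # it only enqueues the worker when the synchronous checks pass.
    class Async
      def initialize(service, build)
        @service = service
        @build = build
      end

      def execute
        return unless @service.should_run?

        RequirementsManagement::ProcessRequirementsReportsWorker.perform_async(@build.id)
      end
    end
  end
end
```

BuildFinishedWorker could then go through the Async wrapper (or an equivalent helper) instead of enqueueing the worker unconditionally, so jobs that are guaranteed to no-op or fail are never scheduled.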
