`Ci::Build.doom!` does not use state machine hooks
Problem
A customer reported an issue via Technical Support where one of their jobs using `resource_group` was stuck.
The distribution of job statuses for the pipeline was:
{'waiting_for_resource': 1, 'created': 0, 'pending': 0, 'running': 0, 'failed': 661, 'success': 2258, 'canceled': 1667, 'skipped': 13, 'manual': 0}
So 1 job was in the `waiting_for_resource` state.
While investigating, it appeared that one `failed` job was still holding the resource in the group. The `failed` job page was showing the message "There has been a structural integrity problem detected, please contact system administrator", which is related to `failure_reason: :data_integrity_failure`.
This failure reason is set when we call `Ci::Build#doom!` as a last resort to fail a job that doesn't get dropped normally using `Ci::Build#drop`.
Because `doom!` doesn't use the state machine, we don't call the code that frees the resource group, or any of the state machine hooks (e.g. `after_transition`, `before_transition`).
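To make the difference concrete, here is a minimal toy sketch in plain Ruby. It is not GitLab's implementation: `ToyJob`, `ToyResourceGroup`, `transition_to` and the `:script_failure` reason are hypothetical names used only for illustration, and the real `Ci::Build` state machine is considerably more involved. The sketch assumes only what is described above: `drop` goes through a transition that runs a hook freeing the resource, while `doom!` writes the failed state directly and skips the hook.

```ruby
# Toy model only: these classes are hypothetical and do not mirror GitLab's code.
class ToyResourceGroup
  attr_reader :holder

  def initialize
    @holder = nil
  end

  def assign(job)
    @holder = job
  end

  # Frees the resource, but only if the given job is the current holder.
  def release(job)
    @holder = nil if @holder.equal?(job)
  end
end

class ToyJob
  attr_reader :status, :failure_reason

  def initialize(resource_group)
    @resource_group = resource_group
    @status = :running
    @failure_reason = nil
    @resource_group.assign(self)
  end

  # Goes through the "state machine": changes the status, then runs the hook.
  def drop(reason)
    transition_to(:failed, reason)
  end

  # Bypasses the "state machine": writes the attributes directly, no hook runs.
  def doom!
    @status = :failed
    @failure_reason = :data_integrity_failure
  end

  private

  def transition_to(new_status, reason)
    @status = new_status
    @failure_reason = reason
    after_transition
  end

  # Stand-in for an after_transition hook that frees the resource.
  def after_transition
    @resource_group.release(self)
  end
end

group_a = ToyResourceGroup.new
dropped = ToyJob.new(group_a)
dropped.drop(:script_failure)

group_b = ToyResourceGroup.new
doomed = ToyJob.new(group_b)
doomed.doom!

puts "dropped: status=#{dropped.status}, resource held=#{!group_a.holder.nil?}"
puts "doomed:  status=#{doomed.status}, resource held=#{!group_b.holder.nil?}"
```

Both toy jobs end up `failed`, but only the dropped one releases its resource. The doomed job keeps holding it, which matches the stuck `waiting_for_resource` job seen in the customer's pipeline.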
We use `doom!` only in rare cases:
- If while assigning the job to a Runner we encounter an unrecoverable error
  - Example of occurrence in Sentry: https://sentry.gitlab.net/gitlab/gitlabcom/issues/2163274/
- If we can't fail a stuck job normally
Related issues
Freeing the resource group is not the only code that is not called on `doom!`. There are other instances:
- `BuildFinishedWorker` is not called.
  - A job may be stuck and have a job trace, but we won't archive it or fire build hooks.
  - The deployment doesn't get dropped.
- The job is not auto-retried. This may be an acceptable side effect.
- A ToDo does not get added to the merge request when the build fails.
- The status update does not trigger `PipelineProcessWorker`.
  - This may be fine most of the time, but if the failing job is the last one running in the pipeline, the pipeline status does not get updated and remains in the `running` state.
cc @calebcooper