'Reference not found' error for merge trains with delayed execution after merge
Problem
When retrying a failed child pipeline of a merge train pipeline, where the trigger job uses strategy: depend, the trigger job fails with 'Reference not found'.
The reference no longer exists because the train's MergeTrains::Car has already been removed (which happens when the train fails).
We may also see this user error for manual jobs in child pipelines. There are a few different calling classes in the gitlab.com logs and only some of them appear to be retries:
https://log.gprd.gitlab.net/app/r/s/0vBPd
Config
https://gitlab.com/allison.browne/child-train-retry
What causes it
We have one train ref per MR, and that train ref is cleaned up (deleted) when the pipeline is removed from the train, which happens when the pipeline is marked as failed or successful.
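The failure mode described here can be reproduced with a toy model. This is only an illustrative sketch: ToyRefStore, its methods, the MR iid 42, and the sha value are all hypothetical and do not correspond to GitLab's actual classes or data. It assumes nothing more than "one ref per MR, deleted when the pipeline leaves the train".

```ruby
# Toy model only: these names are hypothetical, not GitLab's API.
class ToyRefStore
  def initialize
    @refs = {}
  end

  def write(name, sha)
    @refs[name] = sha
  end

  def delete(name)
    @refs.delete(name)
  end

  # Mimics a lookup that fails once the ref has been deleted.
  def resolve(name)
    @refs.fetch(name) { raise KeyError, "Reference not found: #{name}" }
  end
end

store = ToyRefStore.new
train_ref = "refs/merge_requests/42/train" # assumed shape: one train ref per MR

store.write(train_ref, "abc123") # the MR joins the train
store.resolve(train_ref)         # the train pipeline runs against the ref
store.delete(train_ref)          # pipeline marked failed/success, ref cleaned up

retry_error = nil
begin
  store.resolve(train_ref)       # a later retry or manual job still needs the ref
rescue KeyError => e
  retry_error = e.message
end
```

Any job that starts after the delete step hits the same lookup failure, regardless of whether it is a retry or a manual job.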
Proposal
This may be as simple as removing the train reference cleanup line that runs when the train is deleted.
This is safe to do because the refs are cleaned up anyway, 14 days after a merge or when the merge request is closed.
2 Options
- If we want to allow delayed execution of child pipeline jobs off of a train that do not affect the ref status and mergeability (manual jobs, retries), then we need to be able to create more than one train ref per MR. We still need to disallow delayed job execution that would affect the pipeline status (since the correct approach would be to re-add it to the train).
- Step 1:
  - Disallow pipeline updates after the train is complete, unless the job will not affect the parent pipeline's status.
    - child pipeline retry with strategy: depend (disallow)
    - child pipeline retry without strategy: depend (allow)
    - child pipeline manual job that affects the pipeline status (disallow)? (confirm behavior)
    - child pipeline manual job that does not affect the pipeline status (allow)? (confirm behavior)
    - manual jobs on non-child pipelines? (confirm behavior)
- Step 2:
  - Introduce a new train ref that includes the commit sha
    - ref style: refs/merge_requests/#{merge_request.iid}/#{commit_sha}/train
    - Legacy train refs will still need to be supported for older pipelines and we will still see the error on those. Perhaps we can introduce a better error message?
    - We can either detect the legacy style refs via a regex or create both styles for 1 release and then remove the creation of the old style refs.
- Step 3:
  - Rework reference deletion on train removal
    - Remove reference cleanup when the train is removed or merged: https://gitlab.com/gitlab-org/gitlab/-/blob/79e2f6fbedbaf9bade0c8280643773a9b9414c2e/ee/app/models/merge_trains/car.rb#L27
    - Rework the logic in MergeRequests::CleanupRefWorker (which cleans up the ref only 14 days after the MR is closed) to delete train refs of both the new style, refs/merge_requests/#{merge_request.iid}/#{commit_sha}/train, and the old style.
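The ref naming, legacy detection via regex, and two-style cleanup selection from Steps 2 and 3 could look roughly like the sketch below. The TrainRefs module and its methods are hypothetical names, not existing GitLab code. The prefix is copied from the ref style in the proposal. The legacy ref is assumed to be the same path without the sha segment, and the sha is assumed to be 40 to 64 hex characters.

```ruby
# Hypothetical helpers for illustration; not part of the GitLab codebase.
module TrainRefs
  PREFIX = "refs/merge_requests" # copied from the proposed ref style

  # Assumed legacy shape: one ref per MR, no sha segment.
  LEGACY_PATTERN = %r{\A#{PREFIX}/(\d+)/train\z}
  # Proposed shape: one ref per MR and commit sha (assumed 40 to 64 hex chars).
  SHA_PATTERN = %r{\A#{PREFIX}/(\d+)/\h{40,64}/train\z}

  # Builds a ref in the proposed style.
  def self.build(iid, commit_sha)
    "#{PREFIX}/#{iid}/#{commit_sha}/train"
  end

  # Returns :legacy, :sha, or nil when the ref is not a train ref.
  def self.style(ref)
    return :legacy if LEGACY_PATTERN.match?(ref)
    return :sha if SHA_PATTERN.match?(ref)

    nil
  end

  # Selects the train refs of either style that belong to one MR,
  # which is what a reworked cleanup pass would need to delete.
  def self.for_merge_request(refs, iid)
    refs.select do |ref|
      m = LEGACY_PATTERN.match(ref) || SHA_PATTERN.match(ref)
      !m.nil? && m[1].to_i == iid
    end
  end
end

sha = "a" * 40
all_refs = [
  TrainRefs.build(7, sha),
  "refs/merge_requests/7/train",
  "refs/merge_requests/7/head",
  TrainRefs.build(8, sha)
]
to_delete = TrainRefs.for_merge_request(all_refs, 7)
```

Detecting the style by pattern keeps older pipelines readable without a data migration. The alternative from Step 2, writing both styles for one release, would not need the style check at all.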
Another option is to disallow all delayed job execution, but that depends on how customers want to use it.
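The two options differ mainly in the rule applied to a delayed job once the train is complete. The sketch below contrasts them. DelayedJob and DelayedExecutionPolicy are hypothetical names. The manual-job rule encodes the cases still marked "confirm behavior" in Step 1, so treat it as an assumption rather than settled behavior.

```ruby
# Hypothetical policy sketch; none of these names exist in GitLab.
DelayedJob = Struct.new(:kind, :strategy_depend, :affects_status, keyword_init: true)

module DelayedExecutionPolicy
  # Option 1, Step 1: allow only work that cannot change the parent pipeline status.
  def self.allowed_option_1?(job)
    case job.kind
    when :child_retry
      # A retry only feeds back into the parent when the trigger uses strategy: depend.
      !job.strategy_depend
    when :manual
      # Assumption: still marked "confirm behavior" in the list above.
      !job.affects_status
    else
      false
    end
  end

  # Option 2: reject every delayed execution once the train is complete.
  def self.allowed_option_2?(_job)
    false
  end
end

retry_with_depend = DelayedJob.new(kind: :child_retry, strategy_depend: true)
retry_without_depend = DelayedJob.new(kind: :child_retry, strategy_depend: false)
manual_non_blocking = DelayedJob.new(kind: :manual, affects_status: false)
```

Under option 1, retry_without_depend and manual_non_blocking would run and retry_with_depend would be rejected. Under option 2, all three would be rejected.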
Is this something customers want to be able to do?
I do see this error for child pipelines about 1500 times in the last week, which means customers are trying to run delayed job executions in child pipelines off of merge trains.
Preference is for option 1: we don't document that this is unsupported, so it reads as a bug. However, this is the more complex engineering effort.