Termination of an ephemeral pod can strand job retries until another job is given to the same ephemeral pod
Similar to, but distinct from, the failure case described in gitlab#390645 (closed), a Premium SaaS customer with 8,554 seats is reporting the following (Zendesk case, GitLab internal):
Bug behavior:
A job runs on an ephemeral pod, on a tagged runner deployed with the gitlab-operator. If the ephemeral pod is terminated, the job fails with a runner system failure. The job is retried automatically, but the retry stays stuck in the queue even though the runner is available. The stuck job proceeds to run only after another job (from the same project or from a different project) lands on the tagged runner's ephemeral pod.
Expected behavior:
The job should be retried immediately on the next available ephemeral pod.