Cancel pipelines running indefinitely
Follow up from #399214 (comment 1330978432)
Background
We identified some pipelines that were stuck in running status forever. Those pipelines usually had jobs that failed because of data integrity problems: #399214 (comment 1329294872)
We do not know what caused that. The current assumption is that infrastructure issues somehow left the data corrupted (in an inconsistent state), and the application did not know how to move these pipelines forward.
Timeouts did not help in this case, because the timeout is set on jobs, not on pipelines. No jobs were running at all, so nothing could progress for that pipeline.
We need to identify and cancel those pipelines to prevent #399214 (closed) from happening again.
Questions to answer for proposals
- How do we identify those pipelines? A simple query won't help: #399214 (comment 1330978432)
- We also need to consider the scenario where an old pipeline is retried
- Where should we identify those pipelines, and how do we cancel them?
Proposal A: Scheduled scan
A very simple and straightforward idea based on #399214 (comment 1330978432) is:
- A pipeline schedule that we run somewhere, maybe daily, scans all the "running" pipelines (potentially only on the default branch and `ruby2`), identifies and collects any that have been stuck in running for X hours, and cancels them.
- A pipeline stuck in running can be identified if the following duration is greater than X hours:
  - If the pipeline has never been retried:
    - Pipeline duration should be `now - created_at`
  - If the pipeline has been retried:
    - Pipeline duration should be `now - last_retried_job.created_at`
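The duration rule above can be sketched as follows. This is only an illustration, not an existing implementation: the `Pipeline` record, its field names, and the `find_stuck` helper are hypothetical, and the threshold X is passed in as a parameter because the proposal does not fix a value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Pipeline:
    # Hypothetical record; field names are assumptions for this sketch.
    id: int
    status: str
    created_at: datetime
    # created_at of the most recently retried job, or None if never retried.
    last_retried_job_created_at: Optional[datetime] = None


def stuck_duration(pipeline: Pipeline, now: datetime) -> timedelta:
    """Time since the pipeline was created or, if retried, since the last retried job was created."""
    start = pipeline.last_retried_job_created_at or pipeline.created_at
    return now - start


def find_stuck(pipelines: List[Pipeline], now: datetime, threshold: timedelta) -> List[Pipeline]:
    """Return the running pipelines whose duration exceeds the threshold (X hours)."""
    return [
        p for p in pipelines
        if p.status == "running" and stuck_duration(p, now) > threshold
    ]


now = datetime(2023, 4, 1, 12, 0, tzinfo=timezone.utc)
pipelines = [
    # Never retried, running for 30 hours.
    Pipeline(1, "running", now - timedelta(hours=30)),
    # Old pipeline, but a job was retried 2 hours ago.
    Pipeline(2, "running", now - timedelta(hours=30), now - timedelta(hours=2)),
    # Recently created.
    Pipeline(3, "running", now - timedelta(hours=1)),
    # Old, but no longer running.
    Pipeline(4, "success", now - timedelta(hours=100)),
]
stuck_ids = [p.id for p in find_stuck(pipelines, now, timedelta(hours=24))]
```

With a 24-hour threshold, only the first pipeline is selected: the second one is old, but its retry resets the clock, which is the retry scenario raised in the questions above.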
Proposal B: Triggered scan
Given that the pipelines we want to know about are stuck and will no longer change their status, we cannot rely on a specific event from the pipeline in question. So if we want to run this scan based on triggers, we need to trigger it on some other events.
Ideas are:
- Run this in the same pipeline from the scheduled maintenance pipelines
  - Pros: It's all self-contained. All projects configured with this schedule are protected, including JiHu
  - Cons: It makes the pipeline configurations even more complex, and it's against the spirit of &10172 (closed)
- In `triage-ops`, we collect all pipelines when they start running. We also listen for when those pipelines stop. We set a timeout of X hours; if we do not hear back that a pipeline stopped within X hours, we know it got stuck.
  - Pros: This probably opens up some other opportunities to monitor pipelines. I am not too sure yet, though.
  - Cons: Much more complex. To make this reliable, we need to persist the status of the running pipelines somewhere, or we can lose track of them. So far we don't have any persistence layer in `triage-ops`.
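To make the second idea more concrete, here is a minimal sketch of the start/stop tracking with a timeout. It does not reflect how `triage-ops` is actually built: the `PipelineWatch` class and its method names are hypothetical, and the in-memory dictionary stands in for the persistent storage that the cons above say would be needed.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


class PipelineWatch:
    """Tracks running pipelines and reports those that did not stop in time.

    Hypothetical sketch: state lives in memory only, so it is lost on restart.
    A reliable version would need to persist this state somewhere.
    """

    def __init__(self, timeout: timedelta) -> None:
        self.timeout = timeout
        self._started: Dict[int, datetime] = {}

    def on_start(self, pipeline_id: int, at: datetime) -> None:
        # Called when we learn that a pipeline started running.
        self._started[pipeline_id] = at

    def on_stop(self, pipeline_id: int) -> None:
        # Called when we learn that a pipeline stopped (any final status).
        self._started.pop(pipeline_id, None)

    def overdue(self, now: datetime) -> List[int]:
        # Pipelines we have not heard back from within the timeout.
        return sorted(
            pid for pid, at in self._started.items()
            if now - at > self.timeout
        )


t0 = datetime(2023, 4, 1, 0, 0, tzinfo=timezone.utc)
watch = PipelineWatch(timeout=timedelta(hours=6))
watch.on_start(101, t0)
watch.on_start(102, t0)
watch.on_stop(102)  # 102 finished normally, 101 never reported back
overdue_ids = watch.overdue(t0 + timedelta(hours=7))
```

Here pipeline 101 is reported as overdue once the 6-hour timeout has passed, while 102 is not, because its stop event was received. Something still has to call `overdue` periodically, so this variant also needs a timer or another event as its trigger.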