Spike: create a job in the `sync` stage and cancel the pipeline if this pipeline is expected to fail
This is a spike issue to research the idea proposed in https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/5805#note_1771943164.
Steps
1. Add a job in the `sync` stage, allowing it to collect the job names for the current pipeline.
2. Fetch master-broken incidents from the past x hours. See if any incident is still open and contains a job name that also appears in the current pipeline.
3. Fetch the latest finished `master` pipeline. If it failed, and the current pipeline contains a job that failed in that latest `master` pipeline, return `true`.
4. If the step above returns `true`, cancel the pipeline (see the Ruby sketch after this list).
- We must skip this job if the `pipeline:expedite` label is applied.
- This is currently designed for MR pipelines. I am still debating whether this job should also be added to `master` pipelines; perhaps not in the first iteration, but I want to aim for adding it eventually.
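As a starting point for the spike, here is a minimal Ruby sketch of steps 1, 3, and 4 (step 2, the incident lookup, is elided). The REST endpoints and `CI_*` variables are documented GitLab features; the script itself, the `PROJECT_TOKEN` variable, and all other naming are assumptions, not existing tooling:

```ruby
#!/usr/bin/env ruby
# frozen_string_literal: true

# Spike sketch: cancel the current pipeline when it contains a job that is
# already failing on master. Everything beyond the documented GitLab API
# endpoints and predefined CI variables is an assumption.

require 'json'
require 'net/http'
require 'uri'

API_URL     = ENV.fetch('CI_API_V4_URL')   # e.g. https://gitlab.com/api/v4
PROJECT_ID  = ENV.fetch('CI_PROJECT_ID')
PIPELINE_ID = ENV.fetch('CI_PIPELINE_ID')
TOKEN       = ENV.fetch('PROJECT_TOKEN')   # assumed: a token with `api` scope

def api(method, path, params = {})
  uri = URI("#{API_URL}/projects/#{PROJECT_ID}#{path}")
  uri.query = URI.encode_www_form(params) unless params.empty?
  klass = method == :post ? Net::HTTP::Post : Net::HTTP::Get
  request = klass.new(uri)
  request['PRIVATE-TOKEN'] = TOKEN
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  JSON.parse(response.body)
end

# Skip entirely when the MR is expedited (per the note above).
labels = ENV.fetch('CI_MERGE_REQUEST_LABELS', '').split(',')
exit 0 if labels.include?('pipeline:expedite')

# Step 1: collect the job names of the current pipeline.
# (Pagination is elided for brevity; real pipelines can have >100 jobs.)
current_jobs = api(:get, "/pipelines/#{PIPELINE_ID}/jobs", per_page: 100)
                 .map { |job| job['name'] }

# Step 2 (querying still-open master-broken incidents) is elided here.

# Step 3: fetch the latest finished master pipeline and, if it failed,
# collect the names of its failed jobs.
latest = api(:get, '/pipelines', ref: 'master', scope: 'finished', per_page: 1).first
exit 0 unless latest && latest['status'] == 'failed'

failed_jobs = api(:get, "/pipelines/#{latest['id']}/jobs", scope: 'failed', per_page: 100)
                .map { |job| job['name'] }

# Step 4: cancel the current pipeline when there is an overlap.
overlap = current_jobs & failed_jobs
unless overlap.empty?
  warn "Failing on master: #{overlap.join(', ')}; canceling this pipeline."
  api(:post, "/pipelines/#{PIPELINE_ID}/cancel")
  exit 1
end
```

Note that this job cancels its own pipeline, so it terminates itself as well; that seems acceptable since cancellation is the desired outcome.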
Why is this helpful
- Allows MR pipelines to fail faster, raising awareness of ongoing master-broken incidents and encouraging earlier fixes.
Challenge with flaky tests
- One might argue that flaky tests don't warrant failing everybody's MRs, since that causes a lot of distraction. However, this ensures that master-broken incidents are communicated ASAP, encouraging prompt triage and closure.
- This is also another use case for auto-quarantining flaky tests.
Challenge with predictive tests
- MR pipelines run selective tests, so even if a spec failed in master, not every MR pipeline runs that specific test. Due to this limitation, we should limit the new job to run only when the MR has the `pipeline:mr-approved` label, because the full pipeline will most likely create the same job and run the spec that failed in master. A minimal sketch of this gate follows.
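A hedged sketch of that gate, assuming it runs inside the same script as above; the guard itself is an assumption of this spike:

```ruby
# Assumed gate: only run the master-broken check when the full pipeline is
# expected, i.e. the MR carries the pipeline:mr-approved label; otherwise
# predictive test selection may never create the job that failed on master.
labels = ENV.fetch('CI_MERGE_REQUEST_LABELS', '').split(',')
exit 0 unless labels.include?('pipeline:mr-approved')
```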
Challenge with randomized job parallelization
- A spec could be allocated to any of the parallelized jobs, so a spec that fails in `rspec-unit-pg14-10/15` on master may show up in `rspec-unit-pg14-15/15` in the current pipeline, as an example. If the failed job is a parallelized node, we should perhaps do a few extra checks (e.g. comparing base job names, sketched below) to ensure the failure is still applicable to the current pipeline.
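One possible mitigation, sketched below under the assumption that parallel node names end in an `N/M` suffix; the helper name is hypothetical:

```ruby
# Hypothetical helper: strip the trailing "N/M" parallel-node suffix so any
# node of a parallelized job matches any other node of the same job.
def base_job_name(name)
  name.sub(%r{[-\s]\d+/\d+\z}, '')
end

base_job_name('rspec-unit-pg14-10/15') == base_job_name('rspec-unit-pg14-15/15')
# => true (both reduce to "rspec-unit-pg14")
```

With this, the overlap check in the main sketch would compare `base_job_name` values instead of raw job names.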