Create "Future of CI Pipeline Processing" design doc

Merged Furkan Ayhan requested to merge fa/ci-pipeline-processing-blueprint into master
---
status: proposed
creation-date: "2023-05-15"
authors: [ "@furkanayhan" ]
coach: "@ayufan"
approvers: [ "@jreporter", "@cheryl.li" ]
owning-stage: "~devops::verify"
participating-stages: []
---
# Future of CI Pipeline Processing
## Summary
In GitLab CI, the current architecture and behavior of pipeline processing have several problems.
These problems confuse users, make pipeline processing hard to understand, lead to unexpected and complex
behaviors, and make it hard to implement new features. In this blueprint, we discuss the problems and propose
a new architecture for pipeline processing.
Most of these problems have been discussed before in the
["Restructure CI job when keyword"](https://gitlab.com/groups/gitlab-org/-/epics/6788) epic.
## Motivation
The list of problems is the main motivation for this blueprint.
### Problem 1: The responsibility of the `when` keyword
Right now, the [`when`](../../../ci/yaml/index.md#when) keyword has many responsibilities:
> - `on_success` (default): Run the job only when no jobs in earlier stages fail or have `allow_failure: true`.
> - `on_failure`: Run the job only when at least one job in an earlier stage fails. A job in an earlier stage
> with `allow_failure: true` is always considered successful.
> - `never`: Don't run the job regardless of the status of jobs in earlier stages.
> Can only be used in a [`rules`](../../../ci/yaml/index.md#rules) section or `workflow: rules`.
> - `always`: Run the job regardless of the status of jobs in earlier stages. Can also be used in `workflow:rules`.
> - `manual`: Run the job only when [triggered manually](../../../ci/jobs/job_control.md#create-a-job-that-must-be-run-manually).
> - `delayed`: [Delay the execution of a job](../../../ci/jobs/job_control.md#run-a-job-after-a-delay)
> for a specified duration.

It answers three questions:

- What's required to run? => `on_success`, `on_failure`, `always`
- How to run? => `manual`, `delayed`
- Add to the pipeline? => `never`

As a result, we cannot, for example, create a `manual` job with `when: on_failure`.
This would be useful when a user wants a job that is available only on failure but must be played manually,
for example, to publish failures to a dedicated page or a dedicated external service.
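
The syntax proposed later in this document could close this gap. The following is only a sketch under that assumption: `when:on` and `when:start_in` are proposed keywords that GitLab does not accept today, and the job name and script are hypothetical.

```yaml
# Hypothetical job: available only after a failure, and started manually.
publish_failures:
  stage: deploy
  script: ./publish-failures.sh # placeholder script, not part of the original
  when:
    on: failure      # what's required to run?
    start_in: manual # how to run?
```

Splitting the two questions into separate keys is what allows the combination. With the single `when` value used today, only one of `on_failure` or `manual` can be chosen.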
### Problem 2: Abuse of the `allow_failure` keyword
We control the blocking behavior of a manual job with the [`allow_failure`](../../../ci/yaml/index.md#allow_failure) keyword.
However, this keyword has another responsibility: _"determine whether a pipeline should continue running when a job fails"_.

Currently, a [manual job](../../../ci/jobs/job_control.md#create-a-job-that-must-be-run-manually):

- is not a blocker when it has `allow_failure: true` (the default).
- is a blocker when it has `allow_failure: false`.

As a result, we cannot, for example, create a `manual` job that has `allow_failure: false` and is not a blocker.
```yaml
job1:
  stage: test
  when: manual
  allow_failure: true # default

job2:
  stage: deploy
```

Currently:

- `job1` is skipped.
- `job2` runs because `job1` is ignored since it has `allow_failure: true`.
- When we run/play `job1`:
  - if it fails, it's marked as "success with warning".
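
With the syntax proposed later in this document, the two concerns could be declared separately. This is a sketch only: `when:start_in` and `output:manual_block` are proposed keywords that are not valid in GitLab today, and the script is a placeholder.

```yaml
# Hypothetical: a manual job that does not block the pipeline,
# but whose failure is still reported as a failure when it is played.
job1:
  stage: test
  script: exit 0
  when:
    on: success
    start_in: manual
  output:
    manual_block: false  # not a blocker
    allow_failure: false # a failure is not downgraded to "success with warning"
```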
#### `allow_failure` with `rules`
`allow_failure` becomes more confusing when using `rules`.
From [docs](../../../ci/yaml/index.md#when):
> The default behavior of `allow_failure` changes to true with `when: manual`.
> However, if you use `when: manual` with `rules`, `allow_failure` defaults to `false`.
From [docs](../../../ci/yaml/index.md#allow_failure):
> The default value for `allow_failure` is:
>
> - `true` for manual jobs.
> - `false` for jobs that use `when: manual` inside `rules`.
> - `false` in all other cases.

For example:

```yaml
job1:
  script: ls
  when: manual

job2:
  script: ls
  rules:
    - if: $ALWAYS_TRUE
      when: manual
```

`job1` and `job2` behave differently:

- `job1` is not a blocker because it has `allow_failure: true` by default.
- `job2` is a blocker because `rules: when: manual` does not set `allow_failure: true` by default.
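
Until the defaults are unified, the difference can be removed by setting `allow_failure` explicitly inside the rule. A minimal sketch, reusing the job and the `$ALWAYS_TRUE` variable from the example above:

```yaml
job2:
  script: ls
  rules:
    - if: $ALWAYS_TRUE
      when: manual
      allow_failure: true # set explicitly, so job2 is no longer a blocker
```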
### Problem 3: Different behaviors in DAG/needs
The main behavioral difference between DAG and STAGE is about the "skipped" and "ignored" states.

**Background information:**

- skipped:
  - When a job is `when: on_success` and its previous status is failed, it's skipped.
  - When a job is `when: on_failure` and its previous status is not "failed", it's skipped.
- ignored:
  - When a job is `when: manual` with `allow_failure: true`, it's ignored.

**Problem:**
The `skipped` and `ignored` states are considered successful in the STAGE processing but not in the DAG processing.
#### Problem 3.1. Handling of ignored status with manual jobs
**Example 1:**
```yaml
build:
  stage: build
  script: exit 0
  when: manual
  allow_failure: true # by default

test:
  stage: test
  script: exit 0
  needs: [build]
```
- `build` is ignored (skipped) because it's `when: manual` with `allow_failure: true`.
- `test` is skipped because "ignored" is not a successful state in the DAG processing.
**Example 2:**
```yaml
build:
  stage: build
  script: exit 0
  when: manual
  allow_failure: true # by default

test:
  stage: test
  script: exit 0
```
- `build` is ignored (skipped) because it's `when: manual` with `allow_failure: true`.
- `test` runs and succeeds.
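
The two examples can be combined into one pipeline to see the difference side by side. This is a sketch based on the behavior described above; the job names `test_dag` and `test_stage` are illustrative.

```yaml
build:
  stage: build
  script: exit 0
  when: manual
  allow_failure: true # by default

# DAG processing: skipped, because "ignored" is not a successful state here.
test_dag:
  stage: test
  script: exit 0
  needs: [build]

# STAGE processing: runs, because "ignored" counts as successful here.
test_stage:
  stage: test
  script: exit 0
```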
#### Problem 3.2. Handling of skipped status with when: on_failure
**Example 1:**
```yaml
build_job:
  stage: build
  script: exit 1

test_job:
  stage: test
  script: exit 0

rollback_job:
  stage: deploy
  needs: [build_job, test_job]
  script: exit 0
  when: on_failure
```
- `build_job` runs and fails.
- `test_job` is skipped.
- Even though `rollback_job` is `when: on_failure` and there is a failed job, it is skipped because the `needs` list has a "skipped" job.
**Example 2:**
```yaml
build_job:
  stage: build
  script: exit 1

test_job:
  stage: test
  script: exit 0

rollback_job:
  stage: deploy
  script: exit 0
  when: on_failure
```
- `build_job` runs and fails.
- `test_job` is skipped.
- `rollback_job` runs because an earlier job failed.
### Problem 4: The skipped and ignored states
Let's assume that we solved Problem 3 and that the "skipped" and "ignored" states no longer differ between DAG and STAGE.
How should they behave in general? Are they successful or not? Should "skipped" and "ignored" be different?
Should we introduce new `when` conditions, such as `when: on: skipped/ignored/manual`?
#### Problem 4.1. The ignored status with manual jobs
With the newly proposed syntax:

```yaml
build:
  stage: build
  script: exit 0
  when:
    on: success
    start_in: manual
  output:
    manual_block: false

test:
  stage: test
  script: exit 0
```
- `build` is in the "manual" state but considered as "skipped" (ignored) for the pipeline processing.
- `test` runs because "skipped" is a successful state.

Alternatively:

```yaml
build1:
  stage: build
  script: exit 0
  when:
    on: success
    start_in: manual
  output:
    manual_block: false

build2:
  stage: build
  script: exit 0

test:
  stage: test
  script: exit 0
```
- `build1` is in the "manual" state but considered as "skipped" (ignored) for the pipeline processing.
- `build2` runs and succeeds.
- `test` runs because "success" + "skipped" is a successful state.
#### Problem 4.2. The skipped status with when: on_failure
With the newly proposed syntax:

```yaml
build:
  stage: build
  script: exit 0
  when:
    on: failure

test:
  stage: test
  script: exit 0
```
- `build` is skipped because it's `when: on_failure` and its previous status is not "failed".
- `test` runs because "skipped" is a successful state.

Alternatively:

```yaml
build1:
  stage: build
  script: exit 0
  when:
    on: failure

build2:
  stage: build
  script: exit 0

test:
  stage: test
  script: exit 0
```
- `build1` is skipped because it's `when: on_failure` and its previous status is not "failed".
- `build2` runs and succeeds.
- `test` runs because "success" + "skipped" is a successful state.
### Problem 5: The `dependencies` keyword
The [`dependencies`](../../../ci/yaml/index.md#dependencies) keyword defines a list of jobs to fetch
[artifacts](../../../ci/yaml/index.md#artifacts) from. It shares this responsibility with the `needs` keyword.
Moreover, both keywords can be used together in the same job. We may not need to discuss all possible scenarios, but this example
is enough to show the confusion:
```yaml
test2:
  script: exit 0
  dependencies: [test1]
  needs:
    - job: test1
      artifacts: false
```
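
For comparison, the intent can be stated without ambiguity today by using `needs` alone and setting `artifacts` explicitly. A minimal sketch with illustrative job names, assuming a `test1` job is defined elsewhere in the pipeline:

```yaml
# Run after test1 and download its artifacts.
test2_with_artifacts:
  script: exit 0
  needs:
    - job: test1
      artifacts: true

# Run after test1, but do not download its artifacts.
test2_without_artifacts:
  script: exit 0
  needs:
    - job: test1
      artifacts: false
```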
### Information 1: Canceled jobs
Are a canceled job and a failed job the same? They have many differences, so we could easily say "no".
However, they have one similarity: they can be "allowed to fail".

Let's define their differences first:

- A canceled job:
  - It is not a finished job.
  - Cancellation is a user-requested interruption of the job. The intent is to abort the job or stop pipeline processing as soon as possible.
  - We don't know the result, there are no artifacts, and so on.
  - Since it's never run, the `after_script` is not run.
  - Its eventual state is "canceled", so no job can run after it.
    - There is no `when: on_canceled`.
    - Even `when: always` is not run.
- A failed job:
  - It is a machine response of the CI system to executing the job content. It indicates that execution failed for some reason.
  - It is as valid an answer from the system as success. Whether a failure is a problem is relative;
    it might be the desired outcome of CI execution, for example when running tests in which some are failing.
  - We know the result and [there can be artifacts](../../../ci/yaml/index.md#artifactswhen).
  - `after_script` is run.
  - Its eventual state is "failed", so subsequent jobs can run depending on their `when` values.
    - `when: on_failure` and `when: always` are run.

**The one similarity is that they can be "allowed to fail".**

```yaml
build:
  stage: build
  script: sleep 10
  allow_failure: true

test:
  stage: test
  script: exit 0
  when: on_success
```
- If `build` runs and gets `canceled`, then `test` runs.
- If `build` runs and gets `failed`, then `test` runs.
### Information 2: Empty state
We [recently updated](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/117856) the documentation of
[the `when` keyword](../../../ci/yaml/index.md#when) for clarification;
> - `on_success`: Run the job only when no jobs in earlier stages fail or have `allow_failure: true`.
> - `on_failure`: Run the job only when at least one job in an earlier stage fails.

For example:

```yaml
test1:
  when: on_success
  script: exit 0
  # needs: [] would lead to the same result

test2:
  when: on_failure
  script: exit 0
  # needs: [] would lead to the same result
```

- `test1` runs because no job failed in the previous stages.
- `test2` does not run because no job failed in the previous stages.

`on_success` means that "nothing failed"; it does not mean that everything succeeded.
The same goes for `on_failure`: it does not mean that everything failed, but that "something failed".
These semantics follow from the expectation that your pipeline succeeds, which is the happy path,
not that your pipeline fails, because a failure requires user intervention to fix.
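
A pipeline with a mixed result illustrates the "nothing failed" and "something failed" semantics. This is a minimal sketch with hypothetical job names:

```yaml
stages: [build, test]

build_ok:
  stage: build
  script: exit 0

build_broken:
  stage: build
  script: exit 1

# Does not run: one earlier job failed, so "nothing failed" is not true.
deploy:
  stage: test
  script: exit 0
  when: on_success

# Runs: "something failed", even though build_ok succeeded.
notify:
  stage: test
  script: exit 0
  when: on_failure
```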
## Goals
- The `allow_failure` keyword must only be responsible for marking failed jobs as "success with warning".
  - This also means that canceled jobs must not be marked as "success with warning".
- The `when` keyword must only answer the question "What's required to run?". It must also be the only source of truth
  for deciding if a job should run or not.
- The "skipped" and "ignored" states must be reconsidered.
- The "never" condition must be reconsidered.
- A new keyword structure must be introduced to specify if a job is an "automatic", "manual", or "delayed" job.
- The `needs` keyword must only control the order of the jobs. It must not be used to control the behavior of the jobs
or to decide if a job should run or not.
- The DAG and STAGE behaviors must be the same.
- The `needs` and `dependencies` keywords must not be used together in the same job.
## Non-Goals
We will not discuss how to avoid breaking changes for now.
## Proposal
We discussed some alternative solutions in
the ["Restructure CI job when keyword"](https://gitlab.com/groups/gitlab-org/-/epics/6788) epic. Here, I'd like to
summarize the final proposal.

- `when:on`: Define state requirements to run a job.
  - Can be a string or an array of strings.
  - Possible values: `success`, `failed`, `always`.
- `when:start_in`:
  - Possible values: `immediately`, `manual`, `<time>` (delayed).
  - Alternative syntax: `start: immediately`, `start: manually`, `start: in <time>`.
- `output:manual_block`: Define if a manual job is a blocker or not.
  - If this is `true`, then no subsequent job can run until this job is finished.
  - If this is `false`, then the manual job will be "skipped" and subsequent jobs with `when: on_skipped` can run.

```yaml
test1:
  when:
    on: success # default
    start_in: immediately # default
  output:
    allow_failure: false # default

test2:
  when:
    on: success # default
    start_in: manual
  output:
    manual_block: false # default
    allow_failure:
      exit_codes: [1,2,3]

test3:
  when:
    on: success # default
    start_in: manual
  output:
    manual_block: true

test4:
  when:
    on: success # default
    start_in: 1 hour

test5:
  needs: [test2]

test6:
  needs: [test2]
  run:
    when: [on_success, on_skipped]
```
## Design and implementation details
N/A
## Alternative Solutions
N/A