Furkan Ayhan · Furkan Ayhan · f10992fe · d0528eac · 80b1107a · fa626f69
--- a/doc/architecture/blueprints/ci_pipeline_processing/index.md 0 → 100644

+ 473

− 0
+++ b/doc/architecture/blueprints/ci_pipeline_processing/index.md 0 → 100644

+---
+status: proposed
+creation-date: "2023-05-15"
+authors: [ "@furkanayhan" ]
+coach: "@ayufan"
+approvers: [ "@jreporter", "@cheryl.li" ]
+owning-stage: "~devops::verify"
+participating-stages: []
+---
+
+# Future of CI Pipeline Processing
+
+## Summary
+
+The current architecture and behavior of pipeline processing in GitLab CI have several problems.
+They make pipeline processing hard for users to understand, lead to unexpected and complex
+behaviors, and make it hard to implement new features. In this blueprint, we discuss these problems and propose
+a new architecture for pipeline processing.
+
+Most of these problems have been discussed before in the
+["Restructure CI job when keyword"](https://gitlab.com/groups/gitlab-org/-/epics/6788) epic.
+
+## Motivation
+
+The problems listed below are the main motivation for this blueprint.
+
+### Problem 1: The responsibility of the `when` keyword
+
+Right now, the [`when`](../../../ci/yaml/index.md#when) keyword has many responsibilities:
+
+> - `on_success` (default): Run the job only when no jobs in earlier stages fail or have `allow_failure: true`.
+> - `on_failure`: Run the job only when at least one job in an earlier stage fails. A job in an earlier stage
+>   with `allow_failure: true` is always considered successful.
+> - `never`: Don't run the job regardless of the status of jobs in earlier stages.
+>   Can only be used in a [`rules`](../../../ci/yaml/index.md#rules) section or `workflow: rules`.
+> - `always`: Run the job regardless of the status of jobs in earlier stages. Can also be used in `workflow:rules`.
+> - `manual`: Run the job only when [triggered manually](../../../ci/jobs/job_control.md#create-a-job-that-must-be-run-manually).
+> - `delayed`: [Delay the execution of a job](../../../ci/jobs/job_control.md#run-a-job-after-a-delay)
+>   for a specified duration.
+
+It answers three questions:
+
+- What's required to run? => `on_success`, `on_failure`, `always`
+- How to run? => `manual`, `delayed`
+- Add to the pipeline? => `never`
+
+As a result, we cannot, for example, create a `manual` job with `when: on_failure`.
+Such a job would be useful when a user wants a job that is only available on failure but must be played manually,
+for example to publish failures to a dedicated page or a dedicated external service.
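+
+To make the conflict concrete, here is a minimal sketch in the current syntax. The job name, stage, and script are hypothetical; the point is only that `when` takes a single value, so the two intents compete for the same key:
+
+```yaml
+# Hypothetical job: should be offered only after a failure AND be played manually.
+publish_failures:
+  stage: report                   # assumed stage name
+  script: ./publish-failures.sh   # assumed script
+  when: on_failure                # starts automatically after a failure; cannot also be manual
+  # when: manual                  # manual instead, but then the failure condition is lost
+```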
+
+### Problem 2: Abuse of the `allow_failure` keyword
+
+We control whether a manual job is a blocker with the [`allow_failure`](../../../ci/yaml/index.md#allow_failure) keyword.
+However, the keyword has another responsibility: _"determine whether a pipeline should continue running when a job fails"_.
+
+Currently, a [manual job](../../../ci/jobs/job_control.md#create-a-job-that-must-be-run-manually):
+
+- is not a blocker when it has `allow_failure: true` (the default).
+- is a blocker when it has `allow_failure: false`.
+
+As a result, we cannot, for example, create a `manual` job that has `allow_failure: false` and is not a blocker.
+
+```yaml
+job1:
+  stage: test
+  when: manual
+  allow_failure: true # default
+
+job2:
+  stage: deploy
+```
+
+Currently:
+
+- `job1` is skipped.
+- `job2` runs because `job1` is ignored since it has `allow_failure: true`.
+- When we run/play `job1`:
+  - if it fails, it's marked as "success with warning".
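+
+For contrast, a sketch of the blocking variant, reusing the same job names. Only `allow_failure` changes; the `script` lines are placeholders:
+
+```yaml
+job1:
+  stage: test
+  script: exit 0         # placeholder
+  when: manual
+  allow_failure: false   # turns the manual job into a blocker
+
+job2:
+  stage: deploy
+  script: exit 0         # placeholder
+```
+
+Here `job2` does not run until `job1` is played and succeeds. There is no way to keep `allow_failure: false` for failure handling while still letting `job2` proceed.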
+
+#### `allow_failure` with `rules`
+
+`allow_failure` becomes more confusing when using `rules`.
+
+From [docs](../../../ci/yaml/index.md#when):
+
+> The default behavior of `allow_failure` changes to true with `when: manual`.
+> However, if you use `when: manual` with `rules`, `allow_failure` defaults to `false`.
+
+From [docs](../../../ci/yaml/index.md#allow_failure):
+
+> The default value for `allow_failure` is:
+>
+> - `true` for manual jobs.
+> - `false` for jobs that use `when: manual` inside `rules`.
+> - `false` in all other cases.
+
+For example:
+
+```yaml
+job1:
+  script: ls
+  when: manual
+
+job2:
+  script: ls
+  rules:
+    - if: $ALWAYS_TRUE
+      when: manual
+```
+
+`job1` and `job2` behave differently:
+
+- `job1` is not a blocker because it has `allow_failure: true` by default.
+- `job2` is a blocker because `rules: when: manual` does not set `allow_failure: true` by default.
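+
+With the current syntax, the two jobs can be aligned only by setting `allow_failure` explicitly. A sketch using `rules:allow_failure`:
+
+```yaml
+job2:
+  script: ls
+  rules:
+    - if: $ALWAYS_TRUE
+      when: manual
+      allow_failure: true # explicit, because the `rules` form defaults to `false`
+```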
+
+### Problem 3: Different behaviors in DAG/needs
+
+The main behavioral difference between DAG and STAGE is about the "skipped" and "ignored" states.
+
+**Background information:**
+
+- skipped:
+  - When a job is `when: on_success` and its previous status is failed, it's skipped.
+  - When a job is `when: on_failure` and its previous status is not "failed", it's skipped.
+- ignored:
+  - When a job is `when: manual` with `allow_failure: true`, it's ignored.
+
+**Problem:**
+
+The `skipped` and `ignored` states are considered successful in the STAGE processing but not in the DAG processing.
+
+#### Problem 3.1. Handling of ignored status with manual jobs
+
+**Example 1:**
+
+```yaml
+build:
+  stage: build
+  script: exit 0
+  when: manual
+  allow_failure: true # by default
+
+test:
+  stage: test
+  script: exit 0
+  needs: [build]
+```
+
+- `build` is ignored (skipped) because it's `when: manual` with `allow_failure: true`.
+- `test` is skipped because "ignored" is not a successful state in the DAG processing.
+
+**Example 2:**
+
+```yaml
+build:
+  stage: build
+  script: exit 0
+  when: manual
+  allow_failure: true # by default
+
+test:
+  stage: test
+  script: exit 0
+```
+
+- `build` is ignored (skipped) because it's `when: manual` with `allow_failure: true`.
+- `test` runs and succeeds.
+
+#### Problem 3.2. Handling of skipped status with when: on_failure
+
+**Example 1:**
+
+```yaml
+build_job:
+  stage: build
+  script: exit 1
+
+test_job:
+  stage: test
+  script: exit 0
+
+rollback_job:
+  stage: deploy
+  needs: [build_job, test_job]
+  script: exit 0
+  when: on_failure
+```
+
+- `build_job` runs and fails.
+- `test_job` is skipped.
+- Even though `rollback_job` is `when: on_failure` and there is a failed job, it is skipped because the `needs` list has a "skipped" job.
+
+**Example 2:**
+
+```yaml
+build_job:
+  stage: build
+  script: exit 1
+
+test_job:
+  stage: test
+  script: exit 0
+
+rollback_job:
+  stage: deploy
+  script: exit 0
+  when: on_failure
+```
+
+- `build_job` runs and fails.
+- `test_job` is skipped.
+- `rollback_job` runs because an earlier job failed.
+
+### Problem 4: The skipped and ignored states
+
+Let's assume that we solved Problem 3 and the "skipped" and "ignored" states no longer differ between DAG and STAGE.
+How should they behave in general? Are they successful or not? Should "skipped" and "ignored" be different?
+Should we introduce new `when` conditions, such as `when: on: skipped/ignored/manual`?
+
+#### Problem 4.1. The ignored status with manual jobs
+
+With the newly proposed syntax:
+
+```yaml
+build:
+  stage: build
+  script: exit 0
+  when:
+    on: success
+    start_in: manual
+  output:
+    manual_block: false
+
+test:
+  stage: test
+  script: exit 0
+```
+
+- `build` is in the "manual" state but considered as "skipped" (ignored) for the pipeline processing.
+- `test` runs because "skipped" is a successful state.
+
+Alternatively:
+
+```yaml
+build1:
+  stage: build
+  script: exit 0
+  when:
+    on: success
+    start_in: manual
+  output:
+    manual_block: false
+
+build2:
+  stage: build
+  script: exit 0
+
+test:
+  stage: test
+  script: exit 0
+```
+
+- `build1` is in the "manual" state but considered as "skipped" (ignored) for the pipeline processing.
+- `build2` runs and succeeds.
+- `test` runs because "success" + "skipped" is a successful state.
+
+#### Problem 4.2. The skipped status with when: on_failure
+
+With the newly proposed syntax:
+
+```yaml
+build:
+  stage: build
+  script: exit 0
+  when:
+    on: failure
+
+test:
+  stage: test
+  script: exit 0
+```
+
+- `build` is skipped because it's `when: on_failure` and its previous status is not "failed".
+- `test` runs because "skipped" is a successful state.
+
+Alternatively:
+
+```yaml
+build1:
+  stage: build
+  script: exit 0
+  when:
+    on: failure
+
+build2:
+  stage: build
+  script: exit 0
+
+test:
+  stage: test
+  script: exit 0
+```
+
+- `build1` is skipped because it's `when: on_failure` and its previous status is not "failed".
+- `build2` runs and succeeds.
+- `test` runs because "success" + "skipped" is a successful state.
+
+### Problem 5: The `dependencies` keyword
+
+The [`dependencies`](../../../ci/yaml/index.md#dependencies) keyword is used to define a list of jobs to fetch
+[artifacts](../../../ci/yaml/index.md#artifacts) from. It is a shared responsibility with the `needs` keyword.
+Moreover, they can be used together in the same job. We may not need to discuss all possible scenarios, but this example
+is enough to show the confusion:
+
+```yaml
+test2:
+  script: exit 0
+  dependencies: [test1]
+  needs:
+    - job: test1
+      artifacts: false
+```
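+
+The two keywords above give opposite answers for the same job: `dependencies` asks for the artifacts of `test1`, while `needs` disables them. As a sketch, each intent can be stated unambiguously with `needs` alone (the job names `test2_with_artifacts` and `test2_without_artifacts` are hypothetical):
+
+```yaml
+test2_with_artifacts:
+  script: exit 0
+  needs:
+    - job: test1
+      artifacts: true # default for `needs`
+
+test2_without_artifacts:
+  script: exit 0
+  needs:
+    - job: test1
+      artifacts: false
+```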
+
+### Information 1: Canceled jobs
+
+Are a canceled job and a failed job the same? They differ in many ways, so we could easily say "no".
+However, they have one similarity: they can be "allowed to fail".
+
+Let's define their differences first:
+
+- A canceled job:
+  - It is not a finished job.
+  - Cancellation is a user-requested interruption of the job. The intent is to abort the job or stop pipeline processing as soon as possible.
+  - We don't know the result, there are no artifacts, etc.
+  - Since it's never run, the `after_script` is not run.
+  - Its eventual state is "canceled" so no job can run after it.
+    - There is no `when: on_canceled`.
+    - Even jobs with `when: always` do not run.
+- A failed job:
+  - It is a machine response of the CI system to executing the job content. It indicates that execution failed for some reason.
+  - It is as valid an answer from the system as success. Whether a failure is bad is relative:
+    it might be the desired outcome of CI execution, for example when running tests and some of them fail.
+  - We know the result and [there can be artifacts](../../../ci/yaml/index.md#artifactswhen).
+  - `after_script` is run.
+  - Its eventual state is "failed" so subsequent jobs can run depending on their `when` values.
+    - Jobs with `when: on_failure` and `when: always` run.
+
+**The one similarity is that they can be "allowed to fail".**
+
+```yaml
+build:
+  stage: build
+  script: sleep 10
+  allow_failure: true
+
+test:
+  stage: test
+  script: exit 0
+  when: on_success
+```
+
+- If `build` runs and gets `canceled`, then `test` runs.
+- If `build` runs and gets `failed`, then `test` runs.
+
+### Information 2: Empty state
+
+We [recently updated](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/117856) the documentation of
+[the `when` keyword](../../../ci/yaml/index.md#when) for clarification:
+
+> - `on_success`: Run the job only when no jobs in earlier stages fail or have `allow_failure: true`.
+> - `on_failure`: Run the job only when at least one job in an earlier stage fails.
+
+For example:
+
+```yaml
+test1:
+  when: on_success
+  script: exit 0
+  # needs: [] would lead to the same result
+
+test2:
+  when: on_failure
+  script: exit 0
+  # needs: [] would lead to the same result
+```
+
+- `test1` runs because no job failed in the previous stages.
+- `test2` does not run because no job failed in the previous stages.
+
+`on_success` means that "nothing failed"; it does not mean that everything succeeded.
+The same goes for `on_failure`: it does not mean that everything failed, but that "something failed".
+These semantics follow the expectation that your pipeline succeeds, which is the happy path,
+rather than that your pipeline fails, which requires user intervention to fix.
+
+## Goals
+
+- The `allow_failure` keyword must only be responsible for marking failed jobs as "success with warning".
+  - This also means that canceled jobs must not be marked as "success with warning".
+- The `when` keyword must only answer the question "What's required to run?". It must also be the only source of truth
+  for deciding if a job should run or not.
+  - The "skipped" and "ignored" states must be reconsidered.
+  - The "never" condition must be reconsidered.
+- A new keyword structure must be introduced to specify if a job is an "automatic", "manual", or "delayed" job.
+- The `needs` keyword must only control the order of the jobs. It must not be used to control the behavior of the jobs
+  or to decide if a job should run or not.
+  - The DAG and STAGE behaviors must be the same.
+- The `needs` and `dependencies` keywords must not be used together in the same job.
+
+## Non-Goals
+
+We will not discuss how to avoid breaking changes for now.
+
+## Proposal
+
+We discussed some alternative solutions in
+the ["Restructure CI job when keyword"](https://gitlab.com/groups/gitlab-org/-/epics/6788) epic. Here, I'd like to
+summarize the final proposal.
+
+- `when:on`: Define state requirements to run a job.
+  - Can be a string or an array of strings.
+  - Possible values: `success`, `failed`, `always`.
+- `when:start_in`: Define how a job is started.
+  - Possible values: `immediately`, `manual`, `<time>` (delayed).
+  - Alternative syntax: `start: immediately`, `start: manually`, `start: in <time>`.
+- `output:manual_block`: Define if a manual job is a blocker or not.
+  - If this is `true`, then no subsequent job can run until this job is finished.
+  - If this is `false`, then the manual job will be "skipped" and subsequent jobs with `when: on_skipped` can run.
+
+```yaml
+test1:
+  when:
+    on: success # default
+    start_in: immediately # default
+  output:
+    allow_failure: false # default
+
+test2:
+  when:
+    on: success # default
+    start_in: manual
+  output:
+    manual_block: false # default
+    allow_failure:
+      exit_codes: [1,2,3]
+
+test3:
+  when:
+    on: success # default
+    start_in: manual
+  output:
+    manual_block: true
+
+test4:
+  when:
+    on: success # default
+    start_in: 1 hour
+
+test5:
+  needs: [test2]
+
+test6:
+  needs: [test2]
+  run:
+    when: [on_success, on_skipped]
+```
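+
+As an illustration of how this syntax could address Problem 1, the following sketch defines a manual job that is only available after a failure, plus a job that uses the array form of `when:on`. The job names and scripts are hypothetical, the syntax is the proposal's and is not implemented, and the value `failed` is spelled as in the list above (the Problem 4.2 examples write `failure`):
+
+```yaml
+publish_failures:
+  script: ./publish-failures.sh # assumed script
+  when:
+    on: failed          # "What's required to run?"
+    start_in: manual    # "How to run?"
+  output:
+    manual_block: false # do not block subsequent jobs while waiting to be played
+
+cleanup:
+  script: ./cleanup.sh  # assumed script
+  when:
+    on: [success, failed] # array form: either state satisfies the requirement
+```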
+
+## Design and implementation details
+
+N/A
+
+## Alternative Solutions
+
+N/A