The limitations of CI for CD
Goal
Discuss the domain-based differences between CI and CD to highlight the weaknesses of the current GitLab CI offering, and move towards more CD-friendly pipelines
A bit of motivation that we (Viktor and Sam) agree with:
Invariably, CI systems are over-leveraged to handle the job of CD.
At its core, the goals of CI are markedly different from the goals of CD. CI aims to build and produce an artifact as efficiently as possible. On the other hand, the objective of CD is to carry an artifact as safely as possible to production. Whereas CI is generally a short-lived job, CD is often a long, drawn-out process where promoting something from dev to production might take hours, if not days. (source)
Problems with CI for CD and current workarounds/hacks
Use case | CI hack | Problem |
---|---|---|
A core object in CD is the artifact passed between steps. | Use CI artifacts | CI artifacts are just a path: they are not signed, and they can't be treated as trustworthy inputs to later steps because intermediate steps can easily change the path's contents. CI jobs have no well-defined inputs and outputs, so these (undefined) outputs can't be fed into future jobs in a traceable way. |
A release artifact might be accepted or rejected for consumption by later steps. An accepted artifact should never later become rejected (though it might become obsolete). | Have two jobs; run one, cancel the other. | A user might still be able to re-run the cancelled job. Another workaround would be immutable artifacts (e.g. OCI containers, not pipeline artifacts), but we don't support those either. |
Rich communication with external systems is needed | Run curl in the CI job and wait for the response. | There is no support for pipeline management following external events. curl (and alternatives) is simply not a solution for asynchronous use cases where the external system reports back at some later point and the pipeline needs to continue. |
Long-running processes and conditional retries | Set timeouts and retries as needed | The available workaround is not rich enough: depending on the outcome of the retries, different paths need to be taken. |
Condition checking to manage progress. For example, a canary rollout might check the status every five minutes, waiting for the metrics to settle: continue the rollout if they are OK, or roll back after one hour. | Have a single long-running job, or a scheduled job with multiple pipelines. | Paying for a long-running job is against best practices, and the scheduled job creates a separate pipeline for every check. |
Authentication and authorisation | Tokens and other secrets passed around in CI pipelines | Authentication and authorisation is a key problem in long-lived deployment processes. Many providers have moved past long-lived tokens and keys towards OIDC-based trust, which uses short-lived credentials granted to specific actors/resources with permissions to do certain things. This is inherently more difficult to achieve in CI (though not impossible): access through CI is mainly driven by highly sensitive, long-lived tokens, which could be exposed in CI and are painful to rotate when exposed. An operator inside your cloud environment could instead deploy resources without any long-lived keys, working with OIDC, clear permissions, and an audit trail. |
Eventual consistency | Long-running pipelines / polling | Depending on the size of an environment, or the method used to move users over to the new environment, it may take a long time for the newest changes to roll out to all resources. If the rollout respects long-running jobs that the user has set up, the complete rollout of an environment could take days or more. A CI pipeline is not designed to run this long and could experience issues doing so. |
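To make the canary use case above concrete, here is a minimal sketch of the condition-checking loop as a plain state machine, rather than a long-running CI job. All names (`canary_rollout`, `check_interval`, `max_wait`, `metrics_ok`) are illustrative assumptions, not a proposed API; a real CD engine would sleep durably between checks instead of holding a runner.

```python
# Illustrative sketch: the canary "check every five minutes, roll back
# after one hour" loop expressed as a state machine. The metrics check is
# injected as a callable so the logic stays testable without a runner.

def canary_rollout(metrics_ok, check_interval=300, max_wait=3600):
    """Poll metrics every `check_interval` seconds (simulated) until
    they settle; continue the rollout if OK, roll back after `max_wait`."""
    elapsed = 0
    while elapsed < max_wait:
        if metrics_ok(elapsed):
            return "continue_rollout"
        elapsed += check_interval  # a CD engine would sleep durably here
    return "roll_back"
```

In a CI pipeline, each iteration of this loop is either a billed minute on a held runner or a whole new scheduled pipeline; in a workflow engine it is one durable timer.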
Overall, the impression is that CD needs a more rigorous workflow engine than CI requires. A workflow engine that can support CD could likely support CI too, but a CI engine cannot support CD use cases.
Additional wins
If we had a workflow engine flexible enough to support CD workflows, it would likely also be easier to support other use cases, such as generic automations across GitLab (e.g. add a label when something happens, or run a script when a label is added to an issue).
What is needed in technical terms?
Based on discussions with engineers (@ash2k) we need an engine that can model Petri nets. DAGs are a subset of Petri nets.
The current, heavily YAML-focused CI syntax provides great value and relatively easy onboarding for CI use cases, but it might be impossible to generalise for CD workflows. It is recommended to think about a lower-level programming model that exposes a YAML API for CI and other low-code/no-code users; the lower-level programming interface itself could be opened up to expert users.
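As a rough sketch of this layering idea (every name here is a hypothetical illustration, not a proposed GitLab API): a small programmatic workflow core, with YAML acting as just one front-end that compiles into it.

```python
# Hypothetical layering sketch: a lower-level workflow model that expert
# users could program directly, plus a compiler from a parsed-YAML job
# list into that model for low-code/no-code users.

class Workflow:
    def __init__(self):
        self.steps = []  # ordered (name, callable) pairs

    def step(self, name, fn):
        self.steps.append((name, fn))
        return self  # allow chaining in the programmatic API

    def run(self, ctx):
        for name, fn in self.steps:
            ctx = fn(ctx)  # each step transforms the shared context
        return ctx

def from_yaml_dict(spec, registry):
    """Compile a parsed-YAML job list into the lower-level model,
    resolving each job's `uses` key against a step registry."""
    wf = Workflow()
    for job in spec["jobs"]:
        wf.step(job["name"], registry[job["uses"]])
    return wf
```

The point of the split is that the YAML layer stays simple for CI, while CD features (cycles, external events, durable waits) land in the programmable core without forcing them through YAML syntax.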
Prior art
- ERP systems have workflow engines (e.g. Odoo, an OSS ERP)
- Temporal is a full featured OSS programming model for workflows