Automatable DevOps
Opportunity canvas
Problem to solve
GitLab users need higher-level, more expressive (i.e. domain-specific) tools to be more efficient.
GitLab offers a single application for the whole DevOps cycle. That application also has an API for programmatic interaction. However, we don't let users express their intent to the application in a ubiquitous language.
Here is an analogy to explain what I mean. Imagine you are talking to your phone's smart assistant named Siri.
- Hey, Siri. Open contacts, type M, I, K, H, A, I, L. Open the first matching contact. Tap on "mobile phone" number to make the call. Turn off the screen, so that I don't tap on something with my ear.
Could be better, right? In fact, most of the software today is like that. We are still mostly in the stone age. A much better UX would be:
- Hey, Siri. Call Mikhail.
As you can see above, both are interactions with the smart assistant (i.e. it's not a case of old technology vs new technology). For the assistant to be an actual leap forward, rather than merely "speak instead of tap", it needs to operate in the user's domain. It's "how vs what", or "implementation-specific vs domain-specific". Put another way: either the user is brought down to the level where the tool operates, or the tool is brought up to the level where the user operates. At least that's how I understand it.
Here are some examples in the "domain-specific" style of interaction. As a user I'd like to be able to explain the following to GitLab:
- When a commit is merged into the main branch, a build should run, and on success certain artifacts should be combined into a release, which then should be rolled out to a pre-production environment using a certain deployment method, soaked there under synthetic load for 1 day, then promoted to staging, and after 1 more day gradually rolled out to production. The production rollout should start with a canary deployment, then be scaled up in 10% increments each hour. For that deployment I'd like to have anomaly detection monitoring certain metrics; it should stop the rollout if something unusual is detected and notify the SRE team. If things are really bad (i.e. certain metrics breach certain thresholds), create an incident issue and start rolling the deployment back. Keep the incident issue up to date with what's happening with the deployment.
- When an issue in a certain project is created, analyze it with machine learning. Depending on the outcome, take certain actions:
  - Is it spam? Delete the issue.
  - Is it abusive content? Delete the issue and report it to the authorities.
  - Is it a normal issue? Label it with X, Y, and any labels computed by the job itself.
- When a new release is created, run a Kustomize processing job on it and on some templates from repository X. It will generate certain artifacts (YAML files). Commit those artifacts to a (different) repository in a new branch. Open an MR with the changes and assign the person who created the release. Also, send a notification to the Slack channel. When the MR is approved, notify the Slack channel. When it's merged, notify the Slack channel. (This addresses use cases such as #327872.)
You get the idea.
Proposed solution
We need a Domain-Specific Language (DSL) that can be used to express domain-specific workflows.
The user wants to describe a DevOps workflow that is triggered by an event. This can be modelled as a graph of actions that needs to be traversed and executed, starting at a certain vertex. We need a DSL to allow the user to construct that graph.
Most of the events that can happen in GitLab (the application) should be usable as a trigger for a workflow.
A workflow should also be able to attach another workflow to a domain entity (i.e. event source).
We should provide and maintain some actions as part of GitLab (the application), but the users should be able to create and share their own actions. This should be 100% self-service, with us being "out of the loop". For example, someone might create an action to upload a code coverage report to a 3rd party service and make the action available for others to use.
Proposed implementation
This is technical stuff; you don't have to understand or even read it to consider the idea itself. I just decided to write down what I have thought of already, in case it's useful for a technical conversation.
Declarative vs imperative
If we start thinking of an implementation from a user's perspective, obviously the first question a user would ask is "How do I describe what I want to the system?". Should this graph be represented explicitly as a declarative definition or implicitly as an imperative program? Some data points:
- If we look at our CI YAML format (example), it's declarative, but not fully - we have conditions on jobs, etc. that are evaluated during the "runtime" of the build.
- If we look at a Helm Chart (example), it's declarative, but not fully - Charts are full of templated strings, conditions, loops, and custom functions.
- I think this article makes a very good point: Software infrastructure 2.0: a wishlist:
  Now suddenly you move from YAML to YAML generated using Jinja or Handlebars or whatever. Slowly, you start adding custom functions to those template languages to make it easier to generate configuration. Eventually, it evolves into its own super-custom DSL with its own documentation.
  This is super annoying! 10 times out of 10, I prefer to have everything accessible through a nice little client library. This library might in turn be a simple wrapper around a solid API. Now I can write my own for-loops! I can generate things dynamically! I don't have to learn a custom DSL! The world is a happy place again.
- We should be aware of the configuration complexity clock and not fall into the trap described there.
Imperative?
So, shall we create a library for interacting with each domain entity, let the user code what they want, and call it a day? The benefit is that you can unit test that code. But what are the drawbacks?
The drawback of the fully imperative approach, as far as we are concerned, is that it's basically impossible to understand what the program is doing without executing it. Why do we care?
- To check that it's correct (at least to some extent) without executing it.
  It's like interpreted dynamically typed languages vs compiled statically typed languages: to check that the types match, you either have to execute every single line of the program, or just run the compiler.
- To show a rich, meaningful UI for it. This is only possible if we are able to understand what a workflow would do (i.e. we are able to write a program that analyzes the user-supplied workflow definition).
  We can have a UI for CI pipelines that shows the stages and jobs, or the graph, only because .gitlab-ci.yml has a well-known declarative structure and we don't have to run the build just to understand that structure.
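As a minimal illustration of that point (the pipeline dict below stands in for a parsed .gitlab-ci.yml; the field names are illustrative, not the real CI schema), a well-known declarative structure can be analyzed with a trivial pure function, no build required:

```python
# Sketch: because a declarative pipeline definition has a well-known
# structure, its shape can be recovered without running anything.
# This dict stands in for a parsed .gitlab-ci.yml (illustrative names).

pipeline = {
    "stages": ["build", "test", "deploy"],
    "jobs": {
        "compile": {"stage": "build"},
        "unit-tests": {"stage": "test", "needs": ["compile"]},
        "deploy-prod": {"stage": "deploy", "needs": ["unit-tests"]},
    },
}

def jobs_by_stage(pipeline):
    """Group job names by stage - pure inspection, no execution."""
    grouped = {stage: [] for stage in pipeline["stages"]}
    for name, job in pipeline["jobs"].items():
        grouped[job["stage"]].append(name)
    return grouped

print(jobs_by_stage(pipeline))
# {'build': ['compile'], 'test': ['unit-tests'], 'deploy': ['deploy-prod']}
```

This is exactly the kind of static analysis that a fully imperative workflow definition would make impossible.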
Declarative?
So, shall we embrace the opposite, declarative approach? There are many examples of tools that take this route (to some extent) - Terraform, Kubernetes, Ansible, etc.
These tools, except for Kubernetes, are actually hybrid - they allow the use of conditions, loops, etc.
Only Kubernetes is truly declarative (and even then not fully - you cannot create a namespaced object before its namespace exists; order still matters). Kubernetes makes you either write all the boilerplate by hand or generate it using some other tool (e.g. Helm, Kustomize), which is what most people do.
The choice
So what we have here is a choice:
- A fully imperative tool.
- A hybrid tool.
- An imperative generator for a declarative tool.
I suppose we don't want to provide a declarative-only interface and leave it up to the community to figure out the best way to generate the input for it. I believe we should provide a solution that is self-sufficient.
But what about the benefits of declarativeness? How can we analyze the "hybrid" input with conditions without executing it?
Layered approach
I think the answer here is to have a layered solution:
graph TD
User -->|Workflow DSL| L1(Transformation layer)
L1 -->|Declarative graph description| L2(Execution layer)
- Transformation layer takes in the workflow DSL and generates the declarative graph description for the execution layer.
- Execution layer takes the declarative description and traverses the graph, executing actions.
For this to work, the transformation performed on the workflow DSL must be a pure function, i.e. it must not actually do anything; it must only perform the transformation, without side effects and without consulting any external sources. With this property guaranteed, we can execute the transformation, get the output graph, and work with it - e.g. visualize it in a nice UI - without actually running the workflow.
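A minimal sketch of the transformation layer, with made-up names and a deliberately trivial DSL (a linear chain of steps), to make the purity property concrete:

```python
# Sketch of the transformation layer. It must be a pure function:
# same DSL input, same graph output - no I/O, no clock, no randomness.

def transform(workflow_dsl):
    """Transformation layer: workflow DSL -> declarative graph (pure)."""
    steps = workflow_dsl["steps"]
    return {
        "nodes": [s["name"] for s in steps],
        # Each step depends on the previous one - a simple chain.
        "edges": [(steps[i]["name"], steps[i + 1]["name"])
                  for i in range(len(steps) - 1)],
    }

dsl = {"steps": [{"name": "dev"}, {"name": "staging"}, {"name": "prod"}]}

# Because transform() is pure, calling it twice yields the same graph,
# so the result can be rendered in a UI without running the workflow.
assert transform(dsl) == transform(dsl)
print(transform(dsl))
# {'nodes': ['dev', 'staging', 'prod'], 'edges': [('dev', 'staging'), ('staging', 'prod')]}
```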
I know of at least two existing projects that use this approach to solve problems in other domains:
- Pulumi. See Pulumi Architecture for the details.
- Bazel. See Evaluation model. Our transformation layer is similar to the Loading phase + Analysis phase, and our execution layer maps onto the Execution phase.
(I guess Terraform is similar too, with its plan and apply phases.)
Pulumi and Bazel look similar in how they transform the input into the declarative graph, but there is a huge and important difference: Pulumi uses general purpose languages to generate the graph and Bazel uses Starlark.
Starlark
Starlark, a subset of Python, was designed for writing deterministic programs, while TypeScript, JavaScript, Python, Go, and C# (the languages Pulumi supports) are general-purpose Turing-complete languages. For example, in Go, map iteration order is deliberately randomized, so each time a map is iterated the order can be different (Python dict iteration order was similarly undefined before 3.7). Another example is I/O - any general-purpose language can e.g. make network calls or read files. Such things make general-purpose languages a poor choice for the task. It's just too easy to make a subtle mistake that introduces non-determinism. Such mistakes are hard to catch and can easily lead to incidents.
So, our goal is to have a transformation that is a pure function because then we can:
- safely execute it to ensure (to some extent) that it's correct, mitigating the drawback of a dynamic language.
- get the graph representation, to be able to e.g. render it in a UI without actually executing the workflow itself (i.e. without running any actions).
- be sure that the transformation is deterministic, so that each time it's executed the result is the same.
Taking into account all the above, I think we can conclude that:
- General-purpose programming languages are not a good tool for this job.
- Starlark is fit for purpose, which is not a surprise at all, given that it was designed to be used for such transformations.
User experience
Below is a very rough sketch that needs to be properly thought through. I imagine a user would describe a workflow like this.
# Import functions from the "standard library", provided by GitLab
load("@gitlab/events.star", "on_new_release_event")
load("@gitlab/kubernetes_deploy_actions.star", "deploy_release_to_environment")

# Binds a workflow named "promote_release" to the "new release" event.
on_new_release_event(
    name = "promote_new_release",
    # this may be unnecessary if this script is being executed in the context of the repo
    repository = "gitlab.com/example/project",
    triggers = [
        ":promote_release",
    ],
)

workflow(
    name = "promote_release",
    workflow = [
        ":dev_step",
        ":staging_step",
        ":prod_step",
    ],
)

workflow_step(
    name = "dev_step",
    conditions = [
        # no conditions to promote to dev
    ],
    actions = [
        ":deploy_to_dev",
    ],
)

workflow_step(
    name = "staging_step",
    conditions = [
        #
    ],
    actions = [
        ":deploy_to_staging",
    ],
)

workflow_step(
    name = "prod_step",
    conditions = [
    ],
    actions = [
        ":deploy_to_prod",
    ],
)

deploy_release_to_environment(
    name = "deploy_to_dev",
    # configuration
    environment = "dev", # how to properly refer to an environment?
    release = "...", # how to properly refer to a release?
    # etc...
)
What's missing? A lot. I haven't spent much time on the details; I wanted to kick off a discussion about the idea itself, the "big picture". The exact syntax, the semantics, how to define actions, etc. need to be figured out. I think we need to model a Petri net here. We should take inspiration from Bazel on how to define actions.
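To complete the picture, here is a minimal sketch of what the execution layer could do with the declarative graph produced by the transformation layer. All names are illustrative, and a real implementation would dispatch actions to Runner rather than call them in-process:

```python
# Sketch of the execution layer: it receives the declarative graph
# and traverses it, invoking the action attached to each node once
# all of the node's dependencies have completed.

from collections import deque

def execute(graph, run_action):
    """Topologically traverse the graph, calling run_action per node."""
    # Count incoming edges for each node.
    indegree = {node: 0 for node in graph["nodes"]}
    for src, dst in graph["edges"]:
        indegree[dst] += 1
    ready = deque(n for n in graph["nodes"] if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        run_action(node)
        order.append(node)
        # Unlock nodes whose last dependency just finished.
        for src, dst in graph["edges"]:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return order

graph = {"nodes": ["dev", "staging", "prod"],
         "edges": [("dev", "staging"), ("staging", "prod")]}
executed = []
execute(graph, executed.append)
print(executed)  # ['dev', 'staging', 'prod']
```

Conditions, waiting on external events (MR approved, soak time elapsed), and failure handling would layer on top of this basic traversal.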
Action execution and security
Ideally, actions are executed on Runner via the Generic job API. In addition, we'll probably need to build the Starlark interpreter into Runner. It can be a separate binary that Runner calls to evaluate a program. Runner is built to handle user-supplied programs, so it is safe to do that.
Thoughts on related topics
Domain entities
We need to revise and perhaps formalize definitions of the domain entities - the ubiquitous language needs to be defined to ensure we and our users mean the same thing when we use a term.
Below are rough ideas that need work. Just want to share how I think of this stuff.
Domain entity: Artifact
An artifact is the result of some actions. Artifacts are identified by their address and/or id in the system. Artifacts are immutable. An artifact must contain cryptographically signed metadata:
- list of all the sources, artifacts, and tools that were used to produce it so that it's traceable.
- addresses (e.g. URL) of the binaries that are the actual output/artifacts. Includes the size and a cryptographic hash of the contents of each file.
- links to source repo(s), build that produced it, etc.
- who created it.
- when it was created.
- other metadata, such as cryptographically signed results of a security scan (along with the version of the scanner used and its configuration). Security scan results, and similar things, need to be cryptographically signed so that a third party can trust that the creator of the artifact didn't forge them.
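A rough sketch of what such a metadata record could look like, with illustrative field names; signing itself is omitted, only the content hashes are computed:

```python
# Sketch of the metadata an Artifact could carry. Field names are
# made up for illustration; a real record would also be signed and
# embed signed scan results, tool versions, etc.

import hashlib

def describe_artifact(files, sources, created_by, created_at):
    """Build a metadata record for an immutable artifact."""
    return {
        "outputs": [
            {
                "url": url,
                "size": len(content),
                "sha256": hashlib.sha256(content).hexdigest(),
            }
            for url, content in files.items()
        ],
        "sources": sources,        # repos, builds, tools used to produce it
        "created_by": created_by,
        "created_at": created_at,
    }

meta = describe_artifact(
    files={"https://example.com/app.tar.gz": b"binary contents"},
    sources=["gitlab.com/example/project@abc123"],
    created_by="pipeline #42",
    created_at="2021-06-01T12:00:00Z",
)
print(meta["outputs"][0]["size"])  # 15
```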
Domain entity: Release
A release is an immutable, named, versioned collection of artifacts. It must contain cryptographically signed metadata:
- List of all artifacts and/or releases that comprise it.
- who created it.
- when it was created.
An interesting observation: our own use case (a release of GitLab) does not fit our own existing definition of a release. An existing release is bound to a repository, but a release of GitLab contains artifacts from many repositories - Gitaly, Pages, etc. We work around the limitation by storing the versions of those things in files in the GitLab repository. I think there needs to be a first-class feature for this use case, i.e. a Release should probably not be bound to a project (although we can think of a repository as a "holder" for releases); it should describe a collection of artifacts.
On the positive side, we already collect some metadata when a release is created, which is very good.
Things to avoid
I believe it's more important to learn from failure than from success. Here is a list of poor design choices (in my personal opinion) that we should avoid.
Templating configuration
Please see the Parameterization pitfalls section of the Declarative application management in Kubernetes document.
One special problem: when substitution happens, is the string properly escaped? This is hard to tell when there are nested layers - e.g. a string in a shell script in a string in a YAML file.
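A small demonstration of that pitfall (the value and command are made up): splicing a value into a shell command that itself lives inside a config string goes wrong as soon as the value contains shell metacharacters, and each nesting layer needs its own escaping:

```python
# Demonstration of the nested-escaping pitfall: a value is spliced
# into a shell command embedded in a config string. Naive templating
# leaks the value into the shell layer unescaped.

import shlex

release_name = "v1.0; rm -rf /"   # hostile or just unlucky input

naive = f"script: echo Deploying {release_name}"
# The ';' terminates the echo command - the rest runs as a new command.

escaped = f"script: echo Deploying {shlex.quote(release_name)}"
# shlex.quote handles the shell layer; the YAML layer would need
# its own escaping on top of this.

print(naive)    # script: echo Deploying v1.0; rm -rf /
print(escaped)  # script: echo Deploying 'v1.0; rm -rf /'
```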
The above document also talks about configuration DSLs. I believe this does not apply to the DSL proposed here, since there is no underlying configuration API - we are modeling a workflow, not replacing declarative configuration with a DSL. The declarative graph that results from DSL evaluation is an implementation detail, not an output visible to the user.