Shared "Workspaces" for GitLab CI/CD jobs
Closed in favor of a NEW EPIC.
Problem to solve
GitLab CI/CD was built with many fundamental primitives in mind: a Docker-first design, ephemeral build environments, and repeatable, verifiable build steps have guided many of the architectural decisions of GitLab CI/CD. These principles are directly in line with our north star of speed and scalability.
However, this focus on ephemeral environments does have a few drawbacks. Today, GitLab CI/CD provides a few methods to pass data and information between jobs and stages. Cache and artifacts both have use cases for sending files, data, and information to other jobs. However, caching is a "best effort" layer, and artifacts are much more substantial than many use cases require. For advanced users of GitLab CI/CD, this leaves a middle ground unserved: sharing build environment data, variables, and intermediate build artifacts between jobs and stages.
Ephemeral environments can also be confusing: getting started with CI/CD can be tough when learning about caching and artifacts. Users naturally want to use caching, but the cache is not guaranteed, so when it is unavailable their pipelines fail unexpectedly. Artifacts are durable and are suitable for passing information between sequential jobs, but they have unintended side effects, such as persistence well beyond the pipeline run and the overhead of publishing and downloading artifacts.
Problems this can solve:
- Repositories that generate so much build output (gigabytes) that using artifacts is too slow
- https://gitlab.com/gitlab-org/gitlab-ee/issues/10479
- Sharing volumes between jobs
- Sharing services between jobs
- Monorepos
- Vertical/horizontal scalability
Intended users
This feature is used by Delaney (Development Team Lead) and Devon (DevOps Engineer) to easily share data, information, and intermediate build artifacts between jobs and stages.
Further details
To support our overall vision for CI/CD as a whole, this is a critical primitive that the ~Verify team sees as required to advance our vision.
Proposal
Provide a simple way for a user to specify a "workspace" such that they can easily understand the filesystem environment they are running their jobs in, and preserve that filesystem environment from job to job or even pipeline to pipeline (when desired).
In this way, a workspace is much like a shared disk against which various scripts are run. If one job creates a build artifact binary called `output`, then the next job can expect that `output` exists and can call it to run tests, etc. A "workspace" thus makes running a set of CI/CD jobs very similar to running the same scripts locally.
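As a sketch of how this might look (the `workspace` keyword and this syntax are proposed here, not part of GitLab CI/CD today), a single shared workspace could let a build job hand a compiled binary directly to a test job:

```yaml
# Proposed, hypothetical syntax — the `workspace` keyword does not exist yet.
workspace: shared   # one top-level declaration enables a pipeline-wide workspace

stages:
  - build
  - test

build:
  stage: build
  script:
    - make output            # writes ./output into the shared workspace

test:
  stage: test
  script:
    - ./output --self-test   # ./output persists from the build job
```

Without a workspace, the `build` job would need to declare `artifacts:paths` (and `test` would implicitly download them) to achieve a similar effect.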
Implementation
- Introduce a `workspace` keyword into the `.gitlab-ci.yml` syntax.
- A workspace can be defined at the top level, at the job level, or both.
- By default, no workspace exists unless specified, but enabling a single shared workspace for your entire pipeline is as easy as adding a single top-level declaration.
- When there's only a single declaration, the name of the workspace doesn't matter.
- Multiple declarations with unique names allow advanced handling of state information, so the state isn't shared between jobs that don't need it.
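Multiple named workspaces might look like the following (again, hypothetical syntax for this proposal): jobs that declare the same workspace name share state, while jobs with different names stay isolated:

```yaml
# Hypothetical syntax: two named workspaces keep unrelated state apart.
build-frontend:
  workspace: frontend
  script:
    - npm run build          # writes dist/ into the "frontend" workspace

build-backend:
  workspace: backend
  script:
    - go build -o server .   # writes ./server into the "backend" workspace

test-frontend:
  workspace: frontend        # sees dist/, but not ./server
  script:
    - npm test
```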
Stretch Goals
- Changes from a job are stored as layered changes to the workspace (e.g., gzip'd tarball in object storage)
- Subsequent jobs get all cumulative changes from previous jobs. (e.g., two parallel jobs won't see each other's changes, but the downstream job receives the union of their changes)
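Under this stretch goal, the layered model would merge changes at fan-in points. A sketch (hypothetical syntax) with two parallel jobs feeding a downstream job:

```yaml
# Hypothetical: parallel jobs write disjoint layers; the downstream
# job receives the union of both.
workspace: shared

unit-tests:
  stage: test
  script:
    - ./run-unit > unit.log    # layer A: unit.log

lint:
  stage: test
  script:
    - ./run-lint > lint.log    # layer B: lint.log (doesn't see unit.log)

report:
  stage: report
  script:
    - cat unit.log lint.log    # sees both layers merged
```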
Permissions and Security
This issue does not impact the permission or security roles, or how they interact with GitLab CI/CD, as the changes are limited to the structure of the `.gitlab-ci.yml`.
Documentation
Testing
Testing both the positive and negative cases should be straightforward: enable or disable workspaces for certain jobs, each of which is guaranteed to create a known file.
If the workspace is enabled, a previous job created a file, and that file is present in the next job, the positive case passes.
If the workspace is NOT enabled on a subsequent job, and the file is not present in that job, the negative case passes.
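A minimal test pipeline along these lines (hypothetical syntax) might be:

```yaml
# Hypothetical test pipeline: the first job writes a marker file; the
# positive and negative cases assert its presence or absence.
writer:
  stage: build
  workspace: shared
  script:
    - touch marker.txt

positive-case:
  stage: test
  workspace: shared
  script:
    - test -f marker.txt       # must exist: workspace enabled

negative-case:
  stage: test                  # no workspace declared
  script:
    - test ! -f marker.txt     # must NOT exist: workspace disabled
```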
What does success look like, and how can we measure that?
Add a boolean to each existing "Pipeline run" usage ping indicating either "workspace enabled" or "no workspace enabled", based on the presence or absence of `workspace:` in the `.gitlab-ci.yml` for that pipeline.
Links / references
- Previous issue: https://gitlab.com/gitlab-org/gitlab-ce/issues/41947
- See CircleCI workspaces, which implement a similar primitive. There's a lot to learn from their implementation, although we'll do things slightly differently. The basic idea of changes within a workspace automatically propagating to subsequent jobs is still at the core of their implementation.
- Jenkins also uses the concept of workspaces natively as it was originally built in a pre-containerized world where there were "jenkins slaves" that were actual computers (virtual or otherwise) used as build agents. Thus the concept of a folder on the disk as the "workspace" for a project was core to how Jenkins works.
Original Description
### Description
We have a long outstanding issue that getting started with CI/CD can be tough when it comes to understanding caching and artifacts. People naturally want to use caching, but that is not guaranteed, so when it's not available, their pipelines surprisingly die. Artifacts are durable and are suitable for passing information between sequential jobs, but have unintended side-effects such as persistence way beyond the pipeline run, and publishing and downloading of artifacts.
The problem is so significant that people have proposed making a simple mode where CI doesn't run concurrently at all, so each job can run on the same runner so managing state between jobs goes away as every job just shares the entire state of the job before. Some people create project-specific runners just for this purpose. Unfortunately, this approach doesn't scale. When your pipeline starts taking long enough that you want to use parallel jobs, you're stuck and have to rewrite your pipeline, now suddenly paying attention to all the complexity you were hoping to avoid.
Luckily, it seems CircleCI has introduced a great concept called Workspaces which solves this problem nicely. There's a lot to learn from their implementation, although we'll do things slightly differently. The basic idea is that changes within a workspace are automatically propagated to subsequent jobs.
Proposal
- Introduce a `workspace` keyword into the `.gitlab-ci.yml` syntax.
- A workspace can be defined at the top level, at the job level, or both.
- The default workspace is empty meaning no workspaces are used unless specified, but enabling a single shared workspace for your entire pipeline is as easy as adding a single top-level declaration.
- Changes from a job are stored as layered changes to the workspace (e.g. gzip'd tarball in object storage)
- Subsequent jobs get all cumulative changes from prior jobs. (e.g. two parallel jobs won't see each other's changes, but the downstream job will see the union of their changes)
- When there's only a single declaration, the name of the workspace doesn't matter.
- Multiple declarations with unique names allow advanced handling of state information, so state isn't shared between jobs that don't need it.
Notes / Questions
- Engineering should read through the CircleCI docs as there's some good technical considerations in there.
- Do we need to let people specify exactly what is stored? Certainly as an advanced option they should be able to limit how much is sent, but equally as certainly, we should let people start out easily and let all changes be captured automatically.
- In CircleCI, I believe they only have one (layered) workspace, but use the job dependency graph to determine what data is passed between jobs. This might have advantages over named workspaces. Maybe having multiple named workspaces is unnecessary and it's more important to focus on declaring what to store and who needs to consume it.
- We should be careful, even in the easy case, to not store content like the git repo. We already know how to do that for artifacts, so I imagine we can do something similar to only upload "untracked changes". But maybe that fails for some types of jobs.
- We should have smart handling of caching so a downloaded cache isn't redundantly stored in the workspace. Maybe automatically exclude cache directories?
- This should work well when using different docker images for different jobs. We're not storing all files in the workspace, just files under the working directory, which should be unaffected by the chosen docker image.