Shared Workspaces MVC

Problem to solve

A very common use case for pipelines is a series of jobs that transform and/or analyze a working set of files. Most builds essentially work this way, and the pattern is becoming even more common for data science models. At the moment GitLab provides artifacts as one option, but artifacts are meant for longer-term storage, so persisting intermediary files as artifacts is wasteful. We also provide caching, but the cache is best-effort (not guaranteed), so using it as a workspace can cause surprises: builds unexpectedly fail or behave strangely because their intermediary files are missing.

Intended users

This feature is used by Delaney, Development Team Lead, and Devon, DevOps Engineer, to make builds easier to understand.

Further details

This issue is related to https://gitlab.com/gitlab-org/gitlab-ce/issues/22972, which creates the possibility of child pipelines. Allowing child pipelines to run separately simplifies the configuration for this issue significantly, because we can treat each pipeline as an individual unit that naturally maps to a single workspace. Previously, with more advanced branching within a single pipeline, determining which branches should get access to which workspaces was non-trivial.

Proposal

With this issue we will allow a workspace to persist from job to job within a single pipeline. Simply let GitLab know in your .gitlab-ci.yml that intermediary output is produced in your pipeline, and we'll automatically ensure that downstream jobs get access to upstream outputs.

In order to achieve this we do the following (note that an ephemeral keyword is described here, but this is probably a future enhancement):

Introduce top-level workspace: shared that enables a saveable shared workspace globally,
Introduce job-level workspace: ephemeral that marks a particular job as only using the workspace, but not saving to it (i.e., changes will be discarded),
Since the Runner has to be shared, we require tags: to be the same across all jobs,
Since services: would likely have to be shared, we require all jobs to have the same services:,
We require all jobs to use exactly the same image:,
Specifying workspace: ephemeral might later allow us to optimize for Kubernetes: share a volume across nodes, add data on top of it using a copy-on-write filesystem, and scale horizontally,
To retry a job, we would retry all jobs that have workspace: shared defined and ignore ephemeral jobs,
We would require the first stage to have at most one job, as this job pre-seeds the workspace,
In the next stages, an unlimited number of jobs could be defined, and the runner could later run them in parallel on the same workspace, which would allow vertical scaling,
A runner would remove the workspace once done, i.e., when there are no more jobs.

Example

workspace: shared

default:
  image: ruby:2.1
  services:
    - postgres:10.1
  tags: [select-my-runner]

bundle_install:
  stage: build
  script:
    - bundle install --path vendor/

rspec:
  stage: test
  workspace: ephemeral
  script:
    - bin/rspec

rubocop:
  stage: test
  workspace: ephemeral
  script:
    - bin/rubocop

staging:
  stage: deploy
  script:
    - deploy somewhere?
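
For contrast, here is a sketch of a configuration that the rules in the Proposal section would presumably reject. The workspace: keyword is only proposed in this issue and does not exist in GitLab CI today; the yarn_install job, the node:10 image, and the other-runner tag are hypothetical names introduced purely for illustration. How the violation would be reported to the user is not specified here.

```yaml
# Hypothetical counter-example: should be rejected under the proposed rules.
workspace: shared

default:
  image: ruby:2.1
  tags: [select-my-runner]

bundle_install:
  stage: build
  script:
    - bundle install --path vendor/

# A second job in the first stage: violates "at most one job in the first stage".
yarn_install:
  stage: build
  image: node:10        # differs from the default image: violates "exactly the same image:"
  tags: [other-runner]  # differs from the default tags: violates "same tags: across all jobs"
  script:
    - yarn install
```

Removing yarn_install, or moving it to a later stage and dropping its image: and tags: overrides, would bring this back in line with the rules as they are written above.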

Limitations

Workspaces do not provide a shared workspace for concurrently running jobs.

Permissions and Security

This issue does not impact the permission or security roles and how they interact with GitLab CI/CD as the changes are limited to the structure of the .gitlab-ci.yml.

Workspaces should contribute to storage limits.

Documentation

Testing

Testing both the positive and negative cases should be simple: enable or disable workspaces for certain jobs, all of which are guaranteed to create a certain file. If the workspace is enabled, a previous job created a file, and that file is present in the next job, the positive case passes. If the workspace is NOT enabled on a subsequent job and the file is not present in that next job, the negative case passes.
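
One way such a test pipeline might look is sketched below. It assumes the proposed workspace: keyword behaves as described in the Proposal; the job names, file names, and the alpine:3.10 image are illustrative choices, not part of the proposal. The negative check here uses the proposed ephemeral marker, whose changes are expected to be discarded.

```yaml
# Hypothetical test pipeline; `workspace:` is the keyword proposed in this issue.
workspace: shared

default:
  image: alpine:3.10
  tags: [select-my-runner]

stages:
  - build
  - test
  - deploy

# Single first-stage job that pre-seeds the workspace.
seed:
  stage: build
  script:
    - echo "seeded" > marker.txt

# Positive case: a file written by an upstream job is present.
check_present:
  stage: test
  script:
    - test -f marker.txt

# Writes a file, but is marked ephemeral, so the change should be discarded.
scratch:
  stage: test
  workspace: ephemeral
  script:
    - echo "temp" > scratch.txt

# Negative case: the file from the ephemeral job must not have been saved.
check_discarded:
  stage: deploy
  script:
    - test ! -f scratch.txt
```

The same pipeline with the top-level workspace: shared line removed, and check_present inverted to test ! -f marker.txt, would cover the case where no workspace is enabled at all.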

What does success look like, and how can we measure that?

Add a boolean corresponding with each existing usage ping of "Pipeline run" to state either workspace enabled or no workspace enabled based on the presence or absence of workspace: in the .gitlab-ci.yml for that pipeline.

Links / references

https://gitlab.com/gitlab-org/gitlab-ce/issues/47062 was the original issue from which this was derived.
See CircleCI workspaces, which implement a similar primitive. There's a lot to learn from their implementation, although we'll do things slightly differently. The basic idea of changes within a workspace automatically propagating to subsequent jobs is still at the core of their implementation.
Jenkins also uses the concept of workspaces natively as it was originally built in a pre-containerized world where there were "jenkins slaves" that were actual computers (virtual or otherwise) used as build agents. Thus the concept of a folder on the disk as the "workspace" for a project was core to how Jenkins works.
The team sees this as a critical primitive that is required to advance our overall vision for CI/CD.

Edited Aug 21, 2019 by Jason Yavorsky