Shared Workspaces MVC
Problem to solve
A very common use case for pipelines is a series of jobs that transform and/or analyze a working set of files. Most builds essentially work this way, and the pattern is becoming even more common as a paradigm for data science models. At the moment GitLab provides artifacts as one option, but these are meant for longer-term storage, so persisting intermediate files as artifacts is a waste. We also provide caching, but the cache is best-effort (not guaranteed) and can cause surprises if it's used as a workspace: builds may unexpectedly fail or behave strangely because their intermediary files are missing.
This issue is related to gitlab-ce#22972, which creates the possibility of child pipelines. Allowing child pipelines to run separately simplifies the configuration for this issue significantly: we can treat pipelines as individual units, each of which already maps naturally to a single workspace. Previously, with more advanced branching within a single pipeline, determining which branches should get access to which workspaces was non-trivial.
With this issue we will allow for persisting a workspace from job to job within a single pipeline. Simply let GitLab know in your `.gitlab-ci.yml` that intermediary output is produced in your pipeline, and we'll automatically ensure that downstream jobs get access to upstream outputs.
In order to achieve this we do the following (note that an `ephemeral` keyword is described here, but this is probably a future enhancement):
- Introduce top-level `workspace: shared` that enables a saveable shared workspace globally,
- Introduce job-level `workspace: ephemeral` that marks a particular job as only using the workspace, but not saving to it (i.e., changes will be discarded),
- Since the Runner has to be shared, we require `tags:` to be the same across all jobs,
- `services:` likely would have to be shared, so we require all jobs to have the same `services:`,
- We allow using exactly the same
- `workspace: ephemeral` might later allow us to optimize for Kubernetes: share a volume across nodes, add data on top of it using a copy-on-write filesystem, and scale horizontally,
- To retry a job, we would retry all jobs that have `workspace: shared` defined and ignore jobs marked `workspace: ephemeral`,
- We would require the first stage to have at most one job, as this would be the job that pre-seeds the workspace,
- In the next stages, any number of jobs could be defined, and the runner could later run them in parallel on the same workspace, which would allow vertical scaling,
- A runner would remove the workspace once done, i.e., when there are no more jobs.
```yaml
workspace: shared

default:
  image: ruby:2.1
  services:
    - postgres:10.1
  tags: [select-my-runner]

bundle_install:
  stage: build
  script:
    - bundle install --path vendor/

rspec:
  stage: test
  workspace: ephemeral
  script:
    - bin/rspec

rubocop:
  stage: test
  workspace: ephemeral
  script:
    - bin/rubocop

staging:
  stage: deploy
  script:
    - deploy somewhere?
```
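To make the intended semantics of the example pipeline concrete, here is a minimal in-memory sketch in Python. It is not how the Runner would implement workspaces: the workspace is modeled as a plain dict of file paths, and `run_pipeline`, `jobs_to_retry`, and the file names are hypothetical. It also assumes that "jobs that have `workspace: shared` defined" means every job not marked `ephemeral`.

```python
def run_pipeline(jobs):
    """Run (name, ephemeral, action) jobs in order against one workspace.

    Returns the final workspace and the files each job saw when it started.
    """
    workspace = {}
    observed = {}
    for name, ephemeral, action in jobs:
        # An ephemeral job works on a private copy, so its changes are discarded.
        view = dict(workspace) if ephemeral else workspace
        observed[name] = sorted(view)
        action(view)
    return workspace, observed


def jobs_to_retry(jobs):
    """Jobs that write to the shared workspace; ephemeral jobs are ignored."""
    return [name for name, ephemeral, _ in jobs if not ephemeral]


# Hypothetical stand-ins for the scripts in the example configuration.
EXAMPLE_JOBS = [
    ("bundle_install", False, lambda ws: ws.update({"vendor/gems": "installed"})),
    ("rspec", True, lambda ws: ws.update({"tmp/rspec.log": "ok"})),
    ("rubocop", True, lambda ws: ws.update({"tmp/rubocop.log": "ok"})),
    ("staging", False, lambda ws: None),
]

final_workspace, observed = run_pipeline(EXAMPLE_JOBS)
```

In this model, `rspec`, `rubocop`, and `staging` all see the output of `bundle_install`, while the log files written by the two ephemeral jobs never reach the shared workspace.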
- Workspaces do not provide a shared workspace for concurrently running jobs.
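The constraints listed above (identical `tags:` and `services:` across jobs, and at most one job in the first stage) could be checked with a small validation pass. The following is a sketch only, operating on an already-parsed configuration dict. The function names are hypothetical, services are assumed to be plain strings, and the fallback stage list and the `test` default stage are assumptions about GitLab's defaults rather than part of this proposal.

```python
RESERVED_KEYS = {"workspace", "default", "stages"}
DEFAULT_STAGES = ["build", "test", "deploy"]  # assumed when `stages:` is absent


def effective(job, default, key):
    """Return a job's value for `key`, falling back to `default:`, as a sorted tuple."""
    return tuple(sorted(str(item) for item in job.get(key, default.get(key, []))))


def validate_shared_workspace(config):
    """Return a list of violations of the proposed rules (empty if none)."""
    if config.get("workspace") != "shared":
        return []  # the rules only apply when the shared workspace is enabled
    default = config.get("default", {})
    jobs = {name: body for name, body in config.items()
            if name not in RESERVED_KEYS and isinstance(body, dict)}
    errors = []
    for key in ("tags", "services"):
        if len({effective(job, default, key) for job in jobs.values()}) > 1:
            errors.append(f"all jobs must share the same {key}")
    for stage in config.get("stages", DEFAULT_STAGES):
        in_stage = [name for name, job in jobs.items()
                    if job.get("stage", "test") == stage]
        if in_stage:
            if len(in_stage) > 1:
                errors.append(f"first stage '{stage}' has more than one job")
            break  # only the first non-empty stage is restricted
    return errors


# The example configuration from this issue, as a parsed dict.
EXAMPLE = {
    "workspace": "shared",
    "default": {"image": "ruby:2.1", "services": ["postgres:10.1"],
                "tags": ["select-my-runner"]},
    "bundle_install": {"stage": "build", "script": ["bundle install --path vendor/"]},
    "rspec": {"stage": "test", "workspace": "ephemeral", "script": ["bin/rspec"]},
    "rubocop": {"stage": "test", "workspace": "ephemeral", "script": ["bin/rubocop"]},
    "staging": {"stage": "deploy", "script": ["deploy somewhere?"]},
}

violations = validate_shared_workspace(EXAMPLE)
```

The example configuration passes because every job inherits its tags and services from `default:` and only `bundle_install` runs in the first stage.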
Permissions and Security
This issue does not impact the permission or security roles and how they interact with GitLab CI/CD, as the changes are limited to the structure of the `.gitlab-ci.yml`.
Workspaces should contribute to storage limits.
Testing both the positive and negative cases should be straightforward: enable or disable workspaces for certain jobs, all of which are guaranteed to create a certain file. If the workspace is enabled and a file created by a previous job is present in the next job, the positive case passes. If the workspace is NOT enabled on a subsequent job and the file is not present in that job, the negative case passes.
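The two cases can be sketched with temporary directories standing in for job working directories. This is an illustration under assumptions, not the actual test suite: `run_job`, `create_marker`, `marker_present`, and the `marker.txt` file name are all hypothetical, and a job without a workspace is modeled as starting in a fresh, empty directory.

```python
import tempfile
from pathlib import Path


def run_job(shared_dir, workspace_enabled, script):
    """Run `script` in the shared directory, or in a fresh one if disabled."""
    if workspace_enabled:
        return script(shared_dir)
    with tempfile.TemporaryDirectory() as fresh:
        return script(Path(fresh))


def create_marker(workdir):
    """Upstream job: always creates a known file."""
    (workdir / "marker.txt").write_text("created by upstream job")


def marker_present(workdir):
    """Downstream job: reports whether the upstream file is visible."""
    return (workdir / "marker.txt").exists()


def check_cases():
    """Return (positive_case_passes, negative_case_passes)."""
    with tempfile.TemporaryDirectory() as shared:
        shared_dir = Path(shared)
        run_job(shared_dir, True, create_marker)
        # Positive case: workspace enabled, so the file must be present.
        positive = run_job(shared_dir, True, marker_present)
        # Negative case: workspace disabled, so the file must be absent.
        negative = not run_job(shared_dir, False, marker_present)
    return positive, negative
```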
What does success look like, and how can we measure that?
Add a boolean corresponding with each existing usage ping of "Pipeline run" to state either `workspace enabled` or `no workspace enabled` based on the presence or absence of `workspace:` in the `.gitlab-ci.yml` for that pipeline.
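As a rough illustration, the boolean could be derived from the parsed configuration as follows. The function names and the payload field names are hypothetical and do not reflect the actual usage ping schema. The sketch treats a `workspace:` key at the top level or inside any job as "present".

```python
def workspace_enabled(config):
    """True if `workspace:` appears at the top level or in any job definition."""
    if "workspace" in config:
        return True
    return any(isinstance(body, dict) and "workspace" in body
               for body in config.values())


def pipeline_run_ping(config):
    """Build a hypothetical "Pipeline run" payload carrying the new boolean."""
    return {"event": "Pipeline run", "workspace_enabled": workspace_enabled(config)}


with_workspace = pipeline_run_ping({"workspace": "shared", "build": {"script": ["make"]}})
without_workspace = pipeline_run_ping({"build": {"script": ["make"]}})
```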
Links / references
- gitlab-ce#47062 was the original issue from which this was derived.
- See Circle CI workspaces which implement a similar primitive. There's a lot to learn from their implementation, although we'll do things slightly differently. The basic idea of changes within a workspace automatically propagating to subsequent jobs is still at the core of their implementation.
- Jenkins also uses the concept of workspaces natively as it was originally built in a pre-containerized world where there were "jenkins slaves" that were actual computers (virtual or otherwise) used as build agents. Thus the concept of a folder on the disk as the "workspace" for a project was core to how Jenkins works.
- The team sees this as a critical primitive that is required to advance our overall vision for CI/CD.