Shareable workspaces between jobs MVC
Note: this proposal is closely related to our MVC for sticky runners (#17497). It may help to also look at that issue, as well as epic &1418, which unifies the two.
Problem to solve
A very common use case for pipelines is a series of jobs that transform and/or analyze a working set of files. Most builds essentially work this way, and the pattern is becoming even more common for data science models. At the moment GitLab provides artifacts as one option, but these are meant for longer-term storage, so persisting intermediary files in them for a long time is a waste. We also provide caching, but the cache is best-effort (so not guaranteed), and using it as a workspace can cause surprises: builds may unexpectedly fail or behave strangely because their intermediary files are missing.
Intended users
This feature is used by Delaney, Development Team Lead and Devon, DevOps Engineer to make builds easier to understand.
Further details
This issue is related to #16094 (closed), which creates the possibility of child pipelines. Allowing child pipelines to run separately simplifies the configuration for this issue significantly, because we can treat pipelines as individual units that already match the scope where a single workspace makes sense. Previously, with more advanced branching within a single pipeline, determining which branches should get access to which workspaces was non-trivial.
This would also be the missing piece to help us achieve parity with CircleCI's caching options. We have artifacts (permanent) and caches (semi-permanent, but not guaranteed) today, but are missing workspaces.
Proposal
With this issue we will allow a workspace to persist from job to job within a single pipeline. Simply let GitLab know in your .gitlab-ci.yml that intermediary output is produced in your pipeline, and we'll automatically ensure that downstream jobs get access to upstream outputs.
In order to achieve this we do the following (note that an ephemeral keyword is described here, but this is probably a future enhancement):
- Introduce top-level workspace: shared that enables a saveable shared workspace globally,
- Introduce job-level workspace: ephemeral that marks a particular job as only using the workspace, but not saving to it (i.e., changes will be discarded),
- Since Runner has to be shared, we require tags: to be the same across all jobs,
- Since services: likely would have to be shared, we require all jobs to have the same services:,
- We allow using exactly the same image: only,
- Specifying workspace: ephemeral would maybe allow us to provide optimization for Kubernetes later: share volume across nodes, and add data on top of it using Copy on Write filesystem: and be able to perform horizontal scaling,
- To retry job we would retry all jobs that have workspace: shared defined, we would ignore ephemeral,
- We would require in the first stage to at most have one job, as this would be a job that pre-seeds workspace,
- In the next stages, an unlimited amount of jobs could be defined, and the runner could later run them in parallel on the same workspace: and be able to do vertical scaling,
- A runner would remove workspace once done, i.e., no more jobs.
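The structural constraints in the list above (same tags:, services: and image: across all jobs, and at most one job in the first stage) could be checked when the configuration is parsed. The following is only a rough sketch of such a check, not an actual GitLab implementation: the function name validate_shared_workspace is hypothetical, the configuration is assumed to be an already-parsed dictionary, and job-level values are assumed to fall back to the default: section.

```python
from typing import Any


def validate_shared_workspace(config: dict[str, Any], stages: list[str]) -> list[str]:
    """Return human-readable violations of the proposed shared-workspace rules.

    `config` is an already-parsed .gitlab-ci.yml (hypothetical input shape);
    `stages` is the ordered list of stage names for the pipeline.
    """
    errors: list[str] = []

    # The rules only apply when the top-level keyword enables a shared workspace.
    if config.get("workspace") != "shared":
        return errors

    defaults = config.get("default", {})
    reserved = {"workspace", "default", "stages"}
    jobs = {
        name: body
        for name, body in config.items()
        if name not in reserved and isinstance(body, dict)
    }

    # tags:, services: and image: must resolve to the same value for every job.
    for key in ("tags", "services", "image"):
        # repr() makes list values hashable so they can be compared in a set.
        resolved = {repr(body.get(key, defaults.get(key))) for body in jobs.values()}
        if len(resolved) > 1:
            errors.append(f"all jobs must use the same {key}:")

    # The first stage may hold at most one job: the one that pre-seeds the workspace.
    if stages:
        first_stage_jobs = [n for n, b in jobs.items() if b.get("stage") == stages[0]]
        if len(first_stage_jobs) > 1:
            errors.append("the first stage may contain at most one job")

    return errors
```

Returning a list of messages rather than raising on the first problem lets a linter report every violation in one pass.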
Notes / Questions
- Engineering should read through the CircleCI docs, as there are some good technical considerations in there.
- Do we need to let people specify exactly what is stored? Certainly as an advanced option they should be able to limit how much is sent, but equally as certainly, we should let people start out easily and let all changes be captured automatically.
- In CircleCI, I believe they only have one (layered) workspace, but use the job dependency graph to determine what data is passed between jobs. This might have advantages over named workspaces. Maybe having multiple named workspaces is unnecessary and it's more important to focus on declaring what to store and who needs to consume it.
- We should be careful, even in the easy case, to not store content like the git repo. We already know how to do that for artifacts, so I imagine we can do something similar to only upload "untracked changes". But maybe that fails for some types of jobs.
- We should have smart handling of caching so a downloaded cache isn't redundantly stored in the workspace. Maybe automatically exclude cache directories?
- This should work well when using different docker images for different jobs. We're not storing all files in the workspace, just files under the working directory, which should be unaffected by the chosen docker image.
- Is this compatible with shell runners? What kinds are possible or not?
- How does this work when the intermediary objects are very large? Do we need a hybrid solution with #17497?
- How do we limit the workspace size when it is very large?
- What will be our plan for sharing multiple source workspaces to a single upstream (as in a DAG, but also common in other flows), i.e. #32814 (closed) and #20686 (closed)?
Links / references
- Previous issue: https://gitlab.com/gitlab-org/gitlab-ce/issues/41947
- https://circleci.com/blog/persisting-data-in-workflows-when-to-use-caching-artifacts-and-workspaces/
- https://circleci.com/blog/deep-diving-into-circleci-workspaces/
Example
workspace: shared

default:
  image: ruby:2.1
  services:
    - postgres:10.1
  tags: [select-my-runner]

bundle_install:
  stage: build
  script:
    - bundle install --path vendor/

rspec:
  stage: test
  workspace: ephemeral
  script:
    - bin/rspec

rubocop:
  stage: test
  workspace: ephemeral
  script:
    - bin/rubocop

staging:
  stage: deploy
  script:
    - deploy somewhere?
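To make the example easier to read, the sketch below works out which mode each job ends up with. It assumes, as one interpretation of the proposal, that a job without its own workspace: key inherits the top-level value. The function name effective_workspace_modes and the dictionary form of the configuration are illustrative only.

```python
def effective_workspace_modes(config: dict) -> dict[str, str]:
    """Map each job name to its workspace mode: 'shared', 'ephemeral' or 'none'."""
    top_level = config.get("workspace")
    reserved = {"workspace", "default", "stages"}
    modes: dict[str, str] = {}
    for name, body in config.items():
        if name in reserved or not isinstance(body, dict):
            continue
        # Assumption: a job-level keyword overrides the top-level one.
        modes[name] = body.get("workspace", top_level) or "none"
    return modes


# The example above, written as an already-parsed dictionary.
example = {
    "workspace": "shared",
    "default": {"image": "ruby:2.1", "services": ["postgres:10.1"], "tags": ["select-my-runner"]},
    "bundle_install": {"stage": "build", "script": ["bundle install --path vendor/"]},
    "rspec": {"stage": "test", "workspace": "ephemeral", "script": ["bin/rspec"]},
    "rubocop": {"stage": "test", "workspace": "ephemeral", "script": ["bin/rubocop"]},
    "staging": {"stage": "deploy", "script": ["deploy somewhere?"]},
}

print(effective_workspace_modes(example))
# {'bundle_install': 'shared', 'rspec': 'ephemeral', 'rubocop': 'ephemeral', 'staging': 'shared'}
```

Under this reading, bundle_install seeds the workspace, the two test jobs read it without saving changes, and staging sees whatever bundle_install saved.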
Limitations
- Workspaces could not provide a shared workspace for concurrently running jobs.
Permissions and Security
This issue does not impact the permission or security roles and how they interact with GitLab CI/CD as the changes are limited to the structure of the .gitlab-ci.yml.
Workspaces should contribute to storage limits.
Documentation
Testing
Testing for both the positive and negative test cases should be simple enough to understand by enabling/disabling workspaces for certain jobs, all of which are guaranteed to create a certain file. Then if the workspace is enabled and a previous job created a file, and that file is present in the next job - the positive case passes. If the workspace is NOT enabled on a subsequent job, and the file is not present in that next job - the negative case passes.
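The positive and negative cases can be illustrated with a toy simulation in which temporary directories stand in for the runner's workspace. This is only a sketch of the test logic described above, not runner code: run_job, run_pipeline and marker.txt are made-up names.

```python
import tempfile
from pathlib import Path
from typing import Optional


def run_job(workdir: Path, creates: Optional[str] = None) -> set[str]:
    """Simulate a job: optionally create a file, then list what the job can see."""
    if creates is not None:
        (workdir / creates).touch()
    return {p.name for p in workdir.iterdir()}


def run_pipeline(second_job_uses_workspace: bool) -> bool:
    """Return True if the file created by the first job is visible to the second."""
    with tempfile.TemporaryDirectory() as shared, tempfile.TemporaryDirectory() as fresh:
        # The first job always writes its marker file into the shared workspace.
        run_job(Path(shared), creates="marker.txt")
        # The second job runs in the shared workspace only when it is enabled.
        second_dir = Path(shared) if second_job_uses_workspace else Path(fresh)
        return "marker.txt" in run_job(second_dir)


# Positive case: workspace enabled, file present in the next job.
assert run_pipeline(second_job_uses_workspace=True)
# Negative case: workspace not enabled, file absent in the next job.
assert not run_pipeline(second_job_uses_workspace=False)
```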
What does success look like, and how can we measure that?
Add a boolean corresponding with each existing usage ping of "Pipeline run" to state either workspace enabled or no workspace enabled, based on the presence or absence of workspace: in the .gitlab-ci.yml for that pipeline.
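As a rough sketch, the boolean could be derived from the parsed configuration as shown below. The function name workspace_enabled is hypothetical, and treating a job-level workspace: key as "enabled" is an assumption, since the proposal does not specify where the keyword has to appear.

```python
def workspace_enabled(config: dict) -> bool:
    """Return True if workspace: appears at the top level or inside any section."""
    if "workspace" in config:
        return True
    # Assumption: a job-level workspace: key also counts as "workspace enabled".
    return any(
        isinstance(body, dict) and "workspace" in body
        for body in config.values()
    )
```

For example, workspace_enabled({"workspace": "shared"}) is True, while a configuration containing only ordinary jobs yields False.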
Links / references
- odagrun implementation: https://www.odagrun.com/docs/odagrun-work-spaces-feature-reference/
- https://gitlab.com/gitlab-org/gitlab-ce/issues/47062 was the original issue from which this was derived.
- See CircleCI workspaces, which implement a similar primitive. There's a lot to learn from their implementation, although we'll do things slightly differently. The basic idea of changes within a workspace automatically propagating to subsequent jobs is still at the core of their implementation.
- Jenkins also uses the concept of workspaces natively as it was originally built in a pre-containerized world where there were "jenkins slaves" that were actual computers (virtual or otherwise) used as build agents. Thus the concept of a folder on the disk as the "workspace" for a project was core to how Jenkins works.
- The team sees this as a critical primitive for advancing our overall vision for CI/CD.