We have a long-standing issue: getting started with CI/CD can be tough when it comes to understanding caching and artifacts. People naturally want to use caching, but a cache is not guaranteed to be present, so when it's not available, their pipelines unexpectedly fail. Artifacts are durable and are suitable for passing information between sequential jobs, but have unintended side effects, such as persisting well beyond the pipeline run and being published and made downloadable.
The problem is so significant that people have proposed a simple mode where CI jobs don't run concurrently at all and every job runs on the same runner. Managing state between jobs then goes away, because each job simply inherits the entire state of the job before it. Some people create project-specific runners just for this purpose. Unfortunately, this approach doesn't scale. When your pipeline starts taking long enough that you want to use parallel jobs, you're stuck and have to rewrite your pipeline, suddenly paying attention to all the complexity you were hoping to avoid.
Luckily, it seems CircleCI has introduced a great concept called Workspaces which solves this problem nicely. There's a lot to learn from their implementation, although we'll do things slightly differently. The basic idea is that changes within a workspace are automatically propagated to subsequent jobs.
- A workspace can be defined at the top level, at the job level, or both.
- The default workspace is empty, meaning no workspaces are used unless specified, but enabling a single shared workspace for your entire pipeline is as easy as adding a single top-level declaration.
- Changes from a job are stored as layered changes to the workspace (e.g. a gzip'd tarball in object storage).
- Subsequent jobs get the cumulative changes from all prior jobs (e.g. two parallel jobs won't see each other's changes, but the downstream job will see the union of their changes).
- When there's only a single declaration, the name of the workspace doesn't matter.
- Multiple declarations with unique names allow advanced handling of state information so state isn't unnecessarily shared between jobs that don't need it.
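To make the propagation rule above concrete, here's a minimal sketch in Python of how a job's view of the workspace could be computed. Everything in it is hypothetical and for illustration only: the `needs` graph, the `layers` mapping, and the `upstream_jobs` / `workspace_for` helpers are not part of any existing implementation, and layers are modelled as plain dictionaries rather than tarballs. It also assumes that when two layers touch the same path, the later layer wins, which the proposal doesn't specify.

```python
from typing import Dict, List

# A layer maps a path (relative to the working directory) to file content.
Layer = Dict[str, str]


def upstream_jobs(job: str, needs: Dict[str, List[str]]) -> List[str]:
    """Return all transitive upstream jobs of `job`, dependencies first."""
    ordered: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        for dep in needs.get(name, []):
            if dep not in seen:
                seen.add(dep)
                visit(dep)
                ordered.append(dep)

    visit(job)
    return ordered


def workspace_for(job: str, needs: Dict[str, List[str]],
                  layers: Dict[str, Layer]) -> Layer:
    """Merge the layers of every upstream job into one view.

    Assumption: on a path conflict, the layer applied later wins.
    """
    merged: Layer = {}
    for name in upstream_jobs(job, needs):
        merged.update(layers.get(name, {}))
    return merged


# Hypothetical pipeline: build -> (test_a, test_b in parallel) -> deploy
needs = {
    "build": [],
    "test_a": ["build"],
    "test_b": ["build"],
    "deploy": ["test_a", "test_b"],
}
layers = {
    "build": {"bin/app": "v1"},
    "test_a": {"report_a.xml": "ok"},
    "test_b": {"report_b.xml": "ok"},
}

# test_a sees only build's changes; deploy sees the union of all three.
print(sorted(workspace_for("test_a", needs, layers)))
print(sorted(workspace_for("deploy", needs, layers)))
```

In this sketch the parallel jobs `test_a` and `test_b` never see each other's reports, while `deploy` gets both plus the build output. Whether the real thing walks the dependency graph or simply uses stage order is an open design choice.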
Notes / Questions
- Engineering should read through the CircleCI docs as there are some good technical considerations in there.
- Do we need to let people specify exactly what is stored? Certainly as an advanced option they should be able to limit how much is sent, but equally certainly, we should let people start out easily and let all changes be captured automatically.
- In CircleCI, I believe they only have one (layered) workspace, but use the job dependency graph to determine what data is passed between jobs. This might have advantages over named workspaces. Maybe having multiple named workspaces is unnecessary and it's more important to focus on declaring what to store and who needs to consume it.
- We should be careful, even in the easy case, to not store content like the git repo. We already know how to do that for artifacts, so I imagine we can do something similar to only upload "untracked changes". But maybe that fails for some types of jobs.
- We should have smart handling of caching so a downloaded cache isn't redundantly stored in the workspace. Maybe automatically exclude cache directories?
- This should work well when using different docker images for different jobs. We're not storing all files in the workspace, just files under the working directory, which should be unaffected by the chosen docker image.
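As a rough illustration of the filtering discussed in the notes above (skip the git repo, skip tracked files, skip cache directories), here's a small Python sketch. The function names and inputs are hypothetical: it assumes we already have a list of paths present under the working directory after the job, a list of git-tracked paths, and the configured cache directories. Note that it deliberately ignores modifications to tracked files, which is exactly the case where an "untracked only" rule might fail for some jobs.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def is_under(path: str, directory: str) -> bool:
    """True if `path` is `directory` itself or lives somewhere below it."""
    p = PurePosixPath(path)
    d = PurePosixPath(directory)
    return p == d or d in p.parents


def select_workspace_paths(present: Iterable[str],
                           tracked: Iterable[str],
                           cache_dirs: Iterable[str]) -> List[str]:
    """Pick the paths that would go into a workspace layer.

    Assumptions (not from any existing implementation):
    - all paths are relative to the job's working directory
    - tracked files are never uploaded, even if the job modified them
    """
    tracked_set = set(tracked)
    caches = list(cache_dirs)
    selected: List[str] = []
    for path in sorted(set(present)):
        if path in tracked_set:
            continue  # already available from the git checkout
        if is_under(path, ".git"):
            continue  # never store the repository itself
        if any(is_under(path, cache) for cache in caches):
            continue  # restored from the cache, so don't store it twice
        selected.append(path)
    return selected


# Hypothetical state of the working directory after a build job.
present = [
    "README.md",
    ".git/HEAD",
    "build/app.bin",
    "node_modules/left-pad/index.js",
    "coverage/report.xml",
]
selected = select_workspace_paths(
    present, tracked=["README.md"], cache_dirs=["node_modules/"]
)
print(selected)
```

Because only paths under the working directory are considered, the result doesn't depend on which docker image the job ran in.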