Cache OCI image downloads for performance (#18237) · Epics · GitLab.org

Cache OCI image downloads for performance

## Problem

Repeatedly downloading the same container image within a single job unnecessarily increases the execution time.

### Proposal

Caching should be handled at both the image level and the layer level.

##### Image caching

The step-runner fetches images using a tag or a digest. For example, `3`, `3.0.2`, `sha256:cf7940fdead4cebfb0c95bf0341e677664c4ddaa126bedb42f02abd924f436b2`. Images in their entirety can be cached, depending on the circumstance.

- If the tag conforms to Semantic Versioning and specifies `MAJOR.MINOR.PATCH` or `MAJOR.MINOR.PATCH-RELEASE`, the entire image can be cached. No remote call is required if the step image is reused.
- If a digest is used, the entire image can be cached. No remote call is required if the step image is reused.
- If the tag is anything else, for example, `pipeline-1234567`, `latest`, `3`, `3.2`, the image should not be cached. A remote call is required to get the manifest. From there, individual layers should be cached.

##### Layer caching

The step-runner should fetch an image layer-by-layer. Each layer should be cached.

- To fetch an image, first fetch the image manifest for the intended platform.
- To fetch each layer for an image, first check the cache to see if the layer is cached. If it exists, don't fetch it again.
- When fetching a layer, save it to disk in the cache directory.
  - Presumably, the layer digest should be included in the directory name.
  - Each layer needs to write to a separate directory. Currently, they all write to the same directory.
- It's likely most of this work can be done using a library, for example, [go-containerregistry Cache](https://github.com/google/go-containerregistry/blob/4eb8c4d84ef07af279d5dcc0b210d9eaa7bc79e3/pkg/v1/cache/cache.go#L28).
  - You may need your own `Cache` implementation to write log messages.

##### Implementation concerns

- Write debug logs saying whether or not the cache is reused.
- Adequate testing should be provided to guarantee caching works as expected.
- Subsequent usages of the fetch step should reuse the same cache.
- Create a way of busting the cache if necessary. For example, if the environment variable `CI_STEPS_OCI_CACHE=false` is set, WARN log that caching isn't being used, and don't cache.

### Why not use Semantic Versioning to cache?

For example, if `mystep:3` is already downloaded, can we use it again if downloading `mystep:3.0.2`?

The thinking is that we should only use digest as a means of caching.

Advantages:

- It's clear to the user what version of their step is downloaded and used. If downloading `3`, it will always get the latest, instead of using the latest `3` from cache.
- The way steps run is deterministic, which is particularly important if/when a step is introduced that allows for parallel step execution.

Disadvantages:

- This may require more remote requests than other cache strategies.

### How does this relate to creating a lock file for step dependencies?

If/when locking is implemented, then the lock file will determine what images to download, not some of the policies defined in this issue. The cache will still be used, but only if the lock file image/layer digest matches what is in the cache.

This caching strategy will still be important for steps that don't lock dependencies, or for when a step author is updating step dependencies.
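
The image-level rules under "Image caching" and the `CI_STEPS_OCI_CACHE` switch could be sketched in Go roughly as follows. This is an illustration only, not the step-runner's code: the names `imageCacheable` and `cachingEnabled` are hypothetical, the version pattern is a simplified approximation of Semantic Versioning, the input is assumed to be just the tag or digest (not a full `name:tag` reference), and treating only the literal value `false` as "disabled" is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// fullSemver approximates MAJOR.MINOR.PATCH with an optional -RELEASE suffix.
// It is deliberately simpler than the full Semantic Versioning grammar.
var fullSemver = regexp.MustCompile(`^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$`)

// digestRef matches a sha256 digest such as "sha256:<64 hex characters>".
var digestRef = regexp.MustCompile(`^sha256:[a-f0-9]{64}$`)

// imageCacheable reports whether an image referenced by this tag or digest
// may be cached in its entirety, so that reuse needs no remote call.
// Partial versions ("3", "3.2") and free-form tags ("latest") return false:
// for those, the manifest is fetched remotely and only layers are cached.
func imageCacheable(ref string) bool {
	return digestRef.MatchString(ref) || fullSemver.MatchString(ref)
}

// cachingEnabled reports whether caching is switched on. The getenv
// parameter makes the function testable without touching the process
// environment. Assumption: only the exact value "false" disables caching.
func cachingEnabled(getenv func(string) string) bool {
	return getenv("CI_STEPS_OCI_CACHE") != "false"
}

func main() {
	if !cachingEnabled(os.Getenv) {
		fmt.Println("WARN: OCI caching is disabled via CI_STEPS_OCI_CACHE=false")
	}
	refs := []string{"3", "3.2", "3.0.2", "3.0.2-rc1", "latest", "pipeline-1234567"}
	for _, ref := range refs {
		fmt.Printf("%-18s whole-image cache allowed: %v\n", ref, imageCacheable(ref))
	}
}
```

Keeping the decision in a pure function makes the policy easy to cover with table-driven tests, which fits the testing concern listed above.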

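
The layer-level behaviour described under "Layer caching" (check the cache first, write each layer to its own digest-named directory, log whether the cache was reused) might look like the sketch below. It uses only the Go standard library rather than go-containerregistry, so the type `layerCache`, the `fetchFunc` callback standing in for a registry download, and the file name `layer.tar.gz` are all hypothetical. It also skips verifying the downloaded bytes against the digest, which a real implementation would want.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
)

// fetchFunc stands in for a remote registry call that downloads one layer.
type fetchFunc func(digest string) ([]byte, error)

// layerCache stores each layer under root/<digest>/ so layers never share
// a directory.
type layerCache struct {
	root string
}

// dir returns the per-layer directory. The ':' in "sha256:..." is replaced
// to keep the name portable across filesystems.
func (c layerCache) dir(digest string) string {
	return filepath.Join(c.root, strings.ReplaceAll(digest, ":", "-"))
}

// get returns the on-disk path of the layer, calling fetch only on a miss.
func (c layerCache) get(digest string, fetch fetchFunc) (string, error) {
	blob := filepath.Join(c.dir(digest), "layer.tar.gz")
	if _, err := os.Stat(blob); err == nil {
		log.Printf("DEBUG layer %s: reusing cache", digest)
		return blob, nil
	}
	log.Printf("DEBUG layer %s: not cached, fetching", digest)
	data, err := fetch(digest)
	if err != nil {
		return "", fmt.Errorf("fetching layer %s: %w", digest, err)
	}
	if err := os.MkdirAll(c.dir(digest), 0o755); err != nil {
		return "", err
	}
	// Write to a temporary file and rename, so an interrupted download is
	// never mistaken for a complete cached layer.
	tmp := blob + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return "", err
	}
	if err := os.Rename(tmp, blob); err != nil {
		return "", err
	}
	return blob, nil
}

// runDemo requests the same layer twice and returns how many remote
// fetches were actually made.
func runDemo() (int, error) {
	root, err := os.MkdirTemp("", "oci-layer-cache-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)

	cache := layerCache{root: root}
	fetches := 0
	fetch := func(digest string) ([]byte, error) {
		fetches++
		return []byte("bytes of " + digest), nil
	}
	for i := 0; i < 2; i++ {
		if _, err := cache.get("sha256:aaaa", fetch); err != nil {
			return fetches, err
		}
	}
	return fetches, nil
}

func main() {
	n, err := runDemo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("remote fetches:", n) // prints "remote fetches: 1"
}
```

The second request for the same digest is served from disk, so the fetch callback runs only once. Counting fetches this way is also one simple approach to testing that subsequent uses reuse the same cache.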