Align on global solution to CI thundering herds
Problem
Users continue to leverage our CI features to build increasingly large CI pipelines. It is not uncommon for some projects to have hundreds of jobs per pipeline.
In addition, repositories are growing larger, particularly those employing monorepos for atomic commits.
The combination of these two trends results in negative impacts for our service.
- Infrastructure costs increase. Each job performs a fresh repo clone, so for each commit our transfer cost is `(# of jobs) * (size of shallow clone)`. In addition, larger repositories and a higher frequency of jobs increase CPU/memory load on Gitaly infrastructure, requiring larger compute instances with higher-performing disks.
- Job start time increases. Each job performs a shallow clone and will not start executing until that clone is complete. As repository size grows, and as load on the Gitaly servers increases, the time to clone increases (or worse, cloning starts to become unreliable).
We should find a solution to these problems: throwing money (infra costs) at the problem is inefficient, and it also degrades UX and limits our ability to scale.
Why we should align
We are already exploring some paths to alleviating this problem, but there are likely potential solutions we are not exploring. We may not be pursuing the globally optimal path.
In addition, it is likely beneficial not to solve this problem in multiple different ways, as the duplicated effort could be better spent elsewhere.
Potential solutions
We are already pursuing multiple solutions today:
- Improving Geo support for offloading CI jobs: &9779 (closed)
- Improving the caching/efficiency of Gitaly
- Manually implementing repository caching within each CI pipeline YAML (an initial job downloads the repository and stores it in object storage; subsequent jobs then retrieve that archive instead of cloning) — a sketch of this pattern follows this list
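A minimal sketch of the manual caching pattern is below, assuming the Runner's cache is backed by object storage. The job names, cache key, and `run-tests.sh` entry point are illustrative assumptions, not a prescribed implementation.

```yaml
# Illustrative only: archive the repo once per commit, reuse it in later jobs.
# Assumes the Runner cache is configured against object storage (e.g. S3/GCS).
stages: [prepare, test]

warm-repo-cache:
  stage: prepare
  variables:
    GIT_STRATEGY: none            # skip the default clone; fetch explicitly below
  script:
    - git clone --depth 1 "$CI_REPOSITORY_URL" repo
    - tar -czf repo.tar.gz repo
  cache:
    key: "repo-$CI_COMMIT_SHA"
    paths: [repo.tar.gz]
    policy: push                  # upload the archive once

test-job:
  stage: test
  variables:
    GIT_STRATEGY: none            # reuse the cached archive instead of cloning
  script:
    - tar -xzf repo.tar.gz
    - cd repo && ./run-tests.sh   # hypothetical test entry point
  cache:
    key: "repo-$CI_COMMIT_SHA"
    paths: [repo.tar.gz]
    policy: pull                  # download only, never re-upload
```

In this pattern only the first job hits Gitaly; every other job in the pipeline pulls the archive from object storage instead.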
There are also likely other opportunities:
- CI artifact caching service: &11024
- Built-in caching of repositories within Runner fleets
- ...
Desired outcomes
Alignment on the globally optimal solution to this problem, and a timeframe and DRI for executing it.