Packfile cache experiment plan
Now that we have something we can experiment with (gitlab-org/gitaly!3230, merged), we need to come up with a plan for our experiments.
Let's use this issue to discuss experiments. Once we decide to run one, we give it its own issue.
Feature flag: gitaly_upload_pack_gitaly_hooks
1. (completed) Staging
production#3921. Nothing to report, not enough traffic.
2. (completed) gitlab-org/gitlab-test and gitlab-org/gitaly
Use gitlab-org/gitlab-test to dip our toes in the production water. Then move immediately on to gitlab-org/gitaly, which is so small that it cannot strain the server it resides on, but which does get realistic traffic (the repo is under constant development by a small team).
3. (completed) gitlab-org/gitlab (fork and main)
First fork gitlab-org/gitlab, enable the feature for the fork, and run CI on the fork. Observe hit rates.
Then enable the feature for gitlab-org/gitlab itself. Our simulations show that gitlab-org/gitlab will generate many cache entries, but they will be small because of the CI_PRE_CLONE_SCRIPT. So in terms of bytes written to the cache, this should not be the heaviest workload.
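To make the difference between "many cache entries" and "bytes written" concrete, here is a minimal sketch of the kind of replay one could run over a fetch log. It is not the simulation referred to above. The `Fetch` record, its `cache_key` and `response_bytes` fields, and the `simulate` function are all hypothetical names, and the model assumes an unbounded cache with no expiry or eviction, where the first request for a key is a miss that writes the full response and every later request for that key is a hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fetch:
    # Hypothetical fields, not taken from Gitaly:
    cache_key: str       # identifies requests that would share a cache entry
    response_bytes: int  # size of the packfile response for this request


def simulate(fetches):
    """Replay fetches against an unbounded, never-expiring cache.

    Returns (hit_rate, entries_created, bytes_written).
    """
    seen = set()
    hits = 0
    bytes_written = 0
    for fetch in fetches:
        if fetch.cache_key in seen:
            # Served from the cache: nothing new is written.
            hits += 1
        else:
            # First time we see this key: the response is written once.
            seen.add(fetch.cache_key)
            bytes_written += fetch.response_bytes
    hit_rate = hits / len(fetches) if fetches else 0.0
    return hit_rate, len(seen), bytes_written
```

Feeding it a log with many distinct keys and small responses gives a high entry count but a low `bytes_written`, which is the pattern expected for gitlab-org/gitlab. A log with few keys and very large responses gives the opposite.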
4. (completed) gitlab-com/www-gitlab-com
This repo is not cloned/fetched as often as gitlab-org/gitlab, but because it is very big (even a shallow clone is 1GB) and because there is no CI_PRE_CLONE_SCRIPT, the number of bytes written into the cache would be on the high end of the spectrum. This would really test whether our existing Gitaly servers can handle the extra write IO workload generated by the cache.
As expected, we saw a significant increase in disk write throughput, but the server seems to cope well. production#4010 (comment 534564684)
5. (completed) gitlab-org/gitlab without CI_PRE_CLONE_SCRIPT
The cache is handling a high number of fetches per second for gitlab-org/gitlab, but the fetches are mostly quite small because of the existing caching via CI_PRE_CLONE_SCRIPT. I would like to see how the server copes when we temporarily disable CI_PRE_CLONE_SCRIPT. Disabling it should lead to a significant increase in network egress on file-cny-01, and also in disk write throughput.
Outcome: barely any increase in disk write throughput, but, as predicted, a 4-5x increase in network egress. production#4014 (comment 535837338)