WIP: Resolve "Cache Cloned Repo on Runner"
NOTE: See the epic for more context on the effort to reduce repo cloning time.
UPDATE
This third (and final) attempt at a caching-based approach to reducing the repo clone times was also unsuccessful, for the following reasons:
- The repo size (6 GB) is such that even copying the data on local disk (from the runner-mounted cache dir to the docker job instance dir) can take a significant amount of time. The time also varies, presumably with the IO load on the instance. In testing, it took anywhere from 20 seconds (if the docker image already has it cached) to over 3 minutes. See the sample timings below.
- The lifetime and reuse schedule of the runners is such that cache misses are frequent, possibly even more frequent than cache hits. This is because the runner pool prefers "newer" runners to older ones, so jobs are more likely to get a runner which does not yet have the persistent repo cache available and has to pull it down fresh. This time, coupled with the additional time to copy the cache to the instance, results in an overall longer time than just letting each job pull down the repo.
However, the work in this MR did uncover that GIT_STRATEGY is defaulting to clone instead of fetch. Switching to fetch has the potential to avoid the full clone on many job runs. See #7035 (closed) for more details.
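As a rough illustration of why fetch can be cheaper: with the fetch strategy, GitLab Runner reuses an existing working copy and only falls back to a full clone when none exists. The sketch below mimics just that decision. `choose_git_strategy` is a hypothetical helper, not part of GitLab Runner, and it runs no git commands.

```shell
#!/bin/sh
# Hypothetical helper: report which git operation a job would need for a
# given build directory. This only mimics the decision, it does not run git.
choose_git_strategy() {
  build_dir=$1
  if [ -d "$build_dir/.git" ]; then
    echo "fetch"   # a working copy exists, only new commits are needed
  else
    echo "clone"   # nothing to reuse, the full repo must be downloaded
  fi
}

# Demo with a throwaway directory standing in for the job's build dir.
demo_dir=$(mktemp -d)
echo "empty build dir: $(choose_git_strategy "$demo_dir")"
mkdir "$demo_dir/.git"
echo "existing working copy: $(choose_git_strategy "$demo_dir")"
rm -rf "$demo_dir"
```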
SAMPLE TIMINGS
without runner caching:
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494191515 - 3:52
with runner caching and cache miss:
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494317464 - 5:40
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494315115 - 4:32
with runner caching and cache hit:
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494313795 - 2:34
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494359949 - 3:01
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494431284 - 2:32
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494463828 - 2:26
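To make the trade-off in these samples concrete, the sketch below averages the timings and estimates the cache hit rate at which caching would merely match the uncached job. It assumes the listed values are total job durations in m:ss, and the sample is small, so the result is only indicative. The helper names are illustrative.

```shell
#!/bin/sh
# Convert an m:ss duration to whole seconds.
to_seconds() {
  min=${1%%:*}
  sec=${1##*:}
  sec=${sec#0}   # drop a leading zero so values like "08" are not read as octal
  echo $(( min * 60 + ${sec:-0} ))
}

# Integer average (rounded down) of a list of m:ss durations, in seconds.
average_seconds() {
  total=0
  count=0
  for t in "$@"; do
    total=$(( total + $(to_seconds "$t") ))
    count=$(( count + 1 ))
  done
  echo $(( total / count ))
}

baseline=$(to_seconds 3:52)
miss=$(average_seconds 5:40 4:32)
hit=$(average_seconds 2:34 3:01 2:32 2:26)

# Hit rate (in percent) at which the expected cached time equals the baseline:
#   rate * hit + (100 - rate) * miss = 100 * baseline
break_even=$(( 100 * (miss - baseline) / (miss - hit) ))

echo "baseline=${baseline}s miss=${miss}s hit=${hit}s break_even=${break_even}%"
```

With these samples the break-even hit rate comes out at roughly 50%, so caching only pays off if hits clearly outnumber misses.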
DESCRIPTION
For each job in the www-gitlab-com CI/CD build, the git repo clone currently takes between a minute and a half and two minutes, because the repo is very large and downloading it takes a lot of network time.
If we cache the git clone locally on the runners, then copy it and pull only the new commits since the last image was built, this time could be greatly reduced.
See more details in this Slack thread.
BACKGROUND
This is the third attempt; the two prior attempts were:
However, these were not successful, primarily due to the time it takes to download the actual volume of data of the repo over the network, regardless of how it is packaged.
More context on the prior attempts is available in those issues.
Thus, this approach attempts to avoid that by eliminating the full repo network download from each individual job, and instead only doing it once when the runner is provisioned, or only on the first job which runs on the runner.
IMPLEMENTATION
This implementation reuses some of the techniques from the object-store-based caching attempt, specifically the pre-clone-script.
Notes from Slack Thread
- shared-runners-manager-X and gitlab-shared-runners-manager-X runners are available for any users of GitLab.com. That's why creating a host-mounted volume there is not an option.
- If we limit jobs to run on prmX, then we can try it.
- MR to Add persistent cache volume for gitlab-com prmX runners
  - "The volume and the size of www-gitlab-com will cause an increased usage of disk on the autoscaled VMs. The positive side is that these particular workers are not heavily used. Anyway leaving such note here, for case when it would cause problems :)"
  - "These are private runners used only by gitlab-com group on GitLab.com, so having the persistent, host-mounted volume doesn't create any security problem."
- The volume mount is defined as: /persistent-cache:/persistent-cache, so you should be able to use /persistent-cache in the pre-clone-script to download the initial repo copy or just use the existing one.
- Changes were just uploaded to chef. Within the next 30 minutes the change will be applied to config.toml on the Runners. However, because the autoscaled VMs are re-used on these managers, it may need ~24 hours to be fully rolled out on the VMs.
- The eval $CI_PRE_CLONE_SCRIPT hack is still there, so you should be able to start from here
- To ensure all our runners have the CI_PRE_CLONE_SCRIPT hack, we need to pin the jobs to the prmX runners. It means we should set this for all jobs: tags: - gitlab-org - docker
- And only the prmX Runners for the gitlab-com group have the special volume mounted
- In our case we have three groups of Runners:
  - gitlab-shared-runners-manager-X that have only the gitlab-org tag.
  - private-runners-manager-X that have gitlab-org and docker.
  - shared-runners-manager-X that don't have the gitlab-org tag at all.
- To be sure that the job will be running on prmX we need to use gitlab-org (to exclude srmX runners) and docker (to exclude gsrmX)
- Question: What are the underlying VM types of the private-runners-manager-x.gitlab.com runners (i.e. CPU and RAM)?
  - It's n1-standard-2 in GCP. Which means 2 vCPU and 7 GB
  - They also use a 50 GB SSD disk
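The tag reasoning in these notes can be checked with a small sketch: a runner can pick up a job only if it carries every tag the job requests. `runner_accepts_job` is an illustrative helper, and the tag set given for srmX is an assumption, since the notes only say those runners lack the gitlab-org tag.

```shell
#!/bin/sh
# Return 0 when the runner's tag list contains every tag the job asks for.
runner_accepts_job() {
  runner_tags=$1
  job_tags=$2
  for tag in $job_tags; do
    case " $runner_tags " in
      *" $tag "*) ;;        # runner has this tag, keep checking
      *) return 1 ;;        # a required tag is missing
    esac
  done
  return 0
}

job_tags="gitlab-org docker"
# Tag sets per runner group. The srmX entry is an assumption: the notes only
# state that these runners do not have the gitlab-org tag.
for group in "gsrmX:gitlab-org" "prmX:gitlab-org docker" "srmX:docker"; do
  name=${group%%:*}
  tags=${group#*:}
  if runner_accepts_job "$tags" "$job_tags"; then
    echo "$name can pick up the job"
  else
    echo "$name is excluded"
  fi
done
```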
TASKS
- Set up CI_PRE_CLONE_SCRIPT variable - see documentation here and required contents below (note this is not yet merged or available on live docs site)
- Do not fail if script runs on a runner which doesn't have persistent-cache directory configured.
- Add cache expiration timestamp with format date "+%Y%m%d%H%M%S", and an env var REPO_CACHE_EXPIRE_HOURS to control it. Default to 12.
- Add a REPO_CACHE_EXPIRE_TIMESTAMP optional env var, with a format of date "+%Y%m%d%H%M%S". This can be used to force immediate expiration of cache.
- Ensure all jobs are tagged with gitlab-org AND docker, to ensure they run on the prmX (private-runners-manager-X) runners.
- The CI_PRE_CLONE_SCRIPT doesn't always echo output when it is failing...
  - This was because the jobs weren't tagged to run on the correct runners.
- Handle case if job is terminated or killed while cloning
- Prevent two jobs from trying to clone to the cache at the same time
- Should fall back to regular cloning if another job is currently writing the clone
- Should fall back to regular cloning if the clone to cache fails for any reason
- Are we sure that having only two prmX runners for all of the www-gitlab-com CI/CD pipelines will be sufficient? Are there any metrics to monitor to watch out for resource overload, excessive queuing, etc?
  - As for the metrics, you can look on this panel: https://dashboards.gitlab.net/d/000000159/ci?orgId=1&refresh=5m&fullscreen&panelId=139&var-runner_type=private-runners&var-runner_managers=All&var-gitlab_env=gprd&var-gl_monitor_fqdn=All&var-has_minutes=yes&var-runner_job_failure_reason=All&var-jobs_running_for_project=0&var-runner_request_endpoint_status=All. It should be limited to private-runners only, and you are interested in the workers identified by 76af815c and 73678483
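The cache expiration tasks above could be approached along the following lines. This is only a sketch: both helper names are hypothetical, the cutoff calculation assumes GNU date, and treating REPO_CACHE_EXPIRE_TIMESTAMP as "any cache written before this moment is expired" is an assumed reading of the task. It is written as a standalone script, so variables use a single dollar sign.

```shell
#!/bin/sh
# Timestamps use date "+%Y%m%d%H%M%S", so comparing them as integers
# is the same as comparing them chronologically.

# Hypothetical helper: print the oldest timestamp that still counts as fresh.
# Assumes GNU date (-d is not available in every date implementation).
cache_cutoff_timestamp() {
  hours=${REPO_CACHE_EXPIRE_HOURS:-12}
  date -u -d "$hours hours ago" "+%Y%m%d%H%M%S"
}

# Hypothetical helper: succeed (exit 0) when the cache should be discarded.
#   $1  timestamp recorded when the cache was written
#   $2  cutoff timestamp, e.g. from cache_cutoff_timestamp
#   $3  optional REPO_CACHE_EXPIRE_TIMESTAMP (assumed semantics: caches
#       written before it are treated as expired)
cache_is_expired() {
  cached_ts=$1
  cutoff_ts=$2
  forced_ts=${3:-}
  if [ "$cached_ts" -lt "$cutoff_ts" ]; then
    return 0
  fi
  if [ -n "$forced_ts" ] && [ "$cached_ts" -lt "$forced_ts" ]; then
    return 0
  fi
  return 1
}

# Example with fixed timestamps: cache written at 01:00, cutoff at 03:00.
if cache_is_expired 20200401010000 20200401030000; then
  echo "cache expired, re-clone"
else
  echo "cache still fresh"
fi
```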
CI_PRE_CLONE_SCRIPT variable contents:
# Handle caching/retrieval of repo in local runners persistent cache
# See https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/44591
#
# This script assumes the following CI variables are set:
# GIT_STRATEGY: "fetch" (or set it in the Project CI/CD Settings GUI)
# GIT_SUBMODULE_STRATEGY: "none"
# GIT_DEPTH: "10" (or whatever)
#
# _CACHE_STATUS is a write mutex/lock used to:
# 1. Prevent multiple jobs from trying to clone the repo at the same time
# 2. Ensure re-cloning of incomplete clones due to error or job being terminated
# _CACHE_STATUS values are:
# * 'not-cloned'
# * 'cloning'
# * 'cloned'
# * 'failed'
# IMPORTANT NOTE: Because of the way this script is eval'd,
# (see https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/2312/diffs)
# locally declared variables must be referenced with a double dollar sign,
# e.g. $$_REPO_CACHE_DIR. Global CI variables use a single dollar sign,
# e.g. $CI_PROJECT_NAME
# ENABLE_CI_PRE_CLONE_SCRIPT is a "feature flag" to only run the script if the build opts-in by setting the variable
if [ "$ENABLE_CI_PRE_CLONE_SCRIPT" = 'true' ]; then
  _REPO_CACHE_DIR="/persistent-cache/$CI_PROJECT_NAME"
  _CACHE_STATUS_FILE="/persistent-cache/$CI_PROJECT_NAME-mutex"
  if [ -f "$$_CACHE_STATUS_FILE" ]; then
    _CACHE_STATUS=`cat "$$_CACHE_STATUS_FILE"`
    echo "Starting CI_PRE_CLONE_SCRIPT, $$_CACHE_STATUS_FILE contains $$_CACHE_STATUS"
  else
    _CACHE_STATUS='not-cloned'
    echo "Starting CI_PRE_CLONE_SCRIPT, _CACHE_STATUS is $$_CACHE_STATUS"
  fi
  # If the persistent-cache directory does not exist on this runner, just create
  # it locally so we don't fail.
  if [ ! -d "/persistent-cache" ]; then
    echo "/persistent-cache directory does not exist, creating it..."
    mkdir -p "/persistent-cache"
  fi
  # If cache does not exist or is in a failed state, create it
  if [ "$$_CACHE_STATUS" = 'not-cloned' ] || [ "$$_CACHE_STATUS" = 'failed' ]; then
    echo "Repository cache dir is in $$_CACHE_STATUS status, cloning repo to $$_REPO_CACHE_DIR..."
    _CACHE_STATUS='cloning'
    echo "$$_CACHE_STATUS" > "$$_CACHE_STATUS_FILE"
    # Delete any existing dir which may exist from a previously incomplete or killed job
    rm -rf "$$_REPO_CACHE_DIR"
    date -u
    if git clone --depth $GIT_DEPTH --progress "$CI_REPOSITORY_URL" "$$_REPO_CACHE_DIR"; then
      date -u
      # This speeds up the fetch that the jobs do later
      echo "Repository successfully cloned. Fetching master..."
      cd "$$_REPO_CACHE_DIR"
      git fetch --prune --depth $GIT_DEPTH origin master
      _FETCHED='true'
      _CACHE_STATUS='cloned'
      echo "$$_CACHE_STATUS" > "$$_CACHE_STATUS_FILE"
      echo "Repository successfully cached to $$_REPO_CACHE_DIR."
    else
      echo "Cloning of repo to the repository cache dir failed, this job will have to do a regular remote clone..."
      _CACHE_STATUS='failed'
      echo "$$_CACHE_STATUS" > "$$_CACHE_STATUS_FILE"
      rm -rf "$$_REPO_CACHE_DIR"
    fi
  fi
  if [ "$$_CACHE_STATUS" = 'cloning' ]; then
    echo "Another job is currently cloning the repo to the repository cache dir $$_REPO_CACHE_DIR, this job will have to do a regular remote clone..."
  fi
  if [ "$$_CACHE_STATUS" = 'cloned' ]; then
    # Skip the refresh if this job just cloned and fetched the cache itself
    if [ ! "$$_FETCHED" = 'true' ]; then
      date -u
      echo "Fetching master to refresh cache..."
      cd "$$_REPO_CACHE_DIR"
      git fetch --prune --depth $GIT_DEPTH origin master
    fi
    if [ -d "$CI_PROJECT_DIR/.git" ]; then
      date -u
      echo "A git repo already exists in $CI_PROJECT_DIR, not copying repository cache."
      _EXISTING_REPO='true'
    else
      # Copy the repo cache dir to the CI project dir
      date -u
      echo "Removing default $CI_PROJECT_DIR..."
      rm -rf "$CI_PROJECT_DIR"
      date -u
      echo "Copying repository cache from $$_REPO_CACHE_DIR into $CI_PROJECT_DIR..."
      cp -r "$$_REPO_CACHE_DIR" "$CI_PROJECT_DIR"
      date -u
      echo "Changing permissions to a+w for $CI_PROJECT_DIR..."
      chmod a+w "$CI_PROJECT_DIR"
    fi
    date -u
  fi
  echo "Finished CI_PRE_CLONE_SCRIPT"
  date -u
fi
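The status file in the script above is checked and then written in two separate steps, so two jobs that start at nearly the same moment could both read 'not-cloned' and both begin cloning. One common way to close that gap is to use mkdir as the lock, since creating a directory either succeeds or fails as a single operation. The sketch below is a standalone illustration, not part of this MR: the function names and lock path are hypothetical, it uses a single dollar sign because it is not eval'd through the CI variable, and it does not handle a stale lock left behind by a killed job.

```shell
#!/bin/sh
# Try to take the lock. mkdir either creates the directory or fails,
# so only one of several concurrent callers can succeed.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# Release the lock so a later job can take it.
release_lock() {
  rmdir "$1"
}

# Hypothetical lock location in a throwaway directory for the demo;
# on a runner it could sit next to the cache under /persistent-cache.
work_dir=$(mktemp -d)
lock_dir="$work_dir/repo-cache.lock"

if acquire_lock "$lock_dir"; then
  echo "first job: lock acquired, would clone into the cache here"
fi
if ! acquire_lock "$lock_dir"; then
  echo "second job: lock is held, falling back to a regular clone"
fi
release_lock "$lock_dir"
rm -rf "$work_dir"
```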
Closes #6940 (closed)
/label ~"group::static site editor" ~backstage