WIP: Resolve "Cache Cloned Repo on Runner"

NOTE: See the epic for more context on the effort to reduce repo cloning time.

UPDATE

This third (and final) attempt at a caching-based approach to reducing the repo clone times was also unsuccessful, for the following reasons:

  1. The repo size (6 GB) is such that even copying the data on local disk (from the runner-mounted cache dir to the docker job instance dir) can take a significant amount of time. The amount of time also varies (presumably based on IO load on the instance). In testing, it took anywhere from 20 seconds (if the docker image already has it cached) to over 3 minutes. See some sample timings below.
  2. The lifetime and reuse schedule of the runners is such that cache misses are frequent, possibly even more frequent than cache hits. This is because the runner pool prefers "newer" runners to older ones, so jobs are more likely to get a runner which does not yet have the persistent repo cache available and has to pull it down fresh. This time, coupled with the additional time to copy the cache to the instance, results in an overall longer time than just letting each job pull down the repo.

However, one thing that was discovered as part of the work in this MR was that the GIT_STRATEGY is defaulting to clone instead of fetch. Switching to fetch has the potential to avoid the full clone on many job runs. See #7035 (closed) for more details.

SAMPLE TIMINGS

without runner caching:
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494191515 - 3:52

with runner caching and cache miss:
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494317464 - 5:40
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494315115 - 4:32

with runner caching and cache hit:
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494313795 - 2:34
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494359949 - 3:01
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494431284 - 2:32
https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/494463828 - 2:26
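
To make the comparison between the three groups easier to read, here is a minimal sketch that averages the durations listed above. The helper names `to_seconds` and `average_seconds` are made up for this illustration and are not part of the MR; the sample is small, so the averages are only indicative.

```shell
#!/bin/sh
# Convert an m:ss duration (as listed in the sample timings) to whole seconds.
to_seconds() {
  minutes=${1%%:*}
  seconds=${1##*:}
  # Strip one leading zero so values like "08" are not parsed as octal.
  echo $(( minutes * 60 + ${seconds#0} ))
}

# Print the integer average, in seconds, of all m:ss arguments.
average_seconds() {
  total=0
  count=0
  for t in "$@"; do
    total=$(( total + $(to_seconds "$t") ))
    count=$(( count + 1 ))
  done
  echo $(( total / count ))
}

echo "without runner caching: $(average_seconds 3:52)s"
echo "cache miss:             $(average_seconds 5:40 4:32)s"
echo "cache hit:              $(average_seconds 2:34 3:01 2:32 2:26)s"
```

On these samples the averages come out to 232s, 306s, and 158s respectively. A hit saves roughly as much as a miss costs, so caching would only pay off if hits clearly outnumbered misses, which matches the conclusion in the UPDATE section.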

DESCRIPTION

For each job in the www-gitlab-com CI/CD build, the git repo clone currently takes between a minute and a half and two minutes, because the repo is very large and downloading it takes a lot of network time.

If we cache the git clone locally on the runners, and then copy and only pull new commits since the last image was built, this time could be greatly reduced.

See more details in this slack thread.

BACKGROUND

This is the third attempt. The two prior attempts were:

  1. caching via docker image
  2. object-store-based caching

However, these were not successful, primarily due to the time it takes to download the actual volume of repo data over the network, regardless of how it is packaged.

There is more context on the prior attempts in those issues.

Thus, this approach attempts to avoid that by eliminating the full repo network download from each individual job, and instead only doing it once when the runner is provisioned, or only on the first job which runs on the runner.

IMPLEMENTATION

This implementation will leverage some of the same approaches used in the object-store-based caching approach, specifically the pre-clone-script approach.

Notes from Slack Thread

slack thread

  • shared-runners-manager-X and gitlab-shared-runners-manager-X runners are available for any users of GitLab.com. That's why creating a host-mounted volume there is not an option.
  • If we limit jobs to run only on prmX, then we can try it.
  • MR to Add persistent cache volume for gitlab-com prmX runners
    • "The volume and the size of www-gitlab-com will cause an increased usage of disk on the autoscaled VMs. The positive side is that these particular workers are not heavily used. Anyway leaving such note here, for case when it would cause problems :)"
    • "These are private runners used only by gitlab-com group on GitLab.com, so having the persistent, host-mounted volume doesn't create any security problem."
  • The volume mount is defined as: /persistent-cache:/persistent-cache, so you should be able to use /persistent-cache in the pre-clone-script to download the initial repo copy or just use the existing one.
  • Changes were just uploaded to chef. Within the next 30 minutes the change will be applied to config.toml on the Runners. However, because the autoscaled VMs are re-used on these managers, it may need ~24 hours to be fully rolled out on the VMs.
  • The eval $CI_PRE_CLONE_SCRIPT hack is still there, so you should be able to start from here
  • To ensure all our runners have the CI_PRE_CLONE_SCRIPT hack, we need to pin the jobs to the prmX runners. This means we should set this for all jobs:
    tags:
    - gitlab-org
    - docker
  • Only the prmX Runners for the gitlab-com group have the special volume mounted
  • In our case we have three groups of Runners:
    • gitlab-shared-runners-manager-X that have only the gitlab-org tag.
    • private-runners-manager-X that have gitlab-org and docker.
    • shared-runners-manager-X that don't have the gitlab-org tag at all.
  • To be sure that the job will be running on prmX we need to use gitlab-org (to exclude srmX runners) and docker (to exclude gsrmX)
  • Question: What are the underlying VM types of the private-runners-manager-x.gitlab.com runners (i.e. CPU and RAM)?
    • It's n1-standard-2 in GCP. Which means 2 vCPU and 7 GB
    • They also use 50 GB SSD disks

TASKS

  • Set up CI_PRE_CLONE_SCRIPT variable - see documentation here and required contents below (note this is not yet merged or available on live docs site)
  • Do not fail if script runs on a runner which doesn't have persistent-cache directory configured.
  • Add cache expiration timestamp with format date "+%Y%m%d%H%M%S", and an env var REPO_CACHE_EXPIRE_HOURS to control it. Default to 12.
  • Add an optional REPO_CACHE_EXPIRE_TIMESTAMP env var, with a format of date "+%Y%m%d%H%M%S". This can be used to force immediate expiration of cache.
  • Ensure all jobs are tagged with gitlab-org AND docker, to ensure they run on the prmX (private-runners-manager-X) runners.
  • The CI_PRE_CLONE_SCRIPT doesn't always echo output when it is failing...
    • This was because the jobs weren't tagged to run on the correct runners.
  • Handle case if job is terminated or killed while cloning
  • Prevent two jobs from trying to clone to the cache at the same time
  • Should fall back to regular cloning if another job is currently writing the clone
  • Should fall back to regular cloning if the clone to cache fails for any reason
  • Are we sure that having only two prmX runners for all of the www-gitlab-com CI/CD pipelines will be sufficient? Are there any metrics to monitor to watch out for resource overload, excessive queuing, etc?
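
The cache expiration tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation from this MR: the function names `cache_is_expired` and `expiry_cutoff` are hypothetical, the cache creation timestamp is assumed to have been recorded somewhere when the cache was cloned, REPO_CACHE_EXPIRE_TIMESTAMP is assumed to mean "treat any cache created before this time as expired", and the cutoff calculation relies on GNU `date -d`. The sketch is written as a standalone script, so it uses single dollar signs rather than the `$$` form needed inside the CI variable.

```shell
#!/bin/sh
# Default expiry window in hours, as proposed in the task list.
REPO_CACHE_EXPIRE_HOURS="${REPO_CACHE_EXPIRE_HOURS:-12}"

# cache_is_expired CREATED CUTOFF [FORCED]
# All arguments use the date "+%Y%m%d%H%M%S" format, which sorts numerically.
# Returns 0 (expired) if the cache was created before the cutoff, or before
# the optional forced-expiry timestamp; returns 1 otherwise.
cache_is_expired() {
  _created=$1
  _cutoff=$2
  _forced=${3:-}
  if [ "$_created" -lt "$_cutoff" ]; then
    return 0
  fi
  if [ -n "$_forced" ] && [ "$_created" -lt "$_forced" ]; then
    return 0
  fi
  return 1
}

# Print the timestamp N hours in the past (assumes GNU date).
expiry_cutoff() {
  date -u -d "$1 hours ago" "+%Y%m%d%H%M%S"
}

# Demo: a cache created "now" should be fresh unless a forced expiry applies.
# The block is skipped on systems where date -d is not supported.
if cutoff=$(expiry_cutoff "$REPO_CACHE_EXPIRE_HOURS" 2>/dev/null); then
  created=$(date -u "+%Y%m%d%H%M%S")
  if cache_is_expired "$created" "$cutoff" "${REPO_CACHE_EXPIRE_TIMESTAMP:-}"; then
    echo "cache expired, would re-clone"
  else
    echo "cache still fresh"
  fi
fi
```

Because the timestamp format is fixed-width and ordered from year down to second, a plain numeric comparison is enough and no date parsing is needed for the check itself.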

CI_PRE_CLONE_SCRIPT variable contents:

# Handle caching/retrieval of repo in the local runner's persistent cache
# See https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/44591
#
# This script assumes the following CI variables are set:
#  GIT_STRATEGY: "fetch" (or set it in the Project CI/CD Settings GUI)
#  GIT_SUBMODULE_STRATEGY: "none"
#  GIT_DEPTH: "10" (or whatever)
#
# _CACHE_STATUS is a write mutex/lock used to:
# 1. Prevent multiple jobs from trying to clone the repo at the same time
# 2. Ensure re-cloning of incomplete clones due to error or job being terminated
# _CACHE_STATUS values are:
# * 'not-cloned'
# * 'cloning'
# * 'cloned'
# * 'failed'

# IMPORTANT NOTE: Because of the way this script is eval'd,
# (see https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/2312/diffs)
# locally declared variables must be referenced with a double dollar sign,
# e.g. $$_REPO_CACHE_DIR.  Global CI variables use a single dollar sign,
# e.g. $CI_PROJECT_NAME

# ENABLE_CI_PRE_CLONE_SCRIPT is a "feature flag" to only run the script if the build opts-in by setting the variable
if [ "$ENABLE_CI_PRE_CLONE_SCRIPT" = 'true' ]; then

_REPO_CACHE_DIR="/persistent-cache/$CI_PROJECT_NAME"
_CACHE_STATUS_FILE="/persistent-cache/$CI_PROJECT_NAME-mutex"

if [ -f $$_CACHE_STATUS_FILE ]; then
  _CACHE_STATUS=`cat $$_CACHE_STATUS_FILE`
  echo "Starting CI_PRE_CLONE_SCRIPT, $$_CACHE_STATUS_FILE contains $$_CACHE_STATUS"
else
  _CACHE_STATUS='not-cloned'
  echo "Starting CI_PRE_CLONE_SCRIPT, _CACHE_STATUS is $$_CACHE_STATUS"
fi

# If the persistent-cache directory does not exist on this runner, just create
# it locally so we don't fail.
if [ ! -d "/persistent-cache" ]; then
  echo "/persistent-cache directory does not exist, creating it..."
  mkdir -p "/persistent-cache"
fi

# If cache does not exist or is in a failed state, create it
if [ $$_CACHE_STATUS = 'not-cloned' ] || [ $$_CACHE_STATUS = 'failed' ]; then
  echo "Repository cache dir is in $$_CACHE_STATUS status, cloning repo to $$_REPO_CACHE_DIR..."
  _CACHE_STATUS='cloning'
  echo $$_CACHE_STATUS > $$_CACHE_STATUS_FILE

  # Delete any existing dir which may exist from a previously incomplete or killed job
  rm -rf $$_REPO_CACHE_DIR

  date -u
  if git clone --depth $GIT_DEPTH --progress $CI_REPOSITORY_URL $$_REPO_CACHE_DIR; then
    date -u
    # This speeds up the fetch that the jobs do later
    echo "Repository successfully cloned.  Fetching master..."
    cd $$_REPO_CACHE_DIR
    git fetch --prune --depth $GIT_DEPTH origin master
    _FETCHED='true'
    _CACHE_STATUS='cloned'
    echo $$_CACHE_STATUS > $$_CACHE_STATUS_FILE
    echo "Repository successfully cached to $$_REPO_CACHE_DIR."
  else
    echo "Cloning of repo to the repository cache dir failed, this job will have to do a regular remote clone..."
    _CACHE_STATUS='failed'
    echo $$_CACHE_STATUS > $$_CACHE_STATUS_FILE
    rm -rf $$_REPO_CACHE_DIR
  fi
fi

if [ $$_CACHE_STATUS = 'cloning' ]; then
  echo "Another job is currently cloning the repo to the repository cache dir $$_REPO_CACHE_DIR, this job will have to do a regular remote clone..."
fi

if [ $$_CACHE_STATUS = 'cloned' ]; then
  if [ ! "$$_FETCHED" = 'true' ]; then
    date -u
    echo "Fetching master to refresh cache..."
    cd $$_REPO_CACHE_DIR
    git fetch --prune --depth $GIT_DEPTH origin master
  fi

  if [ -d $CI_PROJECT_DIR/.git ]; then
    date -u
    echo "A git repo already exists in $CI_PROJECT_DIR, not copying repository cache."
    _EXISTING_REPO='true'
  else
    # Copy the repo cache dir to the CI project dir
    date -u
    echo "Removing default $CI_PROJECT_DIR..."
    rm -rf $CI_PROJECT_DIR

    date -u
    echo "Copying repository cache from $$_REPO_CACHE_DIR into $CI_PROJECT_DIR..."
    cp -r $$_REPO_CACHE_DIR $CI_PROJECT_DIR

    date -u
    echo "Changing permissions to a+w for $CI_PROJECT_DIR..."
    chmod a+w $CI_PROJECT_DIR
  fi

  date -u
fi

echo "Finished CI_PRE_CLONE_SCRIPT"
date -u

fi
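
One caveat with the _CACHE_STATUS file used in the script above: it is read and then written in separate steps, so two jobs starting at nearly the same moment could both see 'not-cloned' and both begin cloning. A possible alternative, shown here only as a sketch and not as what this MR does, is to use `mkdir` as the lock, since creating a directory either succeeds or fails atomically. The names `acquire_lock`, `release_lock`, and the lock path are hypothetical.

```shell
#!/bin/sh
# acquire_lock DIR: succeeds only for the first caller, because mkdir fails
# if the directory already exists.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# release_lock DIR: remove the lock directory so a later job can take it.
release_lock() {
  rmdir "$1" 2>/dev/null
}

# Demo in a throwaway directory rather than the real /persistent-cache.
demo_root=$(mktemp -d)
lock_dir="$demo_root/repo-cache.lock"

if acquire_lock "$lock_dir"; then
  echo "first job: lock acquired, would clone into the cache"
fi

if acquire_lock "$lock_dir"; then
  echo "second job: unexpectedly acquired the lock"
else
  echo "second job: lock is held, would fall back to a regular clone"
fi

release_lock "$lock_dir"
rmdir "$demo_root"
```

A job that is killed while holding the lock would still leave it behind. Releasing it from a `trap ... EXIT` handler covers normal exits and most signals, but not a hard kill of the container, so some form of stale-lock expiry would likely still be needed.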

Closes #6940 (closed)

/label ~"group::static site editor" ~backstage

Edited by Chad Woolley
