The fetch GIT_STRATEGY should be more robust and fall back on clone on any failure

Problem Statement

We tried GIT_STRATEGY: fetch in a sizeable project with several submodules and measured an excellent speedup of the initialization process: 2-3 seconds vs. 1 minute on Linux. In turn, this allowed us to introduce more parallelism into the CI pipeline.

However, we noticed that on occasion the local git tree would become corrupt, rendering the fetch strategy inoperable. We would then need to clean up the tree manually. We believe this happens when the git fetch process is interrupted by a cancellation of the pipeline, which we do fairly often.
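
For illustration, one visible symptom of an interrupted git process can be leftover lock files. The following is a minimal sketch, assuming the lock file names that our workaround removes (index.lock, shallow.lock, HEAD.lock). The helper find_stale_locks is hypothetical and is not part of GitLab Runner or git, and it would not detect other kinds of corruption.

```shell
# Sketch: report leftover git lock files in a working tree.
# find_stale_locks is a hypothetical helper used only for illustration.
find_stale_locks() {
  local tree="$1" lock found=1
  for lock in index.lock shallow.lock HEAD.lock; do
    if [ -e "$tree/.git/$lock" ]; then
      echo "$tree/.git/$lock"
      found=0
    fi
  done
  # Returns 0 if at least one lock file exists, 1 otherwise.
  return "$found"
}

# Demo on a throwaway directory that imitates an interrupted fetch.
demo_tree=$(mktemp -d)
mkdir -p "$demo_tree/.git"
touch "$demo_tree/.git/index.lock"
if find_stale_locks "$demo_tree"; then
  echo "stale locks found, a plain fetch would likely fail"
fi
```

A check like this only narrows down the cause; the request in this issue is for the runner to recover by itself regardless of the cause.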

Our workaround is to set GIT_STRATEGY: none and implement our own git routine (below).

Reach

Affects Sasha (Software Developer), Devon (DevOps Engineer), Sidney (Systems Administrator), Rachel (Release Manager).

6.0 = Impacts a large percentage (~50% to ~80%) of the above.

Impact

2.0 = High impact

Fixing this problem would make the fetch strategy the best choice in both speed and reliability, with no trade-off. Currently people need to choose between the safe but slow clone and the fast but unreliable fetch.

Confidence

80% = Medium confidence

We could not identify a configuration problem on our side for the failures of the fetch strategy.

Effort

This is less than a week's worth of coding work. Here is our workaround. The try_fetch function mimics the fetch strategy, but on any failure it returns an error, which triggers removal of the directory and reinitialization from scratch:

variables:
  GIT_FETCH_CODE: |-
    function try_fetch {
      rm -f .git/index.lock .git/shallow.lock .git/HEAD.lock .git/hooks/post-checkout || return 1
      git remote add origin $$CI_REPOSITORY_URL 2>/dev/null || true
      git remote set-url origin $$CI_REPOSITORY_URL || return 2
      git fetch origin +refs/pipelines/$$CI_PIPELINE_ID:refs/pipelines/$$CI_PIPELINE_ID '+refs/heads/*:refs/remotes/origin/*' '+refs/tags/*:refs/tags/*' --force --prune --quiet || return 3
      git checkout -f -q $$CI_COMMIT_SHA || return 4
      git clean -ffdx --exclude=artifacts || return 5
      git submodule sync --recursive || return 6
      # Run these commands in parallel, maximum 8 processes - faster on Windows than serial run (same speed on Linux).
      git submodule foreach --recursive --quiet pwd | xargs -n1 -P8 -I{} sh -c "cd {} && git clean -ffxd && git reset --hard" || return 7
      # This command takes the longest time. Using parallel xargs helps by about 18%.
      git submodule foreach --quiet pwd | xargs -n1 -P8 git submodule update --force --init --recursive || return 8
    }
    SECONDS=0
    set +e
    try_fetch
    RESULT=$$?
    set -e
    if [[ $$RESULT != 0 ]]; then
      echo >&2 "Fetching failed with code $$RESULT, initiating a full clone..."
      # Ensure the artifacts dir exists even if empty, simplifies the logic that follows.
      mkdir -p artifacts
      # Save artifacts and restore them after getting everything from git.
      T=$$(mktemp --directory)
      mv artifacts $$T/
      cd ..
      # Create the dir transactionally - do everything in $$CI_PROJECT_DIR.deleteme, then mv upon success
      rm -rf $$CI_PROJECT_DIR/ $$CI_PROJECT_DIR.deleteme/
      mkdir -p "$$CI_PROJECT_DIR.deleteme/$$CI_PROJECT_NAME.tmp/git-template"
      git config -f "$$CI_PROJECT_DIR.deleteme/$$CI_PROJECT_NAME.tmp/git-template/config" fetch.recurseSubmodules false
      git config -f "$$CI_PROJECT_DIR.deleteme/$$CI_PROJECT_NAME.tmp/git-template/config" "http.https://git.symmetry.dev.sslCAInfo" "$$CI_SERVER_TLS_CA_FILE"
      git init "$$CI_PROJECT_DIR.deleteme" --template "$$CI_PROJECT_DIR.deleteme/$$CI_PROJECT_NAME.tmp/git-template"
      cd $$CI_PROJECT_DIR.deleteme
      try_fetch
      cd ..
      # The following atomic mv commits the transaction.
      mv $$CI_PROJECT_DIR.deleteme $$CI_PROJECT_DIR
      cd $$CI_PROJECT_DIR
      mv $$T/artifacts .
      rm -r $$T/
    fi
    echo "Getting files from git: done in $$SECONDS seconds."
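
The control flow of the script above can be hard to see among the git details. Here is a minimal sketch of just the pattern: try the fast path, and on any failure rebuild in a staging directory and swap it in with a single mv. The functions fast_update, full_rebuild and sync_tree are hypothetical stand-ins (no git is involved), and the CORRUPT marker file is only a way to simulate a broken tree.

```shell
# Sketch: fast path with a transactional fallback.
# fast_update and full_rebuild are hypothetical stand-ins for the git steps.
fast_update() {
  # Stand-in for the incremental fetch: fails if the tree is marked corrupt.
  [ ! -e "$1/CORRUPT" ] && echo "updated" > "$1/state"
}

full_rebuild() {
  # Stand-in for git init + fetch into an empty directory.
  mkdir -p "$1" && echo "rebuilt" > "$1/state"
}

sync_tree() {
  local target="$1"
  local staging="$target.deleteme"
  if fast_update "$target"; then
    return 0
  fi
  echo "fast update failed, rebuilding from scratch" >&2
  rm -rf "$target" "$staging"
  full_rebuild "$staging" || return 1
  # The rename is the commit point: the target only reappears once complete.
  mv "$staging" "$target"
}

# Demo: a tree marked as corrupt is rebuilt instead of updated.
work=$(mktemp -d)
mkdir -p "$work/project"
touch "$work/project/CORRUPT"
sync_tree "$work/project"
cat "$work/project/state"
```

If the rebuild is interrupted, only the .deleteme directory is left behind, and the next run removes it before starting over.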

To invoke the code, we use:

before_script:
  - eval "$GIT_FETCH_CODE"

on Linux machines, and:

before_script:
  - echo "$GIT_FETCH_CODE" | & "C:\Program Files\Git\bin\bash.exe"

on Windows machines.

This workaround still has problems with artifacts (addressed by the script) and cache (not addressed by the script). They will be discussed in a subsequent issue.
