Authentication failure when updating submodules using fetch strategy with shared build directories

Summary

The change introduced in !3134 (merged) (released in 14.4.0) causes an authentication failure when a job reuses an existing build directory ("Reinitialized existing Git repository") and the project contains submodules that require a fetch to check out the appropriate ref. This only occurs when FF_ENABLE_JOB_CLEANUP is enabled on the runners, as !3134 (merged) specifically changed that feature flag's behavior.

Steps to reproduce

  1. Deploy a runner in a configuration that uses shared storage, allowing jobs using the fetch strategy to reuse a repository clone from a previous job.

  2. Run a job from project 'A', which has submodule 'B', with a fetch strategy of normal

  3. Push new commit to submodule 'B'

  4. Update submodule 'B' in project 'A' to point at the new ref in submodule 'B'

  5. Run a job from project 'A', observing that the job log includes "Reinitialized existing Git repository" and that the job runs in the exact same build directory as the previous job from step 2. This job will fail with an error similar to the one shown in the logs below.

Actual behavior

Job fails with an authentication error. Inspecting the git config files after step 5 above reveals an old, expired token in the submodule's config. For example:

.git/config

[remote "origin"]
	url = https://gitlab-ci-token:CURRENT_CI_JOB_TOKEN@git.example.com/my/project.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[submodule "vendor"]
	active = true
	url = https://gitlab-ci-token:CURRENT_CI_JOB_TOKEN@git.example.com/my/project-vendor-mod.git

.git/modules/vendor/config

[remote "origin"]
	url = https://gitlab-ci-token:EXPIRED_CI_JOB_TOKEN@git.example.com/my/project-vendor-mod.git
	fetch = +refs/heads/*:refs/remotes/origin/*

Because cleanup removes the .git/config file, there is a slight but important difference in behavior when submodules are present: submodule URLs are missing from the re-initialized .git/config file, since these URLs are normally added to .git/config when git submodule init is run following a fresh clone of a repository.
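A small sketch of that difference, using a throwaway local repository (the path and URL below are illustrative only): deleting .git/config and reinitializing drops the submodule entries, because git init on an existing repository does not recreate them.

```shell
set -e
tmp=$(mktemp -d)

# A repo whose .git/config contains a submodule url entry, as it would
# after `git submodule init` has run in a CI build directory.
git init -q "$tmp/repo"
git -C "$tmp/repo" config submodule.vendor.url https://git.example.com/my/project-vendor-mod.git
git -C "$tmp/repo" config submodule.vendor.url   # prints the url

# Simulate FF_ENABLE_JOB_CLEANUP removing .git/config, then the next job's
# "Reinitialized existing Git repository" step.
rm "$tmp/repo/.git/config"
git -C "$tmp/repo" init -q

# The submodule.vendor.url entry is gone from the fresh config.
! git -C "$tmp/repo" config submodule.vendor.url
```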

The runner executes git submodule sync to ensure that when a submodule's URL changes (including when the value of CI_JOB_TOKEN rotates, as it does for every job), the URL in .git/modules/vendor/config (for example) is updated to match the one found in .git/config. However, since the URL is missing from the .git/config file when the sync command runs, the update never happens, leaving the expired, stale token in the .git/modules/vendor/config file.
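This can be sketched end to end with local repositories standing in for the tokenized HTTPS remotes (all names and "rotated" URLs below are made up). The first sync propagates a URL change because the submodule entry exists in .git/config; after the entry is removed, as cleanup does, sync skips the now-inactive submodule and the stale URL survives:

```shell
set -e
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=ci GIT_AUTHOR_EMAIL=ci@example.com \
       GIT_COMMITTER_NAME=ci GIT_COMMITTER_EMAIL=ci@example.com

# Upstream repo standing in for the 'vendor' submodule.
git init -q "$tmp/vendor-upstream"
git -C "$tmp/vendor-upstream" commit -q --allow-empty -m init

# Superproject with 'vendor' registered and cloned.
git init -q "$tmp/super"
git -C "$tmp/super" -c protocol.file.allow=always \
    submodule add -q "$tmp/vendor-upstream" vendor

# Rotate the URL (stand-in for a fresh CI_JOB_TOKEN) and sync: the submodule's
# own config picks up the new URL because submodule.vendor.url is in .git/config.
git -C "$tmp/super" config -f .gitmodules submodule.vendor.url "$tmp/rotated-1"
git -C "$tmp/super" submodule sync
git -C "$tmp/super/vendor" config remote.origin.url

# Simulate cleanup wiping the submodule entry from .git/config, then another
# rotation: sync now skips the submodule, so the stale URL remains.
git -C "$tmp/super" config --unset submodule.vendor.url
git -C "$tmp/super" config --unset-all submodule.vendor.active 2>/dev/null || true
git -C "$tmp/super" config -f .gitmodules submodule.vendor.url "$tmp/rotated-2"
git -C "$tmp/super" submodule sync
git -C "$tmp/super/vendor" config remote.origin.url   # still the rotated-1 URL
```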

After the sync command has run, runners currently call git submodule update --init to update and/or initialize submodules. This worked well enough when submodules were either new (and didn't require a sync) or were being reinitialized with their config already present in .git/config, having been initialized by a previous execution. Because the init occurs after the sync instead of before it, the [submodule "vendor"] section in .git/config is not present early enough in the job setup script for git submodule sync to do its job, ultimately resulting in expired tokens being used in the attempted submodule fetch.
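The ordering point can be sketched the same way (again with made-up local paths in place of the real tokenized HTTPS URLs; the actual change in the proposed resolution may differ): running git submodule init before git submodule sync re-registers the submodule in .git/config, so sync can propagate the fresh URL into the submodule's own config.

```shell
set -e
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=ci GIT_AUTHOR_EMAIL=ci@example.com \
       GIT_COMMITTER_NAME=ci GIT_COMMITTER_EMAIL=ci@example.com

git init -q "$tmp/vendor-upstream"
git -C "$tmp/vendor-upstream" commit -q --allow-empty -m init
git init -q "$tmp/super"
git -C "$tmp/super" -c protocol.file.allow=always \
    submodule add -q "$tmp/vendor-upstream" vendor

# Simulate cleanup wiping the submodule entry, followed by a URL rotation.
git -C "$tmp/super" config --unset submodule.vendor.url
git -C "$tmp/super" config --unset-all submodule.vendor.active 2>/dev/null || true
git -C "$tmp/super" config -f .gitmodules submodule.vendor.url "$tmp/rotated"

# init first restores [submodule "vendor"] in .git/config; sync can then
# update the submodule's own remote URL to the rotated value.
git -C "$tmp/super" submodule init
git -C "$tmp/super" submodule sync
git -C "$tmp/super/vendor" config remote.origin.url   # the rotated URL
```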

Expected behavior

The job should always complete successfully; cleanup should not result in expired/stale job tokens being used when fetching submodules during job startup.

Relevant logs and/or screenshots

Fetching changes...
Reinitialized existing Git repository in /builds/<runner>/<concurrent>/my/project/.git/
Created fresh repository.
Checking out e1b05cc9 as <redacted>...
Updating/initializing submodules...
Entering 'vendor'
Entering 'vendor'
HEAD is now at 0d6e7569 <redacted>
Submodule 'vendor' (https://gitlab-ci-token:[MASKED]@git.example.com/my/project-vendor-mod.git) registered for path 'vendor'
remote: HTTP Basic: Access denied
fatal: Authentication failed for 'https://git.example.com/my/project-vendor-mod.git/'
Unable to fetch in submodule path 'vendor'; trying to directly fetch 073a7337c0f3e459ae49906fe429bc4a30c327ed:
remote: HTTP Basic: Access denied
fatal: Authentication failed for 'https://git.example.com/my/project-vendor-mod.git/'
Fetched in submodule path 'vendor', but it did not contain 073a7337c0f3e459ae49906fe429bc4a30c327ed. Direct fetching of that commit failed.

Environment description

We began encountering this after upgrading our runners from 14.3.x to 14.5.0 (we skipped over 14.4.0). The runners are deployed via Helm on a GKE cluster, although the issue should be readily reproducible with any of the executors that support shared build directories (and use of the fetch strategy).

slightly redacted configmap.yaml
apiVersion: v1
kind: ConfigMap
data:
  config.template.toml: |
    [[runners]]
        image = "ubuntu:20.04"
        output_limit = 20480
        executor = "kubernetes"
        builds_dir = "/builds"
        environment = [
            "FF_ENABLE_JOB_CLEANUP=true",
        ]
        [runners.custom_build_dir]
            enabled = false
        [runners.kubernetes]
            namespace = "gitlab-runner"
            privileged = false
            allow_privilege_escalation = false
            service_account = "gitlab-runner"
            pull_policy = "if-not-present"
            cpu_limit = "15"
            cpu_limit_overwrite_max_allowed = "15"
            memory_limit = "16Gi"
            memory_limit_overwrite_max_allowed = "16Gi"
            cpu_request = "200m"
            cpu_request_overwrite_max_allowed = "15"
            memory_request = "800Mi"
            memory_request_overwrite_max_allowed = "16Gi"
            helper_cpu_limit = "2"
            helper_memory_limit = "2Gi"
            helper_cpu_request = "200m"
            helper_memory_request = "100Mi"
            service_cpu_limit = "2"
            service_memory_limit = "2Gi"
            service_cpu_request = "200m"
            service_memory_request = "512Mi"
            poll_interval = 5
            poll_timeout = 600
            cleanup_grace_period_seconds = 0
            pod_termination_grace_period_seconds = 600
        [runners.kubernetes.pod_labels]
            "ci_commit_ref_slug" = "$CI_COMMIT_REF_SLUG"
            "ci_job_name" = "$CI_JOB_NAME"
            "ci_job_stage" = "$CI_JOB_STAGE"
            "ci_project_id" = "$CI_PROJECT_ID"
            "ci_project_name" = "$CI_PROJECT_NAME"
            "ci_project_namespace" = "$CI_PROJECT_NAMESPACE"
        [runners.kubernetes.pod_annotations]
            "ci_commit_ref_slug" = "$CI_COMMIT_REF_SLUG"
            "ci_job_name" = "$CI_JOB_NAME"
            "ci_job_stage" = "$CI_JOB_STAGE"
            "ci_project_id" = "$CI_PROJECT_ID"
            "ci_project_name" = "$CI_PROJECT_NAME"
            "ci_project_namespace" = "$CI_PROJECT_NAMESPACE"
            "ci_job_url" = "$CI_JOB_URL"
            "ci_pipeline_url" = "$CI_PIPELINE_URL"
            "ci_project_url" = "$CI_PROJECT_URL"
        [runners.kubernetes.node_selector]
            "node-pool" = "worker"
        [runners.kubernetes.node_tolerations]
            "gitlab-runner=true" = "NoSchedule"
        [[runners.kubernetes.volumes.host_path]]
            name = "repo"
            mount_path = "/builds"
            host_path = "/mnt/stateful_partition/kube-ephemeral-ssd/gitlab-builds"
        [runners.cache]
            Type = "gcs"
            Path = ""
            Shared = true
            [runners.cache.gcs]
                BucketName = "private-gitlab-cache-bucket"
  config.toml: |
    concurrent = 100
    check_interval = 5
    log_level = "info"
    listen_address = ':9252'

Used GitLab Runner version

Reproduced on 14.5.0, 14.5.2 and 14.6.0.

Possible fixes

This is the line of code responsible for the problem: !3134 (diffs)

Proposed resolution: !3265 (merged)

Edited by David Alger