When the s3 AuthenticationType is set to iam, jobs hang in "Skipping Git submodules setup" and "Saving cache for successful job" for 2+ minutes, then continue and finish successfully

Problem

When GitLab Runner runs in a container environment and the S3 cache adaptor's AuthenticationType is set to iam, jobs take 2-3 minutes longer than they previously did whilst the adaptor attempts to authenticate.

We discovered this was due to the Minio client library (that we use for S3 interactions) supporting IMDSv2.

From inside a container, calls to this metadata service can hang for a long time before the client falls back to IMDSv1. The AWS documentation explains why:

In a container environment, if the hop limit is 1, the IMDSv2 response does not return because going to the container is considered an additional network hop. To avoid the process of falling back to IMDSv1 and the resultant delay, in a container environment we recommend that you set the hop limit to 2. For more information, see Configure the instance metadata options.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html

Workarounds

As suggested by AWS, increasing the hop limit to 2 will avoid this problem. An EC2 instance's hop limit can be increased by following https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html#configuring-IMDS-existing-instances.
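
For an existing instance, the hop limit change can be scripted. The following is a minimal sketch that wraps the AWS CLI command `aws ec2 modify-instance-metadata-options`; it assumes the AWS CLI is installed and that the caller has permission to modify the instance. The helper names and the instance ID are placeholders for illustration and do not come from this issue.

```python
import shlex
import subprocess


def hop_limit_command(instance_id: str, hop_limit: int = 2) -> list[str]:
    """Build the AWS CLI call that sets the IMDS PUT response hop limit."""
    # AWS accepts hop limits from 1 to 64.
    if not 1 <= hop_limit <= 64:
        raise ValueError("hop limit must be between 1 and 64")
    return [
        "aws", "ec2", "modify-instance-metadata-options",
        "--instance-id", instance_id,
        "--http-put-response-hop-limit", str(hop_limit),
        "--http-endpoint", "enabled",
    ]


def apply_hop_limit(instance_id: str, hop_limit: int = 2) -> None:
    """Run the command; needs the AWS CLI and suitable EC2 permissions."""
    subprocess.run(hop_limit_command(instance_id, hop_limit), check=True)


if __name__ == "__main__":
    # Placeholder instance ID (assumption); replace with the runner host's ID.
    print(shlex.join(hop_limit_command("i-0123456789abcdef0")))
```

Running the script only prints the command, so it can be reviewed before `apply_hop_limit` is called against a real instance.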

Solutions

The AWS SDK added a 1s timeout for the IMDSv2 call. We've patched Minio upstream to do the same: https://github.com/minio/minio-go/pull/1626.

This will be merged in !3354, but we might decide to wait until Minio have officially released a new version with the bug fix.
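
The idea behind the timeout can be illustrated with a short sketch. This is not the minio-go patch itself, only a hedged Python illustration of bounding the IMDSv2 token request at 1 second so that an unreachable metadata service fails fast instead of stalling the job. The function name and the `base_url` parameter are assumptions added for illustration; the token path and TTL header are those of the IMDSv2 session-token request.

```python
import urllib.error
import urllib.request

# Link-local address of the EC2 instance metadata service.
IMDS_BASE = "http://169.254.169.254"


def fetch_imdsv2_token(base_url: str = IMDS_BASE, timeout: float = 1.0):
    """Return an IMDSv2 session token, or None if the call fails or times out."""
    request = urllib.request.Request(
        base_url + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    # Bypass any configured HTTP proxy: the metadata service is link-local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.read().decode("ascii")
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error responses.
        return None


if __name__ == "__main__":
    token = fetch_imdsv2_token()
    print("IMDSv2 reachable" if token else "IMDSv2 token request failed or timed out")
```

Run from inside a job container on an instance with a hop limit of 1, a probe like this should give up after roughly the timeout rather than hanging, which is the behaviour the fix aims for.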


Original Report

In the GitLab Runner configuration, when the s3 cache AuthenticationType is set to iam, jobs hang at "Skipping Git submodules setup" for about 2-3 minutes. The job then carries on, hangs at "Saving cache for successful job" for another 2 minutes, and finally saves the cache successfully.

I tested this with AuthenticationType = "access_key" instead of iam and did not have this problem. It only happens with the iam authentication type. I tested each authentication type several times and the results were consistent every time. I have attached a screenshot of my last 2 jobs, one with AuthenticationType = "access_key" and the other with AuthenticationType = "iam". There is a 4-minute difference between their durations. This is a very small test job that normally takes ~11-13 seconds.

config.toml
concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "IAM tester for S3 cache"
  url = "https://gitlab.supermunn.com/"
  token = "REDACTED"
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    Type = "s3"
    Shared = false
    [runners.cache.s3]
      #AccessKey = "REDACTED"
      #SecretKey = "REDACTED"
      BucketName = "runner-cache-iam-test"
      BucketLocation = "ca-central-1"
      AuthenticationType = "iam"
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
.gitlab-ci.yml
sbt-build:
  image: openjdk:11-jre-slim
  stage: build
  script:
    - mkdir -p ./target
    - echo "hello" > ./target/hello.txt
  cache:
    key: test-build-cache
    paths:
      - target/hello.txt
  artifacts:
    # pass the distribution created from this job to other job steps
    paths:
      - "target/hello.txt"
  tags:
    #- akili-runner
    - iamfors3
  rules:
    - if: $CI_COMMIT_MESSAGE =~ /knowb4.*/
This is suspected to have started after this merge request

@cbazan1

Edited by Arran Walker