When the s3 AuthenticationType is set to iam, jobs hang in "Skipping Git submodules setup" and "Saving cache for successful job" for 2+ minutes, then continue and finish successfully
Problem
When GitLab Runner runs in a container environment and the S3 cache adaptor's AuthenticationType is set to iam, jobs take 2-3 minutes longer than they previously did while the adaptor attempts to authenticate.
We discovered that this is caused by the Minio client library (which we use for S3 interactions) supporting IMDSv2.
Calls to this metadata service can result in a long delay, as the AWS documentation explains:
In a container environment, if the hop limit is 1, the IMDSv2 response does not return because going to the container is considered an additional network hop. To avoid the process of falling back to IMDSv1 and the resultant delay, in a container environment we recommend that you set the hop limit to 2. For more information, see Configure the instance metadata options.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
Workarounds
As suggested by AWS, increasing the hop limit to 2 avoids this problem. To increase the hop limit of an existing EC2 instance, follow the instructions at: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html#configuring-IMDS-existing-instances.
Solutions
The AWS SDK added a 1s timeout for the IMDSv2 call. We've patched Minio upstream to do the same: https://github.com/minio/minio-go/pull/1626.
This will be merged in !3354 (merged), but we might decide to wait until Minio has officially released a new version that includes the fix.
Original Report
In the GitLab Runner configuration, when the s3 cache AuthenticationType is set to iam, jobs hang at the "Skipping Git submodules setup" step for about 2-3 minutes. The job then carries on, hangs at the "Saving cache for successful job" step for another 2 minutes, and finally saves the cache successfully.
I tested this with AuthenticationType = "access_key" instead of iam and did not have this problem. It only happens with the iam authentication type. I tested each authentication type several times and the results were consistent every time. I have attached a screenshot of my last 2 jobs, one with AuthenticationType = "access_key" and the other with AuthenticationType = "iam". There is a 4-minute difference between their durations. This is a very small test job that normally takes ~11-13 seconds.
config.toml
concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "IAM tester for S3 cache"
  url = "https://gitlab.supermunn.com/"
  token = "REDACTED"
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    Type = "s3"
    Shared = false
    [runners.cache.s3]
      #AccessKey = "REDACTED"
      #SecretKey = "REDACTED"
      BucketName = "runner-cache-iam-test"
      BucketLocation = "ca-central-1"
      AuthenticationType = "iam"
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
.gitlab-ci.yml
sbt-build:
  image: openjdk:11-jre-slim
  stage: build
  script:
    - mkdir -p ./target
    - echo "hello" > ./target/hello.txt
  cache:
    key: test-build-cache
    paths:
      - target/hello.txt
  artifacts:
    # pass the distribution created from this job to other job steps
    paths:
      - "target/hello.txt"
  tags:
    #- akili-runner
    - iamfors3
  rules:
    - if: $CI_COMMIT_MESSAGE =~ /knowb4.*/