S3 cache: "The requested DurationSeconds exceeds the 1 hour session limit for roles assumed by role chaining" with pipeline timeout > 1h
Summary
This issue occurs when GitLab Runner runs with an IAM role (such as an EC2 instance profile or an ECS task role) and UploadRoleARN is configured in the runners.cache.s3 section.
In this case, if the pipeline timeout is configured to more than 1 hour, the following error occurs when the runner tries to assume the role specified by UploadRoleARN:
Unable to generate cache upload environment:
failed to assume role: operation error STS: AssumeRole,
https response error StatusCode: 400, RequestID: ccd3b2b4-24b1-48ac-936d-a208acac655a, api error
ValidationError: The requested DurationSeconds exceeds the 1 hour session limit for roles assumed by role chaining.
The problem is that AWS enforces a 1 hour limit on sessions obtained through role chaining, and this limit is not configurable: even if MaxSessionDuration on the role specified by UploadRoleARN is set to more than 1 hour, a chained session is still limited to 1 hour.
Steps to reproduce
- Run a GitLab runner on EC2/ECS with an IAM instance profile that is allowed to assume the role specified by UploadRoleARN.
- Configure the S3 cache to use UploadRoleARN.
- Ensure that the cache is configured for the job and that the job timeout is > 1h.
- Run the job.
.gitlab-ci.yml
test:
  script: touch cached_file
  timeout: 2h
  cache:
    key: cache
    paths:
      - cached_file
Actual behavior
Assuming the role fails with the error: The requested DurationSeconds exceeds the 1 hour session limit for roles assumed by role chaining.
Expected behavior
Role is assumed and cache is uploaded correctly.
Environment description
Self-hosted runners running on AWS ECS/Fargate.
Used GitLab Runner version
Running with gitlab-runner 17.6.0 (374d34fd)
Using custom executor with driver fargate 0.5.1 (3fe14a6) ...
Possible fixes
The runner is using the timeout value to define the DurationSeconds of the assumed role session: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/main/cache/s3v2/s3.go#L151
The role is assumed just before pushing the cache at the end of the job, and pushing the cache is likely a fast operation (less than 1 hour), so using the whole timeout value as the session duration appears to be excessive.
To resolve the issue, my proposal is simply to cap the session DurationSeconds at 1 hour.
if timeout >= 15*time.Minute && timeout <= 1*time.Hour {
	duration = timeout
}