S3 cache: "The requested DurationSeconds exceeds the 1 hour session limit for roles assumed by role chaining" with pipeline timeout > 1h

Summary

This issue occurs when GitLab Runner runs under an IAM role (such as an EC2 instance profile or an ECS task role) and UploadRoleARN is configured in the runners.cache.s3 section.

In that case, if the pipeline timeout is configured to more than 1 hour, the following error occurs when the runner tries to assume the role specified by UploadRoleARN:

Unable to generate cache upload environment:
failed to assume role: operation error STS: AssumeRole,
https response error StatusCode: 400, RequestID: ccd3b2b4-24b1-48ac-936d-a208acac655a, api error
ValidationError: The requested DurationSeconds exceeds the 1 hour session limit for roles assumed by role chaining.

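The cause is that AWS limits sessions obtained through role chaining to 1 hour, and this limit is not configurable. Because the runner's own credentials already come from an assumed role, assuming the role specified by UploadRoleARN counts as role chaining. Even if MaxSessionDuration on that role is set to more than 1 hour, the chained session is still limited to 1 hour.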

Steps to reproduce

  • Run a GitLab runner on EC2/ECS with an IAM instance profile that is allowed to assume the role specified by UploadRoleARN.
  • Configure the S3 cache to use UploadRoleARN.
  • Ensure that the cache is configured for the job and that the job timeout is > 1h.
  • Run the job.
.gitlab-ci.yml
test:
  script: touch cached_file
  timeout: 2h
  cache:
    key: cache
    paths:
      - cached_file

Actual behavior

Assuming the role fails with the error: The requested DurationSeconds exceeds the 1 hour session limit for roles assumed by role chaining.

Expected behavior

Role is assumed and cache is uploaded correctly.

Environment description

Self-hosted runners running on AWS ECS/Fargate.

Used GitLab Runner version

Running with gitlab-runner 17.6.0 (374d34fd)
Using custom executor with driver fargate 0.5.1 (3fe14a6) ...

Possible fixes

The runner uses the timeout value as the DurationSeconds of the assumed role session: https://gitlab.com/gitlab-org/gitlab-runner/-/blob/main/cache/s3v2/s3.go#L151

Since the role is assumed just before pushing the cache at the end of the job, and pushing the cache is likely a fast operation (less than 1 hour), using the whole timeout value as the session duration appears excessive. To resolve the issue, my proposal is simply to use 1 hour as the maximum session DurationSeconds.

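// Use the timeout as the session duration only when it is between
// 15 minutes and the 1 hour role chaining limit.
if timeout >= 15*time.Minute && timeout <= 1*time.Hour {
	duration = timeout
}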
Edited by Jérémy Goutin