Cache AssumeRole credentials to reduce STS requests (!6549) · Merge requests · GitLab.org / gitlab-runner

What does this MR do?

When RoleARN is configured, the runner previously called STS AssumeRole on every cache upload and download. Under load (e.g. 200 concurrent jobs sharing the same cache key) this produces a burst of identical STS calls that increases latency and risks hitting the STS rate limit.

This change adds an in-process LRU cache for AssumeRole credentials:

Credentials are keyed by (roleARN, bucketName, objectName, upload) and cached for up to 1 hour (the session duration the runner requests, which is also the default maximum session duration for an IAM role). Jobs sharing the same cache key reuse the same credentials without extra STS calls.
The LRU is capped at 1,000 entries (~200 KB) and uses a TTL-based background sweep so entries are evicted after expiry even if never accessed again.
minValidity is capped at 55 minutes so cache hits are always possible within the 1-hour session lifetime, even when the timeout parameter is configured at 1 hour or more. Credentials with less than minValidity remaining are considered stale and trigger a fresh STS call.
Sessions are always requested for the full 1 hour (decoupled from the timeout parameter) to maximise the reuse window.
A double-checked locking pattern around the concurrency semaphore (the cache is checked before acquiring a slot and again after acquiring it) prevents redundant STS calls when multiple goroutines miss the cache simultaneously for the same key.
Caching can be disabled per-runner via DisableAssumeRoleCredentialsCaching in [runners.cache.s3].
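
The lookup path described in the list above can be sketched as follows. This is a minimal illustration, not the runner's code: every type and function name is hypothetical, the STS call is replaced by a fake fetch function, LRU eviction and the TTL sweep are left out, and deriving minValidity as the smaller of the timeout and 55 minutes is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// All names are hypothetical; the values mirror the description above.
const (
	sessionDuration = time.Hour        // sessions are always requested for 1 hour
	maxMinValidity  = 55 * time.Minute // cap that keeps fresh credentials reusable
)

// cacheKey mirrors the (roleARN, bucketName, objectName, upload) tuple.
type cacheKey struct {
	roleARN, bucketName, objectName string
	upload                          bool
}

type credentials struct{ expiresAt time.Time }

type credentialCache struct {
	mu      sync.Mutex
	entries map[cacheKey]credentials
	sem     chan struct{}                                      // concurrency semaphore for in-flight fetches
	fetch   func(cacheKey, time.Duration) (credentials, error) // stands in for STS
}

// cappedMinValidity assumes minValidity is derived from the timeout parameter.
func cappedMinValidity(timeout time.Duration) time.Duration {
	if timeout > maxMinValidity {
		return maxMinValidity
	}
	return timeout
}

// lookup returns a cached credential only if at least minValidity remains.
func (c *credentialCache) lookup(key cacheKey, minValidity time.Duration) (credentials, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	cred, ok := c.entries[key]
	if !ok || time.Until(cred.expiresAt) < minValidity {
		return credentials{}, false
	}
	return cred, true
}

// get checks the cache, waits for a slot, then checks again before fetching.
func (c *credentialCache) get(key cacheKey, timeout time.Duration) (credentials, error) {
	minValidity := cappedMinValidity(timeout)
	if cred, ok := c.lookup(key, minValidity); ok {
		return cred, nil
	}
	c.sem <- struct{}{} // acquire a slot
	defer func() { <-c.sem }()
	if cred, ok := c.lookup(key, minValidity); ok {
		return cred, nil // another goroutine filled the cache while we waited
	}
	cred, err := c.fetch(key, sessionDuration)
	if err != nil {
		return credentials{}, err
	}
	c.mu.Lock()
	c.entries[key] = cred
	c.mu.Unlock()
	return cred, nil
}

// countFetches runs n concurrent gets for one key and counts the fake STS calls.
// A single slot makes the count deterministic; more slots could overlap fetches.
func countFetches(n int, timeout time.Duration) int {
	var mu sync.Mutex
	calls := 0
	fetch := func(_ cacheKey, d time.Duration) (credentials, error) {
		mu.Lock()
		calls++
		mu.Unlock()
		return credentials{expiresAt: time.Now().Add(d)}, nil
	}
	cache := &credentialCache{entries: map[cacheKey]credentials{}, sem: make(chan struct{}, 1), fetch: fetch}
	key := cacheKey{roleARN: "role", bucketName: "your-bucket", objectName: "cache.zip"}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = cache.get(key, timeout)
		}()
	}
	wg.Wait()
	return calls
}

func main() {
	fmt.Println("STS calls:", countFetches(50, 2*time.Hour)) // STS calls: 1
}
```

Without the 55-minute cap, a timeout of 1 hour or more would make a freshly issued 1-hour credential look stale on the very next lookup, so every request would fall through to STS.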

Three Prometheus metrics are added to aid observability:

gitlab_runner_cache_s3_assume_role_cache_hits_total (counter)
gitlab_runner_cache_s3_assume_role_cache_misses_total (counter)
gitlab_runner_cache_s3_assume_role_cached_credentials (gauge)

All new metrics are documented in advanced-configuration.md alongside the existing AssumeRole metrics.

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

Why was this MR needed?

Relates to #39327 (closed)

What's the best way to test this MR?

Configure a RoleARN in the TOML:

  [runners.cache]
    Type = "s3"
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
    RoleARN = "arn:aws:iam::123456789012:role/your-role"
    BucketName = "your-bucket"
    BucketLocation = "us-east-1"

Create a .gitlab-ci.yml with many parallel jobs:

build-job:       # This job runs in the build stage, which runs first.
  stage: build
  script:
    - echo "hello" > test.txt
  cache:
    paths:
      - test.txt
  parallel: 200

Run the runner with --listen-address localhost:9111, such as:

rm -f out/binaries/gitlab-runner-linux-amd64; make out/binaries/gitlab-runner-linux-amd64
./out/binaries/gitlab-runner-linux-amd64 run -c s3.toml --listen-address localhost:9111

Run the pipeline with the .gitlab-ci.yml above, then query the metrics endpoint and filter for s3_assume:

$ curl -s http://localhost:9111/metrics | grep s3_assume
# HELP gitlab_runner_cache_s3_assume_role_cache_hits_total Number of AssumeRole credential cache hits.
# TYPE gitlab_runner_cache_s3_assume_role_cache_hits_total counter
gitlab_runner_cache_s3_assume_role_cache_hits_total 398
# HELP gitlab_runner_cache_s3_assume_role_cache_misses_total Number of AssumeRole credential cache misses (requests that reached STS).
# TYPE gitlab_runner_cache_s3_assume_role_cache_misses_total counter
gitlab_runner_cache_s3_assume_role_cache_misses_total 2
# HELP gitlab_runner_cache_s3_assume_role_cached_credentials Current number of AssumeRole credentials held in the LRU cache.
# TYPE gitlab_runner_cache_s3_assume_role_cached_credentials gauge
gitlab_runner_cache_s3_assume_role_cached_credentials 2
# HELP gitlab_runner_cache_s3_assume_role_duration_seconds Duration of AssumeRole API calls to AWS STS.
# TYPE gitlab_runner_cache_s3_assume_role_duration_seconds histogram
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.05"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.1"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.25"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.5"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="1"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="2.5"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="5"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="10"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="30"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="+Inf"} 2
gitlab_runner_cache_s3_assume_role_duration_seconds_sum 0.089533895
gitlab_runner_cache_s3_assume_role_duration_seconds_count 2
# HELP gitlab_runner_cache_s3_assume_role_requests_in_flight Number of AssumeRole requests to AWS STS in progress.
# TYPE gitlab_runner_cache_s3_assume_role_requests_in_flight gauge
gitlab_runner_cache_s3_assume_role_requests_in_flight 0
# HELP gitlab_runner_cache_s3_assume_role_wait_seconds Wait time to acquire a concurrency slot before an AssumeRole request.
# TYPE gitlab_runner_cache_s3_assume_role_wait_seconds histogram
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.005"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.01"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.025"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.05"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.1"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.25"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.5"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="1"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="2.5"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="5"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="10"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="+Inf"} 2
gitlab_runner_cache_s3_assume_role_wait_seconds_sum 3.307e-06
gitlab_runner_cache_s3_assume_role_wait_seconds_count 2
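
To turn the two counters into a single number, the hit ratio is hits / (hits + misses), which for the run above is 398 / 400 = 0.995. A small sketch of that calculation over the scraped text follows; the helper names metricValue and hitRatio are hypothetical, and the parser only handles unlabelled samples like the ones shown here.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// metricValue returns the value of an unlabelled sample with the given name
// from Prometheus text exposition output. Comment lines are skipped because
// they never consist of exactly two fields starting with the metric name.
func metricValue(metrics, name string) (float64, bool) {
	scanner := bufio.NewScanner(strings.NewReader(metrics))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && fields[0] == name {
			v, err := strconv.ParseFloat(fields[1], 64)
			return v, err == nil
		}
	}
	return 0, false
}

// hitRatio computes hits / (hits + misses) from the two cache counters.
func hitRatio(metrics string) (float64, bool) {
	hits, okHits := metricValue(metrics, "gitlab_runner_cache_s3_assume_role_cache_hits_total")
	misses, okMisses := metricValue(metrics, "gitlab_runner_cache_s3_assume_role_cache_misses_total")
	if !okHits || !okMisses || hits+misses == 0 {
		return 0, false
	}
	return hits / (hits + misses), true
}

// sample is an excerpt of the output shown above.
const sample = `# TYPE gitlab_runner_cache_s3_assume_role_cache_hits_total counter
gitlab_runner_cache_s3_assume_role_cache_hits_total 398
# TYPE gitlab_runner_cache_s3_assume_role_cache_misses_total counter
gitlab_runner_cache_s3_assume_role_cache_misses_total 2
`

func main() {
	ratio, ok := hitRatio(sample)
	fmt.Printf("%.3f %v\n", ratio, ok) // 0.995 true
}
```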

What are the relevant issue numbers?

Edited Mar 24, 2026 by Stan Hu
