Rate-limit and instrument S3 AssumeRole calls
What does this MR do?
AWS blackholes AssumeRole requests when too many are issued concurrently. Cap in-flight calls with a buffered-channel semaphore that defaults to 5 slots. The limit is configurable via `AssumeRoleMaxConcurrency` (`CACHE_S3_ASSUME_ROLE_MAX_CONCURRENCY`); set it to -1 to disable the limit entirely. If the context is canceled while a caller is waiting for a slot, the call returns immediately with the cancellation error.
Add three Prometheus metrics to aid observability:
- `gitlab_runner_cache_s3_assume_role_requests_in_flight` (gauge)
- `gitlab_runner_cache_s3_assume_role_wait_seconds` (histogram)
- `gitlab_runner_cache_s3_assume_role_duration_seconds` (histogram)
Metrics are self-registered via a new cache.RegisterCollector/Collectors API that mirrors the existing FactoriesMap pattern, so multi.go only needs to import the cache package rather than specific adapters.
Relates to #39327 (closed)
Why was this MR needed?
See #39327 (closed)
What's the best way to test this MR?
- Configure an S3 cache with `RoleARN`.
- Set `concurrent = 105` and `request_concurrency = 100`.
- Run this runner binary with `--listen-address`:

```shell
./out/binaries/gitlab-runner-linux-amd64 run -c s3.toml --listen-address localhost:9999
```
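An `s3.toml` along these lines should work (bucket, region, token, and the role ARN are placeholders; whether `AssumeRoleMaxConcurrency` lives in the `[runners.cache.s3]` section is an assumption based on its `CACHE_S3_` environment-variable prefix):

```toml
concurrent = 105

[[runners]]
  name = "s3-cache-test"
  url = "https://gitlab.com"
  token = "REDACTED"
  executor = "docker"
  request_concurrency = 100
  [runners.docker]
    image = "alpine:latest"
  [runners.cache]
    Type = "s3"
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "example-cache-bucket"
      BucketLocation = "us-east-1"
      AuthenticationType = "iam"
      RoleARN = "arn:aws:iam::123456789012:role/example-cache-role"
      AssumeRoleMaxConcurrency = 5
```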
- Launch a pipeline with many jobs, such as:

```yaml
build-job: # This job runs in the build stage, which runs first.
  stage: build
  script:
    - echo "hello" > test.txt
  cache:
    paths:
      - test.txt
  parallel: 200
```

- Check the metrics with `curl -s http://localhost:9999/metrics | grep s3`. Example:
```shell
$ curl -s http://localhost:9999/metrics | grep s3
# HELP gitlab_runner_cache_s3_assume_role_duration_seconds Duration of AssumeRole API calls to AWS STS.
# TYPE gitlab_runner_cache_s3_assume_role_duration_seconds histogram
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.05"} 297
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.1"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.25"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="1"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="2.5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="10"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="30"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="+Inf"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_sum 23.682656039000026
gitlab_runner_cache_s3_assume_role_duration_seconds_count 500
# HELP gitlab_runner_cache_s3_assume_role_requests_in_flight Number of AssumeRole requests to AWS STS currently in flight.
# TYPE gitlab_runner_cache_s3_assume_role_requests_in_flight gauge
gitlab_runner_cache_s3_assume_role_requests_in_flight 0
# HELP gitlab_runner_cache_s3_assume_role_wait_seconds Time spent waiting to acquire a concurrency slot before issuing an AssumeRole request.
# TYPE gitlab_runner_cache_s3_assume_role_wait_seconds histogram
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.005"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.01"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.025"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.05"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.1"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.25"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="1"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="2.5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="10"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="+Inf"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_sum 0.0015058440000000005
gitlab_runner_cache_s3_assume_role_wait_seconds_count 500
```
What are the relevant issue numbers?
#39327 (closed)
Edited by Stan Hu