Rate-limit and instrument S3 AssumeRole calls

What does this MR do?

AWS blackholes AssumeRole requests when too many are issued concurrently. Cap in-flight calls with a buffered-channel semaphore; the default limit is 5. The limit is configurable via AssumeRoleMaxConcurrency (CACHE_S3_ASSUME_ROLE_MAX_CONCURRENCY); set it to -1 to disable the limit entirely. If the context is cancelled while a call is waiting for a slot, an error is returned immediately.

Add three Prometheus metrics to aid observability:

  • gitlab_runner_cache_s3_assume_role_requests_in_flight (gauge)
  • gitlab_runner_cache_s3_assume_role_wait_seconds (histogram)
  • gitlab_runner_cache_s3_assume_role_duration_seconds (histogram)

Metrics are self-registered via a new cache.RegisterCollector/Collectors API that mirrors the existing FactoriesMap pattern, so multi.go only needs to import the cache package rather than specific adapters.
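The self-registration idea can be sketched roughly as follows. This is an illustration only: the real signatures of cache.RegisterCollector and cache.Collectors are not shown in this description, and the `Collector` interface here is a hypothetical stand-in for `prometheus.Collector` so that the example needs only the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// Collector is a stand-in for prometheus.Collector (assumption:
// the real API accepts Prometheus collectors).
type Collector interface {
	Name() string
}

var (
	mu         sync.Mutex
	collectors []Collector
)

// RegisterCollector adds a collector to the package-level registry.
// An adapter would typically call this from its own init().
func RegisterCollector(c Collector) {
	mu.Lock()
	defer mu.Unlock()
	collectors = append(collectors, c)
}

// Collectors returns a copy of everything registered so far, so the
// caller cannot mutate the registry.
func Collectors() []Collector {
	mu.Lock()
	defer mu.Unlock()
	return append([]Collector(nil), collectors...)
}

// named is a trivial Collector used only for this demonstration.
type named string

func (n named) Name() string { return string(n) }

// In the real layout this init() would live in the S3 adapter package.
func init() {
	RegisterCollector(named("gitlab_runner_cache_s3_assume_role_requests_in_flight"))
}

func main() {
	// The consumer iterates over the registry without knowing which
	// adapters contributed to it.
	for _, c := range Collectors() {
		fmt.Println("registering", c.Name())
	}
}
```

The consumer depends only on the registry package, while each adapter registers its own collectors as an import side effect.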

Relates to #39327 (closed)

Why was this MR needed?

See #39327 (closed)

What's the best way to test this MR?

  1. Configure an S3 cache with RoleARN.
  2. Set concurrent = 105 and request_concurrency = 100.
  3. Run this runner binary with --listen-address set:
./out/binaries/gitlab-runner-linux-amd64 run -c s3.toml --listen-address localhost:9999
  4. Launch a pipeline with many jobs, such as:
build-job:       # This job runs in the build stage, which runs first.
  stage: build
  script:
    - echo "hello" > test.txt
  cache:
    paths:
      - test.txt
  parallel: 200
  5. Check the metrics with curl -s http://localhost:9999/metrics | grep s3. Example:
$ curl -s http://localhost:9999/metrics | grep s3
# HELP gitlab_runner_cache_s3_assume_role_duration_seconds Duration of AssumeRole API calls to AWS STS.
# TYPE gitlab_runner_cache_s3_assume_role_duration_seconds histogram
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.05"} 297
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.1"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.25"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="1"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="2.5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="10"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="30"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="+Inf"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_sum 23.682656039000026
gitlab_runner_cache_s3_assume_role_duration_seconds_count 500
# HELP gitlab_runner_cache_s3_assume_role_requests_in_flight Number of AssumeRole requests to AWS STS currently in flight.
# TYPE gitlab_runner_cache_s3_assume_role_requests_in_flight gauge
gitlab_runner_cache_s3_assume_role_requests_in_flight 0
# HELP gitlab_runner_cache_s3_assume_role_wait_seconds Time spent waiting to acquire a concurrency slot before issuing an AssumeRole request.
# TYPE gitlab_runner_cache_s3_assume_role_wait_seconds histogram
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.005"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.01"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.025"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.05"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.1"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.25"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="1"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="2.5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="10"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="+Inf"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_sum 0.0015058440000000005
gitlab_runner_cache_s3_assume_role_wait_seconds_count 500

What are the relevant issue numbers?

Edited by Stan Hu
