Rate-limit and instrument S3 AssumeRole calls
What does this MR do?
AWS blackholes AssumeRole requests when too many are issued concurrently. Cap in-flight calls with a buffered-channel semaphore that defaults to 5 slots. The limit is configurable via `AssumeRoleMaxConcurrency` (`CACHE_S3_ASSUME_ROLE_MAX_CONCURRENCY`); set it to -1 to disable the limit entirely. If the context is canceled while a caller is waiting for a slot, the call returns immediately with the cancellation error.
Add three Prometheus metrics to aid observability:
- `gitlab_runner_cache_s3_assume_role_requests_in_flight` (gauge)
- `gitlab_runner_cache_s3_assume_role_wait_seconds` (histogram)
- `gitlab_runner_cache_s3_assume_role_duration_seconds` (histogram)
Metrics are self-registered via a new cache.RegisterCollector/Collectors API that mirrors the existing FactoriesMap pattern, so multi.go only needs to import the cache package rather than specific adapters.
Relates to #39327 (closed)
Why was this MR needed?
See #39327 (closed)
What's the best way to test this MR?
- Configure an S3 cache with `RoleARN`.
- Set `concurrent = 105` and `request_concurrency = 100`.
- Run this runner binary with `--listen-address`:

```shell
./out/binaries/gitlab-runner-linux-amd64 run -c s3.toml --listen-address localhost:9999
```
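An `s3.toml` along these lines should work (bucket, region, token, and the role ARN are placeholders; whether `AssumeRoleMaxConcurrency` lives in the `[runners.cache.s3]` section is an assumption based on its `CACHE_S3_` environment-variable prefix):

```toml
concurrent = 105

[[runners]]
  name = "s3-cache-test"
  url = "https://gitlab.com"
  token = "REDACTED"
  executor = "docker"
  request_concurrency = 100
  [runners.docker]
    image = "alpine:latest"
  [runners.cache]
    Type = "s3"
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      BucketName = "example-cache-bucket"
      BucketLocation = "us-east-1"
      AuthenticationType = "iam"
      RoleARN = "arn:aws:iam::123456789012:role/example-cache-role"
      AssumeRoleMaxConcurrency = 5
```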
- Launch a pipeline with many jobs, such as:

```yaml
build-job: # This job runs in the build stage, which runs first.
  stage: build
  script:
    - echo "hello" > test.txt
  cache:
    paths:
      - test.txt
  parallel: 200
```

- Check the metrics with `curl -s http://localhost:9999/metrics | grep s3`. Example:
```shell
$ curl -s http://localhost:9999/metrics | grep s3
# HELP gitlab_runner_cache_s3_assume_role_duration_seconds Duration of AssumeRole API calls to AWS STS.
# TYPE gitlab_runner_cache_s3_assume_role_duration_seconds histogram
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.05"} 297
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.1"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.25"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="0.5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="1"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="2.5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="5"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="10"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="30"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_bucket{le="+Inf"} 500
gitlab_runner_cache_s3_assume_role_duration_seconds_sum 23.682656039000026
gitlab_runner_cache_s3_assume_role_duration_seconds_count 500
# HELP gitlab_runner_cache_s3_assume_role_requests_in_flight Number of AssumeRole requests to AWS STS currently in flight.
# TYPE gitlab_runner_cache_s3_assume_role_requests_in_flight gauge
gitlab_runner_cache_s3_assume_role_requests_in_flight 0
# HELP gitlab_runner_cache_s3_assume_role_wait_seconds Time spent waiting to acquire a concurrency slot before issuing an AssumeRole request.
# TYPE gitlab_runner_cache_s3_assume_role_wait_seconds histogram
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.005"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.01"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.025"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.05"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.1"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.25"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="0.5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="1"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="2.5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="5"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="10"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_bucket{le="+Inf"} 500
gitlab_runner_cache_s3_assume_role_wait_seconds_sum 0.0015058440000000005
gitlab_runner_cache_s3_assume_role_wait_seconds_count 500
```
What are the relevant issue numbers?
#39327 (closed)
Edited by Stan Hu