
User's API rate limit affected other CI jobs

Summary

In our self-managed GitLab instance (16.11), we recently ran into a situation where one of our users reached the rate limit, and this somehow affected other, unrelated CI jobs.

Steps to reproduce

  1. Configure rate limits
  2. Run CI jobs that interact with the container_registry using a personal access token
  3. Reach and exceed the limit
  4. Try to run another, unrelated job from the same IP address (for example, runners behind NAT, or both runners on the same node)

What is the current bug behavior?

The user used their own API token in a CI job to experiment with automating the container registry in their personal projects. Because of an inefficient algorithm, the number of requests was higher than the configured limit (default values). The user started to receive HTTP 429 errors as a result, which is expected.

The ingress controller logs clearly indicate that the 429s were returned for the user's account:

"GET /jwt/auth?account=<UID HERE>&scope=repository%3A<REPOSITORY HERE>%3Apull&service=container_registry HTTP/1.1" 429 12 "-" "docker/26.1.3 go/go1.21.10
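To check which account the 429s were attributed to, the ingress log can be tallied by the `account` query parameter of `/jwt/auth` requests. Below is a minimal Python sketch, assuming each log line contains a quoted request line followed by the status code, as in the excerpt above. The function name `count_429s_by_account` is hypothetical, and the real ingress log format may carry extra fields (client IP, timestamps), which the regex simply skips over.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches a quoted request line followed by the status code, e.g.
# "GET /jwt/auth?account=...&service=container_registry HTTP/1.1" 429
# Anything before or after it on the line is ignored.
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def count_429s_by_account(lines):
    """Count HTTP 429 responses to /jwt/auth, grouped by the `account` parameter."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match or match.group("status") != "429":
            continue  # not a request line, or not throttled
        url = urlsplit(match.group("path"))
        if url.path != "/jwt/auth":
            continue  # e.g. Git-over-HTTPS requests from other jobs
        account = parse_qs(url.query).get("account", ["<none>"])[0]
        counts[account] += 1
    return counts
```

Run against the full log, a tally concentrated on a single account would support the reading that the container registry 429s belonged to that one user, while any 429s on `.git` URLs fall outside this count.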

At the same time, however, CI jobs across the GitLab instance started to fail with the following lines in their logs:

Reinitialized existing Git repository in <path here>/.git/
remote: Retry later
fatal: unable to access 'https://<our gitlab url>/<path to project>.git/': The requested URL returned error: 429
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 1

This is not expected, because those jobs are unrelated and belong to other groups and projects.

What is the expected correct behavior?

According to the documentation (https://docs.gitlab.com/ee/administration/settings/user_and_ip_rate_limits.html), the authenticated API request limit should be applied per user. However, the behaviour we observed suggests that the limit is applied per IP address.

Our on-premises agent pool is behind NAT, so the agents most likely share the same external IP address; our GitLab instance is hosted on GCP.

A per-IP limit would explain this behaviour. Could you please clarify how the limit is applied?
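The difference between the documented per-user limit and the suspected per-IP limit can be illustrated with a toy fixed-window limiter. This is a hypothetical Python model, not GitLab's implementation: the limit of 10 requests per window, the user names and the shared NAT address (203.0.113.10, a documentation range) are all assumptions chosen only to show why the choice of key matters.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Toy limiter: at most `limit` requests per key within a single window."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)  # key -> requests seen in this window

    def allow(self, key):
        self.counts[key] += 1
        return self.counts[key] <= self.limit


def simulate(key_fn, limit=10):
    """Replay one window of hypothetical traffic and return statuses per user."""
    limiter = FixedWindowLimiter(limit)
    # Both users' jobs leave through the same NAT address; the experimenting
    # user's burst arrives first, then the unrelated job's requests.
    requests = [("experimenting-user", "203.0.113.10")] * 15 + [
        ("other-user", "203.0.113.10")
    ] * 3
    statuses = defaultdict(list)
    for user, ip in requests:
        statuses[user].append(200 if limiter.allow(key_fn(user, ip)) else 429)
    return statuses


per_user = simulate(lambda user, ip: user)  # key = user (documented behaviour)
per_ip = simulate(lambda user, ip: ip)      # key = source IP (suspected behaviour)
print(per_user["other-user"])  # [200, 200, 200]
print(per_ip["other-user"])    # [429, 429, 429]
```

Keyed by user, the experimenting user's burst only throttles that user; keyed by the shared NAT address, the same burst exhausts the budget for every job behind that address, which matches the symptoms described in this issue.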

Edited by Marcel Amirault