We believe we have identified a bug in permission caching that can sometimes prevent users from pulling images from container registries, which can cause CI builds to fail.
We have identified the feature flag (ci_scoped_job_token) that introduced this behavior earlier today and have disabled it, which resolved the problem.
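For reference, a minimal sketch of what the mitigation looks like in a Rails console (the flag is normally toggled via ChatOps; this is illustrative, not a log of the exact commands run):

```ruby
# Rails console sketch of the mitigation above. In practice the flag was
# toggled through our normal feature flag tooling rather than a console.
Feature.enabled?(:ci_scoped_job_token)  # => true while the new permission model is active
Feature.disable(:ci_scoped_job_token)   # disable the flag (the 02:31 mitigation step)
Feature.enabled?(:ci_scoped_job_token)  # => false once the mitigation is applied
```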
21:25 - Confirmed that only a specific runner was impacted. Posted a second status update (not public facing).
23:11 - Reproduced the issue consistently internally.
2021-07-16
01:18 - Continued investigation into root cause. Escalation to related development team leaders.
01:55 - Updated status page
02:31 - Disabled ci_scoped_job_token feature flag
02:35 - Confirmed permission errors are resolved
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
...
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our Transparency value.
Incident Review
Summary
Service(s) affected:
Team attribution:
Time to detection:
Minutes downtime or degradation:
Metrics
Customer Impact
Who was impacted by this incident? (e.g. external customers, internal customers)
...
What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
...
How many customers were affected?
...
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
...
What were the root causes?
...
Incident Response Analysis
How was the incident detected?
...
How could detection time be improved?
...
How was the root cause diagnosed?
...
How could time to diagnosis be improved?
...
How did we reach the point where we knew how to mitigate the impact?
...
How could time to mitigation be improved?
...
What went well?
...
Post Incident Analysis
Did we have other events in the past with the same root cause?
...
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
...
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
This project was scheduled for deletion, but failed with the following message:

```
PG::SyntaxError: ERROR:  each UNION query must have the same number of columns
LINE 3: ..."namespaces"."id" IN (4909902, 12742051))) SELECT "members"...
                                                              ^
```
```
Running with gitlab-runner 14.0.1 (c1edb478) on qa-runner-1626381200 _4iQRFx4
Resolving secrets                           00:00
Preparing the "docker" executor             00:00
Using Docker executor with image node:14-buster ...
Authenticating with credentials from job payload (GitLab Registry)
Pulling docker image registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-c1edb478 ...
WARNING: Failed to pull image with policy "always": Error response from daemon: pull access denied for registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper, repository does not exist or may require 'docker login': denied: requested access to the resource is denied (manager.go:205:0s)
ERROR: Job failed (system failure): failed to pull image "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-c1edb478" with specified policies [always]: Error response from daemon: pull access denied for registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper, repository does not exist or may require 'docker login': denied: requested access to the resource is denied (manager.go:205:0s)
```
```json
{
  "_index": "pubsub-registry-inf-gprd-001069",
  "_type": "_doc",
  "_id": "fL8OrHoBuGSvJmprhHVH",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2021-07-15T21:24:29.721Z",
    "kubernetes": {
      "region": "us-east1-d",
      "namespace_name": "gitlab",
      "container_image": "dev.gitlab.org:5005/gitlab/charts/components/images/gitlab-container-registry:v3.5.2-gitlab",
      "pod_name": "gitlab-registry-5b8b87df49-jk6g9",
      "host": "gke-gprd-us-east1-d-registry-1-ef830985-18ht",
      "container_name": "registry"
    },
    "publish_time": "2021-07-15T21:24:29.605Z",
    "json": {
      "migrating_repository": false,
      "use_database": false,
      "write_fs_metadata": true,
      "type": "registry",
      "msg": "repository name not known to registry",
      "time": "2021-07-15T21:24:24Z",
      "tag": "gitlab_registry.var.log.containers.gitlab-registry-5b8b87df49-jk6g9_gitlab_registry-606cfa119dac41916b934cfff38d8011389e6220e899755b2a8cebd94371309a.log",
      "root_repo": "gitlab-qa-sandbox-group",
      "vars_name": "gitlab-qa-sandbox-group/qa-test-2021-07-15-20-30-18-fe986954512344f0/npm-project-d2d0657cbc921b3f",
      "shard": "default",
      "environment": "gprd",
      "correlation_id": "01FAP0WVNFY3AE6Q2WEKVBEAGS",
      "tier": "sv",
      "go_version": "go1.16.4",
      "stage": "main",
      "level": "error",
      "code": "NAME_UNKNOWN",
      "detail": "map[name:gitlab-qa-sandbox-group/qa-test-2021-07-15-20-30-18-fe986954512344f0/npm-project-d2d0657cbc921b3f]",
      "auth_user_name": "",
      "error": "name unknown: repository name not known to registry"
    },
    "type": "pubsubbeat-pubsub-registry-inf-gprd-75bfff974f-nw65j",
    "host": { "name": "pubsubbeat-pubsub-registry-inf-gprd-75bfff974f-nw65j" }
  },
  "fields": { "json.time": ["2021-07-15T21:24:24.000Z"] },
  "sort": [1626384264000]
}
```
...where REDACTED is the base64-encoded value of gitlab-ci-token:SOME TOKEN (though I had some trouble using base64 on the command line, so I just used curl -v with http://username:password@host).
It appears that this feature is designed to restrict CI access to that project, but it is also restricting access to public projects.
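For anyone following along, a small Ruby sketch of how that REDACTED value is built (the token value here is a placeholder, not a real job token):

```ruby
require 'base64'

# Illustrative only: 'SOME TOKEN' is a placeholder.
job_token  = 'SOME TOKEN'
credential = Base64.strict_encode64("gitlab-ci-token:#{job_token}")

# This is the value that stands in for REDACTED, i.e. what
# `curl -v http://gitlab-ci-token:<token>@host` sends as the
# `Authorization: Basic <credential>` header.
puts "Authorization: Basic #{credential}"
```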
Another way to reproduce the issue:
```ruby
runner_project = Project.find_by_full_path('gitlab-org/cluster-integration/auto-build-image')
project = Project.find_by_full_path('anton/incident-5174-test')
build = Ci::Build.find(1428929992) # Mark this running if it's not
user = User.find_by_username('anton')

RequestStore.begin! # This is needed for the CI scope token

auth_result = Gitlab::Auth.find_for_git_client('gitlab-ci-token', build.token, project: nil, ip: '127.0.0.1')
auth_params = {
  scopes: ['repository:gitlab-org/cluster-integration/auto-build-image:pull'],
  service: 'container_registry'
}

result = ::Auth::ContainerRegistryAuthenticationService
  .new(auth_result.project, auth_result.actor, auth_params)
  .execute(authentication_abilities: auth_result.authentication_abilities)

result[:token] # This is the JWT that can be decoded in https://jwt.io
```
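If you prefer not to paste the token into jwt.io, the payload can also be inspected directly in the same console session (a sketch, continuing from `result` above):

```ruby
require 'base64'
require 'json'

# Decode the JWT payload locally instead of pasting it into https://jwt.io.
# A JWS token has three base64url-encoded segments: header.payload.signature.
_header, payload, _signature = result[:token].split('.')

puts JSON.pretty_generate(JSON.parse(Base64.urlsafe_decode64(payload)))
# The "access" claim lists the repository actions (e.g. "pull") that were
# actually granted, which is what this reproduction is checking.
```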
@fabiopitino It seems that ci_scoped_job_token is related to this incident, so maybe we should also revert gitlab-org/gitlab!65848 (merged) before 14.1 gets cut so that it doesn't get shipped in the next self-managed release.
This incident was automatically closed because it has the IncidentResolved label.
Note: All incidents are closed automatically when they are resolved, even when there is a pending review. Please see the Incident Workflow section on the Incident Management handbook page for more information.
The new permission model for the CI job token scope should not have impacted existing projects unless the project setting was enabled. We had enabled the feature flag the day before, but we didn't observe issues with it because all projects had the setting disabled by default.
With gitlab-org/gitlab!65848 (merged) we enabled the setting by default for new projects, and that seems to be when we started seeing issues. The QA failures on Canary could be caused by the fact that new projects had the job token scope enabled and by default don't have access to other projects unless those are allowlisted in the job token scope (via project settings).
I'm wondering whether we observed this issue with existing projects, which may suggest that some logic leaked outside the FF and project setting guards.
The project setting controls which projects are allowlisted for access via CI_JOB_TOKEN.
The feature flag controls whether the project setting is visible to users and whether we track that authentications occur via CI_JOB_TOKEN - this is still the ultimate control point.
By disabling the feature flag we stopped tracking whether current_user was being authenticated via CI_JOB_TOKEN, which prevents any of the permission restrictions that the job token scope could have applied.
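To make the two control points concrete, here is a simplified sketch of how they are intended to compose; class and method names are illustrative, not the exact application code:

```ruby
# Illustrative sketch only - names are simplified, not the real implementation.
def project_allowed_for_job_token?(accessed_project, job)
  source_project = job.project

  # Guard 1: the feature flag. With ci_scoped_job_token disabled we never track
  # that the request was authenticated via CI_JOB_TOKEN, so no scope
  # restriction can apply downstream.
  return true unless Feature.enabled?(:ci_scoped_job_token, source_project)

  # Guard 2: the per-project setting. Existing projects had this disabled,
  # which is why only newly created projects (default: enabled) were affected.
  return true unless source_project.ci_job_token_scope_enabled?

  # Only projects allowlisted in the source project's job token scope are
  # accessible. The gap discussed below: public projects are not exempted.
  source_project.job_token_scope.includes?(accessed_project)
end
```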
Right, I think we saw this with new projects trying to fetch container registry images hosted by existing projects. #5174 (comment 628261169) has more details.
@jreporter @cheryl.li with regard to public access resources, we would need to revisit this part of the permissions. Similar discussion: gitlab-org/gitlab#332272 (comment 628098918). I think we would need to make a change so that public access resources are excluded from job token scope enforcement.
@jreporter @cheryl.li as per #5174 (comment 630173506), we haven't yet found data confirming that existing projects were impacted. This means that the rollout itself worked OK, but there were some issues.
So far we have observed two issues during the rollout:
1. Public projects should not be counted against the job token scope. Although this was by design, we realized that it causes more harm than good. Remediation: gitlab-org/gitlab#336398 (closed)
2. The QA failures were legitimate: we create new projects all the time and those got the job token scope feature enabled by default. We will open an issue to fix the QA tests when we roll out the feature again (see the sketch below for what that would roughly look like).
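For the QA item, the fix will roughly amount to allowlisting the target project in the freshly created project's job token scope as part of test setup. A hedged sketch (the model, attribute names, and project paths below are assumptions for illustration and may differ from the actual QA fix):

```ruby
# Hypothetical QA setup step: a newly created project now defaults to
# job_token_scope_enabled: true, so explicitly allowlist the project whose
# images/resources the CI job needs to access. Paths below are made up.
source_project = Project.find_by_full_path('gitlab-qa-sandbox-group/qa-test-project')
target_project = Project.find_by_full_path('gitlab-org/gitlab-runner')

Ci::JobToken::ProjectScopeLink.create!(
  source_project: source_project,
  target_project: target_project,
  added_by: source_project.owner
)
```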
@sabrams @10io could you help me understand how to trace authentication for docker pull after a successful docker login authentication using CI_JOB_TOKEN? I'm trying to figure out why this job was failing.
Do we use a different authentication route than other API requests?
@fabiopitino At the bottom of my note in #5174 (comment 628261169) is the sequence I used to reproduce the failure from the command line. This simulates the following authentication flow (a Ruby sketch of the token request follows the list):
1. GitLab CI build starts running.
2. Runner attempts to pull an image from Docker with no JWT. The registry returns a 401 Unauthenticated.
3. Runner authenticates with /jwt/auth via HTTP Basic Authentication with gitlab-ci-token as the username and the build token as the password. It passes a scope in the form repository:<project name>:pull (e.g. repository:gitlab-org/cluster-integration/auto-build-image:pull).
4. JwtController#auth returns a JWT with authorized actions. Note that even if no actions are allowed, the controller will return a 200.
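A small Ruby sketch of step 3, showing the response described in step 4 (equivalent to the curl command referenced earlier; the token value is a placeholder):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Reproduce the token request from step 3 above ('SOME TOKEN' is a placeholder).
job_token = 'SOME TOKEN'

uri = URI('https://gitlab.com/jwt/auth')
uri.query = URI.encode_www_form(
  service: 'container_registry',
  scope: 'repository:gitlab-org/cluster-integration/auto-build-image:pull'
)

request = Net::HTTP::Get.new(uri)
request.basic_auth('gitlab-ci-token', job_token)

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }

# JwtController#auth returns 200 even when no actions were granted; the
# interesting part is the "access" claim inside the returned JWT.
puts response.code
puts JSON.parse(response.body)['token']
```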
@stanhu Yes, that makes sense. In #5174 (comment 628261169), project_allowed_for_job_token returned false because the project anton/incident-5174-test was a newly created project that inherited the project CI/CD setting job_token_scope_enabled: true. This means that by default a CI_JOB_TOKEN from anton/incident-5174-test can only access resources from the same project unless other projects are added/allowlisted to the job token scope.
As you highlighted, the problem with this is that public_user_access is not taken into consideration. We are fixing this.
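In terms of the earlier illustrative helper, the fix is expected to look roughly like this (again a sketch, not the actual change; in_job_token_scope? stands in for the feature flag, project setting, and allowlist checks sketched above):

```ruby
# Sketch of the planned remediation (gitlab-org/gitlab#336398): publicly
# readable projects are exempt from job token scope enforcement, since the
# job token grants nothing there beyond what anonymous users already have.
def project_allowed_for_job_token?(accessed_project, job)
  return true if accessed_project.public?

  # ...followed by the existing checks sketched earlier.
  in_job_token_scope?(job.project, accessed_project)
end
```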