Parallel jobs attempting to AssumeRoleWithWebIdentity in AWS IAM fail with `InvalidIdentityToken` error
Summary
When GitLab is used as an OIDC provider for AWS in CI/CD pipelines, jobs that call the AWS STS AssumeRoleWithWebIdentity operation in parallel fail with an InvalidIdentityToken error:
An error occurred (InvalidIdentityToken) when calling the AssumeRoleWithWebIdentity operation: Couldn't retrieve verification key from your identity provider, please reference AssumeRoleWithWebIdentity documentation for requirements
An example of a command that triggers the error:
$ STS=$(aws sts assume-role-with-web-identity --role-arn "$AWS_ROLE_ARN" --role-session-name "GitLabRunner-${CI_PROJECT_ID}-${ENV_NAME}-${CI_PIPELINE_ID}" --web-identity-token $CI_JOB_JWT_V2 --duration-seconds $STS_DURATION_SECONDS --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' --output text)
A simple retry after hitting this error is sufficient as a workaround.
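The retry workaround can be sketched as a small shell helper. This is a minimal sketch, not a definitive implementation: `retry` and `assume_role` are hypothetical names introduced here, bash (or another shell that supports `local`) is assumed, and the helper retries on any non-zero exit status rather than only on InvalidIdentityToken.

```shell
#!/usr/bin/env bash

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
# Progress messages go to stderr so that stdout carries only COMMAND's output.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Hypothetical wrapper around the call from this report (same flags as above).
assume_role() {
  aws sts assume-role-with-web-identity \
    --role-arn "$AWS_ROLE_ARN" \
    --role-session-name "GitLabRunner-${CI_PROJECT_ID}-${ENV_NAME}-${CI_PIPELINE_ID}" \
    --web-identity-token "$CI_JOB_JWT_V2" \
    --duration-seconds "$STS_DURATION_SECONDS" \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
    --output text
}
```

In a job script this could be used as `STS=$(retry 3 5 assume_role)`, which makes up to three attempts with a five-second pause between them. The attempt count and delay are arbitrary values chosen for illustration.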
Example Project
Available internally to GitLab team members via Zendesk.
What is the current bug behavior?
Only one of the jobs that call AssumeRoleWithWebIdentity simultaneously succeeds. The others fail with the verification key error and have to be retried.
What is the expected correct behavior?
Jobs that run simultaneously should not fail without the call first being retried.
Relevant logs and/or screenshots
2022-08-31 11:42:31,533 - MainThread - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): sts.amazonaws.com:443
2022-08-31 11:42:32,940 - MainThread - urllib3.connectionpool - DEBUG - https://sts.amazonaws.com:443 "POST / HTTP/1.1" 400 390
b'<ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">\n <Error>\n <Type>Sender</Type>\n <Code>InvalidIdentityToken</Code>\n <Message>Couldn\'t retrieve verification key from your identity provider, please reference AssumeRoleWithWebIdentity documentation for requirements</Message>\n </Error>\n <RequestId>d04e9943-449a-442c-b17d-216276719b93</RequestId>\n</ErrorResponse>\n'
2022-08-31 11:42:32,942 - MainThread - botocore.hooks - DEBUG - Event needs-retry.sts.AssumeRoleWithWebIdentity: calling handler <botocore.retryhandler.RetryHandler object at 0x7f5782646f20>
2022-08-31 11:42:32,942 - MainThread - botocore.retryhandler - DEBUG - No retry needed.
Output of checks
This bug happens on GitLab.com
Possible fixes
Debug logs suggest that there is a built-in retry handler that could potentially be used to automatically retry this type of failure:
2022-08-31 11:42:32,942 - MainThread - botocore.hooks - DEBUG - Event needs-retry.sts.AssumeRoleWithWebIdentity: calling handler <botocore.retryhandler.RetryHandler object at 0x7f5782646f20>
2022-08-31 11:42:32,942 - MainThread - botocore.retryhandler - DEBUG - No retry needed.
One valid way to handle this would be to automatically retry the AWS call when running into this error.
Proposal
Add a response header to the JWKS endpoint: `cache-control: public, max-age=18905, must-revalidate, no-transform`
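To check whether a JWKS endpoint already sends such a header, the `max-age` value can be pulled out of the response headers. The helper below is a rough sketch: `cache_max_age` is a hypothetical name, and it assumes the headers arrive on stdin in the form printed by `curl -sI`.

```shell
# Print the max-age value (in seconds) from a Cache-Control header read on stdin.
# Prints nothing if the header or the max-age directive is absent.
cache_max_age() {
  tr -d '\r' \
    | grep -i '^cache-control:' \
    | grep -o -i 'max-age=[0-9]*' \
    | head -n 1 \
    | cut -d= -f2
}
```

A possible invocation is `curl -sI "$JWKS_URL" | cache_max_age`, where `JWKS_URL` is a placeholder for the `jwks_uri` value published in the provider's OpenID configuration. With the proposed header in place this would print `18905`.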