API cache should account for pre-signed URL expiry
Summary
Context:
- Domain source configuration: https://docs.gitlab.com/ee/administration/pages/index.html#domain-source-configuration
- GitLab API cache for domain source configuration: https://docs.gitlab.com/ee/administration/pages/index.html#gitlab-api-cache-configuration
- Pages with object storage: https://docs.gitlab.com/ee/administration/pages/index.html#using-object-storage
When using the domain source configuration feature, a GitLab API cache is used to reduce the number of calls made to the GitLab backend.
Part of this cached API response is the URL to the actual pages artifact, which is a pre-signed authorized URL when object storage is in play.
GitLab internally assumes that all the URLs it provides as part of the API response are valid for up to one day: https://gitlab.com/gitlab-org/gitlab/blob/ce9b9317a9116995f2a9603e628787effca6f0dc/app/models/pages/lookup_path.rb#L28-41
Under normal conditions, the URL that GitLab hands to Pages remains usable for as long as the cache holds it, because the token embedded in the URL normally does not expire until one day has passed.
However, when temporary service-account credentials are used, such as AssumeRoleWithWebIdentity (one of the role-assumption methods described at https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html) with AWS S3 as the object storage backend, the generated URLs can stop working at any time, well before their stated one-day expiry, because the underlying credentials only last up to 1 hour.
From https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html
"If you created a presigned URL using a temporary token, then the URL expires when the token expires, even if the URL was created with a later expiration time."
When this occurs, the cache still considers the entry valid and tries to use the URL, only to receive a 400 Bad Request response from the object storage service. The user is then served a 500 error response by the Pages service and the page fails to load.
Steps to reproduce
- Setup GitLab using its Helm Chart on a Kubernetes cluster
- Configure use of AWS S3 as the object storage provider
- Use IAM authentication through service accounts, so the actual auth tokens are temporary (1h): gitlab-org/charts/gitlab#1832 (closed)
- Repeatedly load any deployed GitLab Pages site for 61+ minutes
Example Project
This is not limited to a specific project
What is the current bug behavior?
Page requests fail with a 500 error response. The backend logs a 400 Bad Request failure when reading from object storage.
What is the expected correct behavior?
Page requests succeed. The backend silently retries with a newly fetched URL after it encounters a 400 Bad Request.
Relevant logs and/or screenshots
{"correlation_id":"01FS4TV34SQQW3TJE7MVMPPT23","error":"httprange: new resource 400: \"400 Bad Request\"","level":"trace","msg":"Root call","path":"https://bucket.s3.eu-west-1.amazonaws.com/hashed/path/pages_deployments/1/artifacts.zip?X-Amz-Expires=86400\u0026X-Amz-Date=20220111T150616Z\u0026X-Amz-Security-Token=…\u0026X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=CREDENTIAL%2F20220111%2Feu-west-1%2Fs3%2Faws4_request\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=HASH","time":"2022-01-11T15:06:57Z","vfs":"zip"}
{"correlation_id":"01FS4TV34SQQW3TJE7MVMPPT23","error":"httprange: new resource 400: \"400 Bad Request\"","host":"group.gitlab-pages.example.com","level":"error","msg":"vfs.Root","path":"/","time":"2022-01-11T15:06:57Z"}
Output of checks
This issue was observed on GitLab 14.4 and GitLab 14.5
Workaround
Lower gitlab_cache_expiry to 1 second
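For an Omnibus install this would map to something like the following in `/etc/gitlab/gitlab.rb` (setting name taken from the cache documentation linked above; a Helm chart deployment as in the reproduction steps would set the equivalent value through the Pages chart values instead):

```ruby
# Effectively disables the API cache: almost every page request re-fetches
# the lookup (and a fresh pre-signed URL) from the GitLab API.
gitlab_pages['gitlab_cache_expiry'] = 1
```

This trades the cache's protection of the GitLab backend for correctness, so it is only a stopgap.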
Possible fixes
Silently retry with a freshly fetched URL when using the cached URL fails, perhaps somewhere in https://gitlab.com/gitlab-org/gitlab-pages/blob/b16bf8296b4d3319b32e74046b1aae3e21e2a947/internal/source/gitlab/gitlab.go or its callers?
~"devops::release" ~"group::release" Category:Pages