API cache should account for pre-signed URL expiry
Summary
Context:
- Domain source configuration: https://docs.gitlab.com/ee/administration/pages/index.html#domain-source-configuration
- GitLab API cache for domain source configuration: https://docs.gitlab.com/ee/administration/pages/index.html#gitlab-api-cache-configuration
- Pages with object storage: https://docs.gitlab.com/ee/administration/pages/index.html#using-object-storage
When using the domain source configuration feature, a GitLab API cache is used to reduce the number of calls made to the GitLab backend.
Part of this cached API response is the URL to the actual pages artifact, which is a pre-signed authorized URL when object storage is in play.
GitLab internally assumes that all the URLs it provides as part of the API response are valid for up to one day: https://gitlab.com/gitlab-org/gitlab/blob/ce9b9317a9116995f2a9603e628787effca6f0dc/app/models/pages/lookup_path.rb#L28-41
Under normal conditions, the URL that GitLab hands to Pages remains usable for as long as the cache holds it, because the token embedded in the URL normally does not expire until one day has passed.
However, when temporary service-account credentials are used, such as AssumeRoleWithWebIdentity (one of the role-assumption methods described at https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html) with AWS S3 as the object storage backend, the generated URLs can stop working at any time, well before their stated one-day expiry, because the underlying credentials only last up to 1 hour.
From https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html
"If you created a presigned URL using a temporary token, then the URL expires when the token expires, even if the URL was created with a later expiration time."
When this occurs, the cache still considers the entry valid and tries to use the URL, only to receive a 400 Bad Request response from the object storage service. The user is then served a 500 error response by the Pages service and the page fails to load.
Steps to reproduce
- Setup GitLab using its Helm Chart on a Kubernetes cluster
- Configure use of AWS S3 as the object storage provider
- Use IAM authentication through service accounts, so the actual auth tokens are temporary (1h): gitlab-org/charts/gitlab#1832 (closed)
- Repeatedly load any deployed GitLab Pages site for 61+ minutes
Example Project
This is not limited to a specific project
What is the current bug behavior?
Page requests fail with a 500 error response. The backend logs a 400 Bad Request failure when reading from object storage.
What is the expected correct behavior?
Page requests succeed. The backend silently retries with a newly fetched URL after it encounters a 400 Bad Request.
Relevant logs and/or screenshots
{"correlation_id":"01FS4TV34SQQW3TJE7MVMPPT23","error":"httprange: new resource 400: \"400 Bad Request\"","level":"trace","msg":"Root call","path":"https://bucket.s3.eu-west-1.amazonaws.com/hashed/path/pages_deployments/1/artifacts.zip?X-Amz-Expires=86400\u0026X-Amz-Date=20220111T150616Z\u0026X-Amz-Security-Token=…\u0026X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=CREDENTIAL%2F20220111%2Feu-west-1%2Fs3%2Faws4_request\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=HASH","time":"2022-01-11T15:06:57Z","vfs":"zip"}
{"correlation_id":"01FS4TV34SQQW3TJE7MVMPPT23","error":"httprange: new resource 400: \"400 Bad Request\"","host":"group.gitlab-pages.example.com","level":"error","msg":"vfs.Root","path":"/","time":"2022-01-11T15:06:57Z"}
Output of checks
This issue was observed on GitLab 14.4 and GitLab 14.5
Workaround
Lower gitlab_cache_expiry to 1 second
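For an Omnibus install this would map to something like the following in `/etc/gitlab/gitlab.rb` (setting name taken from the cache documentation linked above; a Helm chart deployment as in the reproduction steps would set the equivalent value through the Pages chart values instead):

```ruby
# Effectively disables the API cache: almost every page request re-fetches
# the lookup (and a fresh pre-signed URL) from the GitLab API.
gitlab_pages['gitlab_cache_expiry'] = 1
```

This trades the cache's protection of the GitLab backend for correctness, so it is only a stopgap.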
Possible fixes
Silently retry with a freshly fetched URL when using the cached URL fails, perhaps somewhere in https://gitlab.com/gitlab-org/gitlab-pages/blob/b16bf8296b4d3319b32e74046b1aae3e21e2a947/internal/source/gitlab/gitlab.go or its callers?
~"devops::release" ~"group::release" Category:Pages