Serve *in*directly from artifacts
Prerequisites
- Pages site artifact zip files are no longer consumed https://gitlab.com/gitlab-org/gitlab-ce/issues/45888
- We regenerate any Pages artifact zip files that were already consumed https://gitlab.com/gitlab-org/gitlab-ee/issues/9346
- We have Pages pull config from the Rails API #161 (closed)
Proposal
- Pages receives a request for a site resource.
- Pages does a lookup in configs as usual.
- If the site exists, and Pages doesn't have the artifact.zip yet, it downloads it via the Rails API.
- Pages extracts the zip to a unique path in its filesystem.
- Pages serves the site from that path.
- Pages regularly checks for invalidation, so the filesystem acts as an LRU or other cache. (A rough sketch of this flow follows the list.)
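A minimal sketch of the download/extract/serve flow, in Go (Pages itself is Go). The `ensureSite` helper, the cache layout, the `siteID`/`archiveURL` parameters and the endpoint shape are illustrative assumptions, not existing Pages internals or Rails API routes:

```go
package pagescache

import (
	"archive/zip"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// ensureSite returns a local path the site can be served from, downloading and
// extracting the artifacts zip on first request. archiveURL is the Rails API URL
// for the archive; http.Get follows a redirect to object storage transparently.
func ensureSite(cacheRoot, siteID, archiveURL string) (string, error) {
	dest := filepath.Join(cacheRoot, siteID)
	if _, err := os.Stat(dest); err == nil {
		return dest, nil // cache hit: serve straight from disk
	}

	tmp, err := os.CreateTemp("", "pages-*.zip")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	resp, err := http.Get(archiveURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("fetching %s: %s", archiveURL, resp.Status)
	}
	if _, err := io.Copy(tmp, resp.Body); err != nil {
		return "", err
	}

	// Extract to a unique path. A real implementation should extract into a
	// temporary directory and rename it into place, and guard against "zip slip"
	// paths; both are omitted here for brevity.
	if err := extractZip(tmp.Name(), dest); err != nil {
		return "", err
	}
	return dest, nil
}

func extractZip(src, dest string) error {
	r, err := zip.OpenReader(src)
	if err != nil {
		return err
	}
	defer r.Close()
	for _, f := range r.File {
		target := filepath.Join(dest, f.Name)
		if f.FileInfo().IsDir() {
			if err := os.MkdirAll(target, 0o755); err != nil {
				return err
			}
			continue
		}
		if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
			return err
		}
		if err := copyZipFile(f, target); err != nil {
			return err
		}
	}
	return nil
}

func copyZipFile(f *zip.File, target string) error {
	in, err := f.Open()
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(target)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}
```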
Pros
- ~Geo doesn't need to "sync" anything at all.
- Pages becomes HA and scalable?
- Pages site symlinks continue to work.
- Object storage is not required.
Cons
- If we extract artifact.zip on demand, there is a cold-cache problem: whenever a request for a not-yet-cached site comes in, there is an initial download/extract delay before the node can serve it, and each "Pages" node pays this cost separately. A site can be several GB if it includes a lot of images, so it would 503 for a while.
- Invalidating/expiring is complex: either you have to ping each machine to free space, or each machine has to ping the API to determine whether it can remove a folder from disk - O(N*M)? (Sketched below.)
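To make the second con concrete, a hedged sketch of the per-node polling approach: every node periodically asks the API whether each of its cached sites is still current and evicts stale ones. The endpoint and status-code convention are assumptions; with N nodes each holding up to M sites, this is the O(N*M) cost mentioned above.

```go
package pagescache

import (
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// evictionLoop runs on every Pages node and polls the Rails API for each cached
// site, removing the local copy when the API says it is gone or superseded.
func evictionLoop(cacheRoot, apiBase string, interval time.Duration) {
	for range time.Tick(interval) {
		entries, err := os.ReadDir(cacheRoot)
		if err != nil {
			continue
		}
		for _, e := range entries { // M cached sites on this node
			siteID := e.Name()
			// Hypothetical endpoint; 404/410 is taken to mean "no longer current".
			resp, err := http.Get(fmt.Sprintf("%s/pages/sites/%s/current", apiBase, siteID))
			if err != nil {
				continue // API unreachable: keep serving the cached copy
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusGone {
				os.RemoveAll(filepath.Join(cacheRoot, siteID))
			}
		}
	}
}
```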
More related discussion
GitLab Pages direction doc (internal link): https://docs.google.com/document/d/18awpT5MVhlmdX0erO1X__Od59KZvrXBbdV0HG5A7WZk/edit#
Once we have #78 (closed) done, we could rework the main pages functionality as a set of pointers to specific pages artifacts, accessed in the same way.
We'd need to stop deleting pages artifacts, and somehow regenerate the ones already deleted, of course, but then custom domains and the group / project pages can just become pointers to artifacts, with an optional filesystem cache to speed things up.
Once a given pages artifact is no longer the latest, it can expire according to the usual rules.
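For illustration only, such a pointer might look like this (a hypothetical Go type, not anything that exists in Pages today):

```go
package pagescache

import "time"

// SitePointer is a hypothetical shape for the "pointer" from a Pages site to a
// specific artifacts archive, replacing the extracted tree on shared storage.
type SitePointer struct {
	Domain      string     // custom domain or group/project Pages host
	ProjectID   int64      // project that owns the artifact
	JobID       int64      // CI job whose artifacts archive contains the site
	ArchivePath string     // root inside the archive, typically "public/"
	ExpiresAt   *time.Time // nil while this is the latest deployment; set once superseded
}
```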
For content, I think we want to implement https://gitlab.com/gitlab-org/gitlab-ce/issues/45888. Modulo existing customer data (which could in principle be backfilled), this will ensure you can always get the current Pages content for a site from the GitLab API (which may, of course, be serving a redirect to an archive in object storage).
Once we have this, we can treat the file store as a non-coherent temporary cache. If we're still interested in continued resilience while the GitLab API is unavailable, we can endeavour to keep it filled all the time. If the file store is lost, we can stand up a new, empty one, and the cache can be refilled from the GitLab API, either aggressively, or on-first-request.
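A short sketch of the "aggressive" refill option, assuming a hypothetical listing endpoint on the Rails API and reusing `ensureSite` from the earlier sketch; on-first-request refill is simply that earlier sketch starting from an empty cache directory:

```go
package pagescache

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type siteListing struct {
	SiteID     string `json:"site_id"`
	ArchiveURL string `json:"archive_url"`
}

// warmCache pre-fills the local file store so the node can keep serving even if
// the GitLab API becomes unavailable later.
func warmCache(cacheRoot, apiBase string) error {
	resp, err := http.Get(apiBase + "/pages/sites") // hypothetical endpoint
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("listing sites: %s", resp.Status)
	}

	var sites []siteListing
	if err := json.NewDecoder(resp.Body).Decode(&sites); err != nil {
		return err
	}
	for _, s := range sites {
		// Ignore individual failures; on-first-request refill will retry later.
		_, _ = ensureSite(cacheRoot, s.SiteID, s.ArchiveURL)
	}
	return nil
}
```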
If we have two of these backends, they don't have to share an NFS mount, and the loss of one won't cause an outage.