Serve *in*directly from artifacts
Prerequisites
- Pages site artifact zip files are no longer consumed https://gitlab.com/gitlab-org/gitlab-ce/issues/45888
- We regenerate any Pages artifact zip files that were already consumed https://gitlab.com/gitlab-org/gitlab-ee/issues/9346
- We have Pages pull config from the Rails API #161 (closed)
Proposal
- Pages receives a request for a site resource.
- Pages does a lookup in configs as usual.
- If the site exists, and Pages doesn't have the artifact.zip yet, it downloads it via the Rails API.
- Pages extracts the zip to a unique path in its filesystem.
- Pages serves the site from that path.
- Pages regularly checks for invalidation, so the filesystem acts as an LRU or other cache. (A rough sketch of this flow follows the list.)
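A minimal sketch of the download/extract/serve flow, in Go (Pages itself is Go). The `ensureSite` helper, the cache layout, the `siteID`/`archiveURL` parameters and the endpoint shape are illustrative assumptions, not existing Pages internals or Rails API routes:

```go
package pagescache

import (
	"archive/zip"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// ensureSite returns a local path the site can be served from, downloading and
// extracting the artifacts zip on first request. archiveURL is the Rails API URL
// for the archive; http.Get follows a redirect to object storage transparently.
func ensureSite(cacheRoot, siteID, archiveURL string) (string, error) {
	dest := filepath.Join(cacheRoot, siteID)
	if _, err := os.Stat(dest); err == nil {
		return dest, nil // cache hit: serve straight from disk
	}

	tmp, err := os.CreateTemp("", "pages-*.zip")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	resp, err := http.Get(archiveURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("fetching %s: %s", archiveURL, resp.Status)
	}
	if _, err := io.Copy(tmp, resp.Body); err != nil {
		return "", err
	}

	// Extract to a unique path. A real implementation should extract into a
	// temporary directory and rename it into place, and guard against "zip slip"
	// paths; both are omitted here for brevity.
	if err := extractZip(tmp.Name(), dest); err != nil {
		return "", err
	}
	return dest, nil
}

func extractZip(src, dest string) error {
	r, err := zip.OpenReader(src)
	if err != nil {
		return err
	}
	defer r.Close()
	for _, f := range r.File {
		target := filepath.Join(dest, f.Name)
		if f.FileInfo().IsDir() {
			if err := os.MkdirAll(target, 0o755); err != nil {
				return err
			}
			continue
		}
		if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
			return err
		}
		if err := copyZipFile(f, target); err != nil {
			return err
		}
	}
	return nil
}

func copyZipFile(f *zip.File, target string) error {
	in, err := f.Open()
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(target)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}
```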
Pros
- ~Geo doesn't need to "sync" anything at all.
- Pages becomes HA and scalable?
- Pages site symlinks continue to work.
- Object storage is not required.
Cons
- If we extract artifact.zip on demand, there is a cold-cache problem: whenever a request for a not-yet-cached site comes in, there is an initial download/extract delay before the node can serve it, and each "Pages" node pays this cost separately. A site can be several GB if it includes a lot of images, so it would 503 for a while.
- Invalidating/expiring is complex: either you have to ping each machine to free space, or each machine has to ping the API to determine whether it can remove a folder from disk - O(N*M)? (Sketched below.)
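To make the second con concrete, a hedged sketch of the per-node polling approach: every node periodically asks the API whether each of its cached sites is still current and evicts stale ones. The endpoint and status-code convention are assumptions; with N nodes each holding up to M sites, this is the O(N*M) cost mentioned above.

```go
package pagescache

import (
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// evictionLoop runs on every Pages node and polls the Rails API for each cached
// site, removing the local copy when the API says it is gone or superseded.
func evictionLoop(cacheRoot, apiBase string, interval time.Duration) {
	for range time.Tick(interval) {
		entries, err := os.ReadDir(cacheRoot)
		if err != nil {
			continue
		}
		for _, e := range entries { // M cached sites on this node
			siteID := e.Name()
			// Hypothetical endpoint; 404/410 is taken to mean "no longer current".
			resp, err := http.Get(fmt.Sprintf("%s/pages/sites/%s/current", apiBase, siteID))
			if err != nil {
				continue // API unreachable: keep serving the cached copy
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusNotFound || resp.StatusCode == http.StatusGone {
				os.RemoveAll(filepath.Join(cacheRoot, siteID))
			}
		}
	}
}
```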
More related discussion
GitLab Pages direction doc (internal link): https://docs.google.com/document/d/18awpT5MVhlmdX0erO1X__Od59KZvrXBbdV0HG5A7WZk/edit#
Once we have #78 (closed) done, we could rework the main pages functionality as a set of pointers to specific pages artifacts, accessed in the same way.
We'd need to stop deleting pages artifacts, and somehow regenerate the ones already deleted, of course, but then custom domains and the group / project pages can just become pointers to artifacts, with an optional filesystem cache to speed things up.
Once a given pages artifact is no longer the latest, it can expire according to the usual rules.
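For illustration only, such a pointer might look like this (a hypothetical Go type, not anything that exists in Pages today):

```go
package pagescache

import "time"

// SitePointer is a hypothetical shape for the "pointer" from a Pages site to a
// specific artifacts archive, replacing the extracted tree on shared storage.
type SitePointer struct {
	Domain      string     // custom domain or group/project Pages host
	ProjectID   int64      // project that owns the artifact
	JobID       int64      // CI job whose artifacts archive contains the site
	ArchivePath string     // root inside the archive, typically "public/"
	ExpiresAt   *time.Time // nil while this is the latest deployment; set once superseded
}
```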
For content, I think we want to implement https://gitlab.com/gitlab-org/gitlab-ce/issues/45888. Modulo existing customer data (which could in principle be backfilled), this will ensure you can always get the current Pages content for a site from the GitLab API (which may, of course, be serving a redirect to an archive in object storage).
Once we have this, we can treat the file store as a non-coherent temporary cache. If we're still interested in continued resilience while the GitLab API is unavailable, we can endeavour to keep it filled all the time. If the file store is lost, we can stand up a new, empty one, and the cache can be refilled from the GitLab API, either aggressively, or on-first-request.
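A short sketch of the "aggressive" refill option, assuming a hypothetical listing endpoint on the Rails API and reusing `ensureSite` from the earlier sketch; on-first-request refill is simply that earlier sketch starting from an empty cache directory:

```go
package pagescache

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type siteListing struct {
	SiteID     string `json:"site_id"`
	ArchiveURL string `json:"archive_url"`
}

// warmCache pre-fills the local file store so the node can keep serving even if
// the GitLab API becomes unavailable later.
func warmCache(cacheRoot, apiBase string) error {
	resp, err := http.Get(apiBase + "/pages/sites") // hypothetical endpoint
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("listing sites: %s", resp.Status)
	}

	var sites []siteListing
	if err := json.NewDecoder(resp.Body).Decode(&sites); err != nil {
		return err
	}
	for _, s := range sites {
		// Ignore individual failures; on-first-request refill will retry later.
		_, _ = ensureSite(cacheRoot, s.SiteID, s.ArchiveURL)
	}
	return nil
}
```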
If we have two of these backends, they don't have to share an NFS mount, and the loss of one won't cause an outage.