Serve directly from artifacts in object storage
Prerequisites
- Pages site artifact zip files are no longer consumed https://gitlab.com/gitlab-org/gitlab-ce/issues/45888
- We regenerate any Pages artifact zip files that were already consumed https://gitlab.com/gitlab-org/gitlab-ee/issues/9346
- We have Pages pull config from the Rails API #161 (closed)
Proposal
- Pages receives a request for a site resource.
- Pages looks the site up in its configs, as usual.
- Pages proxies the matching artifact files from object storage to the client (a minimal sketch follows this list).
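To make the flow concrete, here is a minimal sketch in Go (the language the Pages daemon is written in). The `site` type, `lookupSite` helper, and `bucketBaseURL` field are hypothetical placeholders for the real config lookup and object-storage layout, not existing Pages code:

```go
package main

import (
	"io"
	"net/http"
	"strings"
)

// site holds the per-domain config Pages would pull from the Rails API.
// bucketBaseURL is a hypothetical field pointing at the extracted
// artifact files for one site in object storage.
type site struct {
	bucketBaseURL string // e.g. "https://objects.example.com/pages/group/project"
}

// sites maps a domain to its config; in real Pages this would be
// populated from the Rails API rather than hard-coded.
var sites = map[string]site{}

func lookupSite(host string) (site, bool) {
	s, ok := sites[strings.ToLower(host)]
	return s, ok
}

// handler resolves the request host to a site config, then proxies the
// requested path straight from object storage instead of local disk.
func handler(w http.ResponseWriter, r *http.Request) {
	s, ok := lookupSite(r.Host)
	if !ok {
		http.NotFound(w, r)
		return
	}

	resp, err := http.Get(s.bucketBaseURL + r.URL.Path)
	if err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		http.NotFound(w, r)
		return
	}

	w.Header().Set("Content-Type", resp.Header.Get("Content-Type"))
	io.Copy(w, resp.Body) // stream the object through to the client
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8090", nil)
}
```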
After MVP
- Pages caches the proxied files.
- Pages translates symlinks into redirects, if we decide this behavior is important enough to add back (sketched after this list).
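If redirect translation is added back, it could look roughly like the following, assuming the deploy step records which zip entries were symlinks and what they pointed at. Everything here (`entry`, `isSymlink`, `target`, `serveEntry`) is a hypothetical sketch, not existing Pages code:

```go
package pages

import (
	"net/http"
	"path"
)

// entry describes one file in a site's metadata. The type and fields are
// hypothetical: they assume the deploy step records which zip entries
// were symlinks and their targets.
type entry struct {
	isSymlink bool
	target    string // link target, relative to the site root
}

// serveEntry answers a request for a symlink entry with a redirect,
// since the link cannot be resolved on disk once files live in object storage.
func serveEntry(w http.ResponseWriter, r *http.Request, e entry) {
	if e.isSymlink {
		http.Redirect(w, r, path.Join("/", e.target), http.StatusMovedPermanently)
		return
	}
	// otherwise proxy the object from storage, as in the earlier sketch
}
```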
Pros
- ~Geo doesn't need to "sync" anything at all.
- Pages becomes highly available and horizontally scalable?
Cons
- Object storage is required.
- Pages site symlinks would break, but it may be possible to reimplement the behavior by translating them to redirects.
- Pages would need to keep the "old" way as well, to support small, simple instances (e.g. a Raspberry Pi).
- How would we transition?
More related discussion
GitLab Pages direction doc (internal link): https://docs.google.com/document/d/18awpT5MVhlmdX0erO1X__Od59KZvrXBbdV0HG5A7WZk/edit#
@brodock https://gitlab.com/gitlab-org/gitlab-ee/issues/4611#note_88542926:
I built similar infrastructure for hosting landing pages at my previous company. The use case is very similar to ours, and it was also inspired by how GitHub Pages works.
The endpoint that served the pages proxied requests to the S3 bucket: based on the domain, we would find the base folder and then look for the files. Because S3 calls can be expensive, I also added a short-lived TTL cache between the two, so a spike in traffic would not mean multiple requests to object storage.
Relevant information is here: http://shipit.resultadosdigitais.com.br/blog/do-apache-ao-go-como-melhoramos-nossas-landing-pages/ (it's in Portuguese, but Google Translate does a really good job).
For that specific use case, the domain mapping was kept both in the database (for persistence) and in Redis (for speed). Redis was used not as a cache but as the main source of truth for lookups.
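A minimal in-memory sketch of the short-lived TTL cache @brodock describes, sitting between the proxy and object storage; the `ttlCache` type and its API are invented for illustration:

```go
package pages

import (
	"sync"
	"time"
)

// cacheEntry is one cached object body with an expiry time.
type cacheEntry struct {
	body      []byte
	expiresAt time.Time
}

// ttlCache is a minimal in-memory version of the short-lived cache
// described above, placed between the proxy and object storage.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

func newTTLCache(ttl time.Duration) *ttlCache {
	return &ttlCache{ttl: ttl, entries: make(map[string]cacheEntry)}
}

// get returns the cached body for a key if it is still fresh.
func (c *ttlCache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false
	}
	return e.body, true
}

// put stores a freshly fetched object with a short expiry.
func (c *ttlCache) put(key string, body []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = cacheEntry{body: body, expiresAt: time.Now().Add(c.ttl)}
}
```

Keys would be the request host plus path; a TTL of a few seconds is enough to collapse a traffic spike into a single upstream fetch without serving stale deployments for long.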
@nick.thomas https://gitlab.com/gitlab-org/gitlab-ee/issues/4611#note_101235573:
Serving the contents of the artifacts directly from object storage does several undesirable things from my point of view:
- Makes object storage mandatory for Pages (unnecessary complexity for small sites)
- Requires many changes in the Pages daemon
- Breaks current symlink support, so breaking existing pages deployments
Ultimately, though, the route we take there is up to ~Release and @jlenny.
We could serve pages directly from ZIP archives, but loading all of the metadata is an IO- and memory-intensive operation, so it is not worthwhile.
Maybe the solution is to assume that Pages always accesses data behind object storage. We could then build Pages and Sidekiq to access object storage directly instead of the filesystem: extract the data there and update the metadata so that Pages picks up the new changes.
Can Pages be just a regular OAuth application? Can Pages use the general API to download artifacts? That is possible even today. Maybe we could just generate one-time URLs to download artifacts, similar to how you can sign S3 URLs (sketched below).
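That last idea maps directly onto presigned URLs. A sketch using the AWS SDK for Go v1; the bucket/key layout here is illustrative, not GitLab's actual artifact paths:

```go
package pages

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// presignArtifactURL returns a short-lived URL for one object, so Pages
// (or a user) can download an artifact without holding S3 credentials.
func presignArtifactURL(bucket, key string) (string, error) {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	req, _ := svc.GetObjectRequest(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})

	// the returned URL embeds a signature and expires after 15 minutes
	return req.Presign(15 * time.Minute)
}
```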