Serve directly from artifacts in object storage
Prerequisites
- Pages site artifact zip files are no longer consumed https://gitlab.com/gitlab-org/gitlab-ce/issues/45888
- We regenerate any Pages artifact zip files that were already consumed https://gitlab.com/gitlab-org/gitlab-ee/issues/9346
- We have Pages pull config from the Rails API #161 (closed)
Proposal
- Pages receives a request for a site resource.
- Pages looks the site up in its configs, as usual.
- Pages proxies the matching artifact files from object storage to the client (a minimal sketch follows this list).
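To make the flow concrete, here is a minimal sketch in Go (the language the Pages daemon is written in). The `site` type, `lookupSite` helper, and `bucketBaseURL` field are hypothetical placeholders for the real config lookup and object-storage layout, not existing Pages code:

```go
package main

import (
	"io"
	"net/http"
	"strings"
)

// site holds the per-domain config Pages would pull from the Rails API.
// bucketBaseURL is a hypothetical field pointing at the extracted
// artifact files for one site in object storage.
type site struct {
	bucketBaseURL string // e.g. "https://objects.example.com/pages/group/project"
}

// sites maps a domain to its config; in real Pages this would be
// populated from the Rails API rather than hard-coded.
var sites = map[string]site{}

func lookupSite(host string) (site, bool) {
	s, ok := sites[strings.ToLower(host)]
	return s, ok
}

// handler resolves the request host to a site config, then proxies the
// requested path straight from object storage instead of local disk.
func handler(w http.ResponseWriter, r *http.Request) {
	s, ok := lookupSite(r.Host)
	if !ok {
		http.NotFound(w, r)
		return
	}

	resp, err := http.Get(s.bucketBaseURL + r.URL.Path)
	if err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		http.NotFound(w, r)
		return
	}

	w.Header().Set("Content-Type", resp.Header.Get("Content-Type"))
	io.Copy(w, resp.Body) // stream the object through to the client
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8090", nil)
}
```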
After MVP
- Pages caches the proxied files.
- Pages translates symlinks into redirects, if we decide this behavior is important enough to add back (sketched after this list).
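If redirect translation is added back, it could look roughly like the following, assuming the deploy step records which zip entries were symlinks and what they pointed at. Everything here (`entry`, `isSymlink`, `target`, `serveEntry`) is a hypothetical sketch, not existing Pages code:

```go
package pages

import (
	"net/http"
	"path"
)

// entry describes one file in a site's metadata. The type and fields are
// hypothetical: they assume the deploy step records which zip entries
// were symlinks and their targets.
type entry struct {
	isSymlink bool
	target    string // link target, relative to the site root
}

// serveEntry answers a request for a symlink entry with a redirect,
// since the link cannot be resolved on disk once files live in object storage.
func serveEntry(w http.ResponseWriter, r *http.Request, e entry) {
	if e.isSymlink {
		http.Redirect(w, r, path.Join("/", e.target), http.StatusMovedPermanently)
		return
	}
	// otherwise proxy the object from storage, as in the earlier sketch
}
```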
Pros
- ~Geo doesn't need to "sync" anything at all.
- Pages becomes highly available and horizontally scalable?
Cons
- Object storage is required.
- Pages site symlinks would break, but it may be possible to reimplement the behavior by translating them to redirects.
- Pages would need to keep the "old" way as well, to support small, simple instances (e.g. a Raspberry Pi).
- How would we transition?
More related discussion
GitLab Pages direction doc (internal link): https://docs.google.com/document/d/18awpT5MVhlmdX0erO1X__Od59KZvrXBbdV0HG5A7WZk/edit#
@brodock https://gitlab.com/gitlab-org/gitlab-ee/issues/4611#note_88542926:
I built similar infrastructure for hosting landing pages at my previous company. The use case is very similar to ours, and it was also inspired by how GitHub Pages works.
The endpoint that served the pages proxied requests to the S3 bucket: based on the domain, we would find the base folder and then look for the files. Because S3 calls can be expensive, I also added a short-lived TTL cache between the two, so a spike in traffic would not mean multiple requests to object storage.
Relevant information is here: http://shipit.resultadosdigitais.com.br/blog/do-apache-ao-go-como-melhoramos-nossas-landing-pages/ (it's in Portuguese, but Google Translate does a really good job).
For that specific use case, the domain mapping was kept both in the database (for persistence) and in Redis (for speed). Redis was used not as a cache but as the main source of truth for lookups.
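A minimal in-memory sketch of the short-lived TTL cache @brodock describes, sitting between the proxy and object storage; the `ttlCache` type and its API are invented for illustration:

```go
package pages

import (
	"sync"
	"time"
)

// cacheEntry is one cached object body with an expiry time.
type cacheEntry struct {
	body      []byte
	expiresAt time.Time
}

// ttlCache is a minimal in-memory version of the short-lived cache
// described above, placed between the proxy and object storage.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

func newTTLCache(ttl time.Duration) *ttlCache {
	return &ttlCache{ttl: ttl, entries: make(map[string]cacheEntry)}
}

// get returns the cached body for a key if it is still fresh.
func (c *ttlCache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false
	}
	return e.body, true
}

// put stores a freshly fetched object with a short expiry.
func (c *ttlCache) put(key string, body []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = cacheEntry{body: body, expiresAt: time.Now().Add(c.ttl)}
}
```

Keys would be the request host plus path; a TTL of a few seconds is enough to collapse a traffic spike into a single upstream fetch without serving stale deployments for long.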
@nick.thomas https://gitlab.com/gitlab-org/gitlab-ee/issues/4611#note_101235573:
Serving the contents of the artifacts directly from object storage does several undesirable things from my point of view:
- Makes object storage mandatory for Pages (unnecessary complexity for small sites)
- Requires many changes in the Pages daemon
- Breaks current symlink support, so breaking existing pages deployments
Ultimately, though, the route we take there is up to ~Release and @jlenny.
We could serve pages directly from ZIP archives, but loading all of the metadata is an IO- and memory-intensive operation, so it is not worthwhile.
Maybe the solution is to assume that Pages always accesses data behind object storage. We could then build Pages and Sidekiq to access object storage directly instead of the filesystem: extract the data there and update the metadata so that Pages picks up the new changes.
Can Pages be just a regular OAuth application? Can Pages use the general API to download artifacts? That is possible even today. Maybe we could just generate one-time URLs to download artifacts, similar to how you can sign S3 URLs (sketched below).
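That last idea maps directly onto presigned URLs. A sketch using the AWS SDK for Go v1; the bucket/key layout here is illustrative, not GitLab's actual artifact paths:

```go
package pages

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// presignArtifactURL returns a short-lived URL for one object, so Pages
// (or a user) can download an artifact without holding S3 credentials.
func presignArtifactURL(bucket, key string) (string, error) {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	req, _ := svc.GetObjectRequest(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})

	// the returned URL embeds a signature and expires after 15 minutes
	return req.Presign(15 * time.Minute)
}
```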