Eat gitlab-pages
Background
We're currently considering the next step for GitLab Pages, architecturally speaking. The various options are detailed in https://docs.google.com/document/d/18awpT5MVhlmdX0erO1X__Od59KZvrXBbdV0HG5A7WZk/edit# and the following issues:
- gitlab-pages#195 (closed) (serve directly from object storage)
- gitlab-pages#196 (closed) (serve indirectly from artifacts, which may or may not be in object storage)
Whichever way we go, the options all have one thing in common: we're going to start making an API request from the Pages daemon to GitLab for (almost) every HTTP request. This API request is, more or less, a thin wrapper around a database lookup; we don't want the Pages daemon to have direct access to the database.
History
Talking to GitLab for every request invalidates some long-standing assumptions about the desirability of gitlab-pages being a standalone component. At this crossroads, I think it makes sense to evaluate whether to take its functionality and integrate it into the main gitlab-ce codebase. The core functionality of Pages is to serve static HTML, CSS, JS, and other files, currently held on disk. We're looking at holding them elsewhere instead, but the functionality itself is exceedingly simple.
Originally, we wanted Pages to keep working even if GitLab was down. To do this, we write all the Pages artifacts and the config to a shared NFS filesystem, common to sidekiq, unicorn, and the pages daemon. Multiple Pages daemons run against that NFS root and poll for updates made by sidekiq and unicorn. However, this doesn't scale well for GitLab.com.
The introduction of access control necessarily meant that, with the existing scheme, private Pages sites stop working if gitlab-ce is down. When we start reading domain config from the API instead of from disk, per gitlab-pages#161 (closed), that independence from GitLab is simply gone. Meanwhile, we've made great strides in improving the uptime of GitLab itself, and the no-downtime upgrade path.
Proposal
We can move the simple "serve this file" functionality from the Pages daemon into a Rails controller. All domain names would stay completely unchanged by this move. We'd modify the nginx config (and possibly gitlab-workhorse, although I think it's OK as-is) to pass any traffic for the "main" pages domain, or its subdomains, to the new rails controller instead of to the pages daemon.
This change to NGINX provides a natural feature flag, allowing us to develop the new (artifact-backed) functionality without jeopardizing the old (filesystem-backed) version. People can opt into the new method to try it out, and back out if unhappy. Once we have confidence in the new way of things, we can retire the gitlab-pages codebase, make the nginx change unconditional, and stop syncing Pages changes to pages-root.
I've started a WIP MR to show how gitlab-rails would serve the files here: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/24946
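To make the shape concrete, here's a minimal illustrative sketch of the kind of controller this implies. It is not the code in that MR; the controller name, pages root path, and host-to-namespace mapping are all assumptions, and it serves from the existing on-disk layout only because that's what we have today.

```ruby
# Illustrative sketch only (not the code in the WIP MR above). Names and
# paths are assumptions; the point is how little code "serve this file" needs.
class PagesController < ApplicationController
  # Assumed Omnibus-style location of the existing on-disk pages content.
  PAGES_ROOT = '/var/opt/gitlab/gitlab-rails/shared/pages'.freeze

  skip_before_action :authenticate_user! # public sites need no session

  def show
    # e.g. host "group.example.io", path "project/css/main.css"
    file = File.expand_path(File.join(PAGES_ROOT, namespace_from_host, params[:path].to_s))

    # Refuse anything that escapes the pages root - this covers part of what
    # chroot() does for the standalone daemon today.
    return head :not_found unless file.start_with?("#{PAGES_ROOT}/") && File.file?(file)

    send_file file, disposition: 'inline'
  end

  private

  # Hypothetical helper: "group.example.io" -> the "group" namespace directory.
  def namespace_from_host
    request.host.split('.').first
  end
end
```

The matching route would be a host-constrained catch-all pointing at this action, which is also where the nginx-level feature flag described above would direct traffic.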
Custom Pages domains
Pages support includes custom domains. These cannot be served through nginx -> workhorse -> gitlab-ce, because nginx doesn't hold the certificates for them; only gitlab-ce (and, at present, the pages daemon) does. We'd need to open another listener (or pair, for HTTP+HTTPS), equivalent to the gitlab-pages one, and make it either directly accessible on a secondary IP, or TCP-proxied to by haproxy.
Connections to this port would be handled by workhorse, in principle identically to how gitlab-pages handles them. However, it would ask gitlab-rails for the certificate and key matching the presented SNI server name, negotiate the TLS connection, and then serve the requests on that connection simply by passing them to gitlab-rails like any other pages-through-nginx-workhorse-and-rails request.
For me, this is the fun bit, so I put together the very start of a WIP MR here: gitlab-workhorse!362 (closed) - not functional at present, but I'm confident it can work.
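The rails side of that certificate lookup can be tiny. Here's a hedged sketch, assuming a workhorse-only internal endpoint (the route, parameter names, and authentication are made up) on top of the PagesDomain records that already store a certificate and key per custom domain:

```ruby
# Hypothetical internal endpoint: workhorse passes along the SNI server name
# it received, rails returns the matching certificate and key. Everything
# except the PagesDomain lookup is an assumption about naming and shape.
class PagesCertificatesController < ApplicationController
  skip_before_action :authenticate_user!
  # Real code would also verify the caller is workhorse, e.g. using the JWT
  # secret already shared between workhorse and rails.

  def show
    domain = PagesDomain.find_by(domain: params[:host])

    if domain&.certificate && domain&.key
      render json: { certificate: domain.certificate, key: domain.key }
    else
      head :not_found
    end
  end
end
```

Workhorse would presumably cache the result per domain rather than hit this endpoint on every TLS handshake.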
"Online view of HTML artifacts"
This feature becomes 100x simpler. Rather than having an external daemon act as an HTTP proxy to the GitLab API, the new "pages" rails controller simply returns a gitlab-workhorse send-archive: response, which would be respected in the usual way.
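A sketch of what that could look like, leaning on the Gitlab::Workhorse helpers the artifacts browser already uses (the helper name and call shape are quoted from memory, so treat them as assumptions):

```ruby
# Sketch: rather than proxying bytes itself, the controller answers with a
# Gitlab-Workhorse-Send-Data header and workhorse streams the requested entry
# straight out of the artifacts archive. Helper names are assumptions.
class Projects::PagesArtifactsController < Projects::ApplicationController
  def show
    build = project.builds.find(params[:job_id])
    entry = build.artifacts_metadata_entry(params[:path].to_s)
    return head :not_found unless entry&.exists?

    headers.store(*Gitlab::Workhorse.send_artifacts_entry(build, entry))
    head :ok
  end
end
```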
Access control for pages
This can continue to work exactly as it does now, with an OAuth application. Moving Pages into the codebase might make it easier to manage the authorization between the two components, but that is all.
Security
The Pages daemon has its own set of security measures, mostly focused on chroot(). These are fragile and not especially well understood. Some combinations of configuration simply don't work, and even when everything is working perfectly, successful attacks can still steal other sites' TLS keys, access private pages, etc.
If we're in-process in rails, we can more easily avoid filesystem operations, making the chroot() protections unnecessary. We never have to write anything to disk, because we have direct, uncontroversial access to the database.
Performance
Cost per uncached request probably goes up a bit. However, total Pages request volume is low compared to GitLab.com Rails request volume, and the content is heavily cacheable in any case.
Maintaining high-availability characteristics
Currently, gitlab-pages is a separate process, written in Go. This means that, in theory, it can be deployed independently of the main gitlab-rails service and survive some types of downtime - i.e., gitlab.com can be broken, but docs.gitlab.com continues to work. There is some disagreement over whether this is a feature worth preserving. The main interaction between pages and gitlab right now is an NFS mountpoint, which limits the types of HA deployment we can do.
We can preserve this characteristic while also eating gitlab-pages by including a bin/gitlab-pages script in the gitlab-ce repository. When invoked, this would start a listener on localhost and run a very stripped-down version of the main rails application, containing only the controllers and other code needed to serve pages on that listener. Workhorse then proxies Pages traffic to that listener, instead of to the main rails application.
An HA pages node would then run either workhorse plus this pages process, or just this process with a workhorse from elsewhere pointing at it. No Internet traffic would ever reach it directly.
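A very rough sketch of what that script could be, assuming a PAGES_ONLY-style switch that initializers would use to skip everything Pages doesn't need (no such switch exists today, and trimming the boot process down is the real work):

```ruby
#!/usr/bin/env ruby
# Sketch of bin/gitlab-pages. The PAGES_ONLY flag and the port are
# assumptions; the intent is a stripped-down rails boot serving only Pages.
ENV['PAGES_ONLY'] = '1'

require 'rack'
require_relative '../config/environment'

# Bind to localhost only: workhorse (local or remote) is the sole client,
# and no Internet traffic ever reaches this listener directly.
Rack::Handler::WEBrick.run(
  Rails.application,
  Host: '127.0.0.1',
  Port: Integer(ENV.fetch('PAGES_PORT', '8090'))
)
```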
We don't need this in the initial implementation - people relying on HA pages can continue to use the old pages daemon until it's implemented - but it does show we can retain this use case.
In terms of memory use, Pages daemons on GitLab.com currently use ~400MiB each, as they cache large amounts of the GitLab dataset in RAM. So there's lots of scope for this HA proposal to use less RAM at scale.
In the case of small self-managed instances, negligible additional memory is used when pages is disabled. When pages is enabled, they lose the RAM usage of the existing gitlab-pages daemon (say ~40MiB minimum) without needing to add the HA daemon proposed in this section, so the footprint actually decreases. At GitLab.com scale, total RAM usage may also decrease.
Summary
@jlenny @stanhu @mkozono @nolith @ayufan @grzesiek @northrup @jacobvosmaer-gitlab sorry-not-sorry to muddy the waters with yet another pages-related proposal, but I think now is a good time to consider this direction. Yes, it's a rewrite, but the extent of the changes we're already talking about making to the pages daemon is very broad.
Absorbing gitlab-pages into the main gitlab-ce codebase:
- Gives us a natural place to put a feature flag between old and new schemes
- Reduces the chance of regressions to ~nil while we're working on this
- Improves the maintainability of this feature for the long term
- Makes it easier to deploy GitLab (fewer moving parts)
- Doesn't harm the scalability of GitLab in itself
- Lowers the barrier to contributing to Pages
None of it is especially challenging from a technical point of view, and I think it deserves serious consideration. WDYT?
Task breakdown
If we're to go this route, I see the following tasks that need to be addressed. Roughly in order:
- Omnibus
  - Add a switch that turns off golang gitlab-pages in favour of new route
  - Modify nginx config so the traffic goes into workhorse instead
  - (custom domains) Modify workhorse config to get a second listener
  - (maintain HA) supervise gitlab-ce/bin/gitlab-pages, allow for it to be the only enabled process
  - (maintain HA) configure second Pages-only upstream in gitlab-workhorse
- Finish gitlab-workhorse!362 (closed)
  - Take "normal" Pages requests and proxy to gitlab-rails
  - (custom domains) Add a second listener
  - (maintain HA) allow a second upstream to be configured for pages requests
- Finish https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/24946
  - Re-implement existing gitlab-pages logic for serving files, but from object storage
    - Prerequisite: https://gitlab.com/gitlab-org/gitlab-ce/issues/45888
    - Prerequisite: https://gitlab.com/gitlab-org/gitlab-ee/issues/9346
    - Instead of this, we could try to serve from pages-root when an artifact isn't available, as a strictly legacy approach. Probably OK for Geo.
  - When a request comes in, determine artifact to serve from based on domain and path
  - If the request is for "online view of HTML artifacts", handle it as if it's the API request we proxy
  - Apply any access control checks
  - Serve the file if it exists, following all existing rules about:
    - symlinks
    - precompressed files
    - path inference / precedence rules (e.g. /foo -> /foo/index.html or /foo.html; see the sketch after this list)
    - ... all other existing pages features, one at a time ...
  - Custom 404 / other error page if necessary and present
  - (performance) maintain a temporary on-filesystem cache of served pages files
  - (maintain HA) Write bin/gitlab-pages
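As flagged in the path-precedence item above, here's a small sketch of the lookup order the rails code would need to reproduce. The candidate order is my reading of current gitlab-pages behaviour and should be verified against the Go implementation; the filesystem probe stands in for whichever backend (disk, artifacts archive, object storage) we end up reading from.

```ruby
# Sketch of the /foo -> /foo/index.html or /foo.html precedence rule.
# Candidate order is an assumption to verify against the Go daemon.
PAGES_ROOT = '/var/opt/gitlab/gitlab-rails/shared/pages'.freeze # assumed layout

def resolve_pages_path(path)
  candidates =
    if path.empty? || path.end_with?('/')
      [File.join(path, 'index.html')]
    else
      [path, "#{path}.html", File.join(path, 'index.html')]
    end

  # Stand-in probe: a filesystem stat today, an object-storage or
  # artifacts-archive lookup later.
  candidates.find { |candidate| File.file?(File.join(PAGES_ROOT, candidate)) }
end

resolve_pages_path('group/project/foo') # => "group/project/foo/index.html", say
```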