"Obviously the smaller the resources, the faster they will arrive in the browser. Currently we're not compressing HTML responses. Even with low gzip compression levels, we could be speeding up page loads, particularly on slow connections."
Proposal
Investigate if it makes sense to compress HTML responses, and if yes: do it.
Links / references
Documentation blurb
(Write the start of the documentation of this feature here, include:
Why should someone use it; what's the underlying problem.
What is the solution.
How does someone use this
During implementation, this can then be copied and used as a starter for the documentation.)
@marin, LOL... but I was hoping that the linked issue #33501 (closed) would provide the context you were seeking. The idea is to improve the performance of GitLab.com from the user's perspective by gzip-compressing the HTML that is sent back to users' browsers. It was suggested to me that this fits with the Build team, perhaps because it would require packaging a compression tool to run in some logical step of the process (Workhorse?).
@andrewn can you provide further insight into your original point / idea, since I don't seem to be doing it justice?
My original point was that we should lightly compress dynamically generated text/JSON/HTML data on its way out of nginx using the `gzip on` directive.
Additionally, AFAIK, assets are being served up from Workhorse.
Is there a reason that we don't serve these up directly from nginx?
Once/if we're serving assets directly from nginx, we should use gzip_static on static assets and pre-compress them with gzip -9 during the build process.
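Roughly, both suggestions together would look something like the sketch below (the asset root is an assumed path, not our actual config):

```nginx
# Dynamic responses: light, on-the-fly compression on the way out of nginx
gzip on;
gzip_comp_level 1;                       # light compression, low CPU cost
gzip_types application/json text/plain;  # text/html is compressed by default once gzip is on
gzip_vary on;

# Static assets: serve files pre-compressed with `gzip -9` at build time
# (requires nginx built with --with-http_gzip_static_module)
location /assets/ {
    gzip_static on;                      # serve e.g. application.js.gz when the client accepts gzip
    root /opt/gitlab/embedded/service/gitlab-rails/public;  # assumed asset root
}
```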
Is there a reason that we don't serve these up directly from nginx?
Yes. Nginx is not a hard requirement for the GitLab stack. You can switch it out for any other web server. Omnibus GitLab ships Nginx, but you can disable the bundled web server and provide your own. So Workhorse is the last component in the stack that we have control over when it comes to asset serving.
If nginx is part of the stack (as it is on GitLab.com), we can optionally get a performance gain by lightly compressing (level 1) content in nginx on its way back to the client.
Since we're moving to a CDN for GitLab.com, the amount of static asset traffic that we will be serving from gitlab.com should drop drastically. I recommend that we wait until the CDN is implemented before discussing this further.
Thanks for that quick analysis. I'm still curious to know what the gain would be for static assets without a CDN. Depending on the level of effort involved, we may get around to doing this before the CDN is up, or the CDN may benefit from it in any case?
It would be nice, if feasible, to do this at the workhorse level so that it works independent of any load balancer / proxy setup. I think there are some gems to do this.
This would also afford us more control in case we run into any issues and need to exclude certain content from compression.
@joshlambert configuring this in nginx is a few lines of configuration, whereas re-implementing it in golang involves developing and testing a lot of mission-critical code. The nginx configuration is highly configurable and can easily be turned on or off without disrupting activity.
I would say let's use nginx as a first step at least.
@ernstvn right now, nginx is forwarding static asset requests to Workhorse, so this is hard to change without writing code. However, it would be easy to terminate the static asset requests in nginx instead of forwarding them on.
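A rough sketch of what that could look like (the path and upstream name are assumptions, not our current config): terminate `/assets` in nginx and fall back to Workhorse for anything not on disk.

```nginx
location /assets/ {
    root /opt/gitlab/embedded/service/gitlab-rails/public;  # assumed asset root
    try_files $uri @workhorse;        # only files present on disk are served here
    expires max;
    add_header Cache-Control public;
}

location @workhorse {
    proxy_pass http://gitlab-workhorse;  # assumed name of the existing Workhorse upstream
}
```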
We do a similar thing in Gitter and it works very well and gives us the ability to configure precompressed gzip (and more recently brotli) assets.
Additionally, nginx is faster and more efficient than a Go implementation here. Microbenchmarking them (disclaimer: for what that's worth) shows nginx can handle about 2x as many static asset requests as Go's standard-library static file handler.
Without a CDN, we'll get a big advantage from terminating early when we can (i.e., when nginx is available). With a CDN, we'll still gain an advantage by lowering the latency on CDN cache misses.
@andrewn I get the feeling that your comments are neglecting the fact that we have to ship this to everyone. This is not only a few lines of change that can be applied on GitLab.com. That is, unless we decide as a company to treat everyone with a different proxy differently. A lot of the Workhorse work was done to remove the dependence on web-server-specific features. Correct me if my feeling is wrong :)
@marin thanks for requesting clarification! I'm aware that nginx is not available everywhere, but for those environments where it is available, I feel we should take advantage of its features.
Not every site has the same scale requirements as GitLab.com, but for those that do, I feel we should leverage nginx rather than attempting to rebuild it, feature-by-feature, in Workhorse.
What I like about this setup is that, when nginx is running, it'll serve up static content directly, without forwarding each request to Workhorse. When nginx is not available, Workhorse will continue to serve up the static content exactly as it does at present.
The downstream configuration remains the same; nginx simply offloads the static requests when it's there.
Does this make sense, or do you think I have an over-simplified view of the current configuration?
@andrewn I think we certainly all agree that this is something that will help our users, it will reduce bandwidth costs on both sides as well as increase performance and responsiveness. While smaller installations may not get as big of a bang for the buck, the performance/responsiveness benefits of faster asset downloads will still be a nice win.
In accordance with our goals of having best practices enabled for everyone and of shipping what we run on GitLab.com, we'd want to make this the default mode of operation for all of our users.
The challenge with relying on nginx for this is that there are just a huge number of ways to deploy GitLab, and some do not include it. Features and optimizations we put into a non-core service (like nginx) mean that some customers may not receive the benefit, and they further complicate our installation process. There will be more that customers have to do to optimize their reverse proxy, and we'll also have to sprinkle this across our documentation, complicating that further.
So it's going to be a tradeoff between what I think is best for customers and the ease of spinning up a "best practice" GitLab installation, versus the effort spent to add it to Workhorse and how much of a CPU cost it may impose on that layer of the service.
My concern is that we're rebuilding nginx in Workhorse, and that seems like a classic case of NIH.
If the effort involved in both approaches were of a similar magnitude, I might agree with you, but it's definitely not.
To illustrate, here's the complete Gitter nginx compression configuration. It's only a dozen lines of configuration, but it covers a huge amount of complexity...
```nginx
## Handle static asset compression differently from generated responses
## All non-image assets are served from `.gz` and `.br` files precompressed on
## the file-system, for faster response with least CPU overhead.
location /assets/ {
    ...
    brotli_static on;            # Use statically precompressed brotli assets for modern browsers (smaller than gzip)
    gzip_static on;              # Use statically precompressed gzip assets for browsers that accept compression
    gzip_http_version 1.0;       # Allow compression for HTTP/1.0 requests and later
    gzip_proxied any;            # Allow proxies to accept compressed responses
    gzip_disable "MSIE [1-6]\."; # Ignore these browsers if they request compressed assets
    gzip_vary on;                # Send `Vary: Accept-Encoding` so downstream caches store variants separately
    ...
}

## Application generated content
location /api/ {
    ...
    gzip on;                     # Compress responses on the fly
    gzip_types *;                # If there's a chance that we're serving up precompressed data, we would turn it off here
    gzip_comp_level 2;           # Configure the compression level
    gzip_proxied any;            # Allow proxies to accept compressed responses
    ...
}
```
Looking at the configuration options listed above, here's a list of things that we would need to be aware of if replacing nginx with Workhorse compression:
NB: Make sure that we're honouring ETags and HTTP conditional requests. These are very important for static assets, as telling a browser that it already has the right data (and therefore skipping the transfer entirely) is always going to be fastest.
NB: Make sure to honour the Vary header, otherwise downstream caches will serve compressed assets to clients that don't accept compression and vice versa. Also, once these caches are corrupted, it's very difficult to fix (they are not under our control).
NB: Static assets should not be compressed on the fly as the content doesn't change, so recompressing it on every request is a waste of resources, and slow.
(optional, but nice) Brotli compression (which achieves much higher compression rates) is nice to include if the browser accepts br encoding.
(optional, but useful) Don't use HTTP compression on old HTTP clients
(optional, but useful) Don't use HTTP compression for certain buggy browsers
ngx_http_gzip_filter_module.c gives you an idea of some (but not all) of the complexity involved in doing this right.
In the interests of delivering a minimum viable change, let's use nginx when we have it.
This will make a huge difference to GitLab.com customers (and also to other customers using nginx) for relatively little effort compared to rebuilding it ourselves.
Enable it to work with NGINX - can be completed by next release (?) by 1 (?) developer.
Enable it to work from Workhorse - would require (how many? 2?) developers for (how long? 1 month?) to be shippable. We'd have to consider the opportunity cost as well: what would these developers otherwise be working on?
If the estimate is 1 dev for 1 mth vs 2 devs for 1 mth, I'd say let's pick option 2, but @andrewn perhaps you want to revise my estimates upward?
Gzip can improve performance a lot, so I would also rather have it sooner, at least for some users, by going with NGINX very quickly (which brings gzip and some other nice things for web serving to the table) and replacing it at a later stage. But you have the better insight.
We should also look closely at delivering JSON through gzip.
It is not a big deal for static assets, as they should not contain security tokens.
Data such as static assets, images, and build artifacts can and probably should be compressed; everything that is dynamic (API JSON, HTML) should not be.
If the estimate is 1 dev for 1 mth vs 2 devs for 1 mth, I'd say let's pick option 2, but @andrewn perhaps you want to revise my estimates upward?
I'm sure 2 devs could deliver something that worked within a month, but let's consider the opportunity costs:
This change will benefit high-volume/high-traffic installations the most
High volume installations will almost certainly already be running nginx in front of their Workhorse setup.
Even if we implemented it in Workhorse, it still makes sense (for high-volume sites) to terminate asset traffic in nginx because it's just faster (fewer hops, a highly optimised C implementation).
So, with this in mind, and sticking with the philosophy of Minimum Viable Change, I still think that delivering this as an nginx config makes the most sense.
High volume installations will almost certainly already be running nginx in front of their Workhorse setup.
We have customers running Apache and other things between Workhorse and their load balancer, so saying "almost certainly" is imprecise at best.
I disagree that we should do it this way, but I don't care about it that much anymore. Just to stress one final time: the more differences in behaviour we have between different installations, the more issues we will encounter and the more problems we will have supporting users. We stabilised the NGINX config (fewer changes in the configuration) for a reason, that reason being that any change we made in the web server config caused issues for our users over prolonged periods of time (catching up with upgrades).
@andrewn If you want to take this on and do the MVP with your team, including testing on GitLab.com and communicating the change to the rest of the community, you are free to do so. At the time of writing, the Build team has no capacity to do this.
I'm sure 2 devs could deliver something that worked within a month
Great! The question - as always - comes down to people and priorities :-)
I assume that for this work we would need to rely on Golang developers... @stanhu how would you imagine staffing this effort / where would it fit in the stack of priorities?
@timzallmann can you provide some more insight into how enabling this would cut down total time from "web request made" to "page interactive" ?
Just to stress one final time: the more differences in behaviour we have between different installations, the more issues we will encounter and the more problems we will have supporting users.
I fully agree with this, and it has indeed been the philosophy all along.
After conversations with @stanhu and @sytses, I am now changing my opinion. We should indeed simply enable gzip in NGINX, as @andrewn was arguing. The reasoning is already laid out by others in this issue... principally the ease of doing this vs. trying to rebuild gzip in Workhorse, plus avoiding the need to maintain something built in-house.
The counterargument that I was weighing heavily -- that we wish to ship the benefits of these decisions to all users / customers -- still applies; but it only applies to customers and users who use an official installation method. Since NGINX is part of our GitLab.com setup, it is what we're using as the standard; see https://docs.gitlab.com/omnibus/settings/nginx.html and note the absence of an Apache version of that page.
Furthermore, enabling gzip in NGINX is the most boring solution, and the smallest iteration.
So with this in mind, who should pick up this task @stanhu ?
I understand the concerns of the build team, but I think for something like gzip that is well-tested and implemented in every major Web server, it makes sense to add the configuration in NGINX. There is a lot of cost in maintaining our own gzip support in Workhorse, and a lot can go wrong if it doesn't work properly. We should document the optimizations for Apache and other Web servers, but things should still work fine if they don't support it.
As @andrewn mentioned in https://gitlab.com/gitlab-org/gitlab-ce/issues/33719#note_33338312, this seems like a fairly straightforward task of tweaking the NGINX config to compress assets. I did a quick test on GitLab.com; however, I noticed the /assets folder is already being compressed with gzip:
@stanhu so our static assets are already always gzipped? :-( There goes some perceived low hanging fruit. And applying gzip to html and json is currently not advised because of the vulnerability that @ayufan highlighted... @briann appreciate your take on the vulnerability.
@timzallmann I guess this means that working on smart loading (webpack, code-splitting, async fetching, etc.) and smart asset sizing (pictures) is the next best thing?
@stanhu @ernstvn One way to mitigate BREACH is to selectively compress based on the content of the Referer header. If the referrer matches the instance URL, enable compression; otherwise don't.
I need to better understand the BREACH attack, but from my understanding it does not apply in our case, as it's not feasible for the attacker to inject arbitrary content into our pages.
If you look at almost any major website, they're using br or gz content-encoding over https.
@andrewn The theory is that a malicious website that can trick a user into visiting it, and can monitor that user's traffic to GitLab.com, can use custom-crafted links back to GitLab.com and watch the size of the responses to slowly determine their content, including sensitive data.
That's why checking the referrer and only compressing requests with a referrer that matches the GitLab instance is an effective counter-measure.
It sounds impossible but a malicious wifi gateway or proxy could do it.
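For illustration, a rough nginx sketch of that counter-measure (the instance URL and upstream name are placeholders); `gzip` can be set inside `if` in a location, so compression is simply switched off when the Referer doesn't match:

```nginx
location / {
    gzip on;
    gzip_comp_level 1;

    # Only compress when the request was initiated from our own pages.
    # Note: requests with no Referer at all (direct navigation, API clients)
    # will also go uncompressed under this rule.
    if ($http_referer !~ "^https://gitlab\.example\.com/") {
        gzip off;
    }

    proxy_pass http://gitlab-workhorse;  # assumed upstream name
}
```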
@briann so I've been investigating this further, and it appears that Rails has had built-in support for BREACH mitigation since August 2014.
They do this by XOR'ing the CSRF token with a random one-time pad, so the token bytes on the wire change on every response; this prevents the attack because the attacker can't converge on the secret by comparing compressed response sizes.
Enable it to work from Workhorse - would require (how many? 2?) developers for (how long? 1 month?) to be shippable.
I was looking at the gitlab-workhorse repository for something unrelated and found this: gitlab-workhorse!177 (diffs). Does this mean that Workhorse already supports this?
@marin assuming Jacob says "yes", then of course we switch it on there (right? why not?). But assuming Jacob says "no", or "not really", or anything less than a definitive "yes", we should just continue with the path of switching it on in NGINX. How much effort is involved in that? I believe it is fully on your team?
Someone on your in-house team may want to review and understand https://tom.vg/papers/heist_blackhat2016.pdf for more context on the compression attack surface with user-generated data / known-plaintext data. All of the GitLab static assets represent known plaintext.
The only truly effective mitigations I am aware of involve cipher suites that implement forward secrecy, in conjunction with the server injecting random-sized / random data into each response (could be via an HTTP header, an HTML meta tag, etc.).
Right, thanks for that article. The CSRF issue mentioned above is addressed by the Rails CSRF mask, but I think we could do more by adding the nginx length_hiding module:
This module appends a randomly generated HTML comment to the end of the HTML response to hide the correct length and make it difficult for attackers to guess secret information.
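If we went that route, the config would presumably look something like the sketch below; the `length_hiding` and `length_hiding_max` directive names are my recollection of that module's README, so treat them as assumptions to verify:

```nginx
# Requires nginx built with the third-party length-hiding filter module
server {
    listen 443 ssl;
    ...
    length_hiding on;        # append a random-length HTML comment to HTML responses
    length_hiding_max 2048;  # assumed upper bound on the padding size
}
```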
I think @stanhu is on the right track here. If you read the article @techguru posted, it shows that compression is not needed to exploit these types of vulnerabilities. The only solution offered is to disable third-party cookies, which we've discussed in the past and dismissed. I'd like to do some more research on the libraries Stan has linked.
@stanhu Are there reasons why we think checking the referer header wouldn't provide enough benefit? Do we expect most of the requests that would benefit from compression to not be including a proper referer, automated tasks and such?
While compression based on referer headers wouldn't provide as much benefit as compression + padding it would be a lot easier to implement quickly and wouldn't make us more vulnerable to the non-compression attacks.
Are there reasons why we think checking the referer header wouldn't provide enough benefit? Do we expect most of the requests that would benefit from compression to not be including a proper referer, automated tasks and such?
Wouldn't a compression-based attack just forge the referer header then?
You can "spoof" it for your own browser. As in, you can run curl with a custom referer header. But a malicious website cannot ask you to request a resource from another server using a spoofed header. Not without a browser vulnerability that allows it.
@stanhu As Paul Charlton mentions in the MR, that won't work for us. It might work for Omnibus installs on a single host, but our fleet FQDNs are set to their hostnames.
It would be nice if we could simply check whether the Referer hostname matched the Host header. That way nginx wouldn't even need to know the primary URL. Unfortunately, we don't yet verify the Host header.
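A hedged sketch of that idea, assuming we do start verifying the Host header: extract the hostname from the Referer with a `map` and only compress when it matches `$host`:

```nginx
# Pull the hostname (minus any port) out of the Referer header; empty if absent
map $http_referer $referer_host {
    default                        "";
    "~^https?://(?<rhost>[^/:]+)"  $rhost;
}

server {
    ...
    location / {
        gzip on;
        if ($referer_host != $host) {
            gzip off;   # missing or foreign Referer: skip compression
        }
        ...
    }
}
```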
@stanhu Currently I don't see any compression on dynamic text responses (neither HTML nor JSON). So will we never enable this due to the security discussion, or what is our current status? I constantly get reminded by all the performance tools that we could potentially save 80% of our bandwidth on those types of documents, and I only saw the still-open discussion in your MR.
Currently I don't see any compression on dynamic text responses (neither HTML nor JSON). So will we never enable this due to the security discussion, or what is our current status?
I think we are stalled on the security implications for turning on compression for all dynamic content. It's easy to enable it in NGINX; the question is whether we've done enough to mitigate possible attacks.
Disabling third-party cookies is the most effective solution that I know of. Second is disabling compression, which helps mitigate BREACH but not HEIST, and is not practical IMHO. Although, could it become a configurable option for on-prem?
Considering we don't use third-party cookies on GitLab.com, and we've deployed the built-in Rails BREACH defence mechanisms, is the security group willing for us to go ahead with compressing HTML responses on GitLab.com?
This will make a big difference to performance on GitLab.com.
Thanks from my side too; this would make me, as an FE engineer, very happy. It would definitely be a huge performance boost. All my measurements and tools tell me that, at the moment, compression and image resizing are the biggest available performance improvements and would outweigh any programmatic frontend improvement we could make by a big factor. Compressing JSON responses especially would definitely help there.
So the earlier we can get this the better it would be.
My understanding here is that we are looking to do this just on GitLab.com, or will we be shipping this in the Omnibus package?
Do we need to consider the impact on self-hosted customers as well @timzallmann? If so, can you make sure your security assessment mentions if it's applicable for self-hosted or not, and what we expect there.
I believe the BREACH attack would also affect self-hosted.
For now since there's no industry standard or best practice for resolving this, let's accept this risk and move forward with adding compression. We can revisit it once further developments on BREACH and HEIST have been made.
In the meantime, I think we should pursue a simpler mitigation for both BREACH and HEIST by using same-site cookies. Though only Chrome and Firefox currently support them, it should be a small, effective step in the right direction considering web browser market share.