We need to consider two cases: regular public files, and private projects.
We can create a strong ETag by combining the SHA of the branch/tag/ref, the filename (or a hash thereof), and the value of the content-encoding header if present. For example, a link like https://gitlab.com/T4cC0re/freebsd-ci/-/raw/master/installer/init.sh should produce these headers:
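For illustration, a strong ETag along those lines could be derived like this (the helper name and digest choice are hypothetical sketches, not GitLab's actual implementation):

```ruby
require 'digest'

# Hypothetical helper: derive a strong ETag from the resolved ref SHA,
# the file path, and the negotiated content encoding (if any).
def strong_etag(ref_sha, path, content_encoding = nil)
  digest = Digest::SHA256.hexdigest([ref_sha, path, content_encoding].compact.join(':'))
  # Strong ETags must be quoted and must NOT carry the W/ prefix.
  %("#{digest}")
end

etag = strong_etag('a1b2c3d4', 'installer/init.sh', 'gzip')
```

Note how a different content-encoding yields a different ETag, which keeps the tag encoding-aware as described above.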
The expires header is rendered irrelevant by max-age=0 and must-revalidate (but can be kept if desired). no-cache and no-store would prevent caches from caching, but that is not what we want: we want everyone in the cache chain to revalidate using the ETag, so must-revalidate replaces them. pragma: no-cache is deprecated but still used by HTTP/1.0 caches; the value specified here is only consulted in HTTP/1.1+ when cache-control is not set.
For a private resource we just need to change the headers a tiny bit:
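For illustration, the private variant could look like this (the ETag value is a placeholder):

```
cache-control: private, max-age=0, must-revalidate
etag: "<strong-etag>"
```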
Changing the cache-control to private will ensure caches along the way do not cache the response, but the browser cache may. It will need to revalidate before using the resource. This means bypassing caches along the way and revalidating directly at the origin.
Good news, I found the root cause of the XSS issues so we should be able to move forward with this after that is fixed in !47225 (merged).
For private projects, we actually do not send no-store for anonymous users. So it's already correct. But we should change this so that the same applies to signed-in users.
For public projects, same thing, the cache headers are actually correct for anonymous users though it's still a weak ETag. Shouldn't be too hard to make this a strong etag and add extra keys that Rails uses to generate the ETag digest though.
Currently, the ETags are generated from the Git blob ID. So I think we don't need to use the file path? I'm also not sure if the ETags Rails generates are already content encoding aware.
@engwan The thing is that without strong ETags we cannot really enable optimized cache rules, as we need the Respect Strong ETags setting to be able to cache ranged requests.
Currently, the ETags are generated from the Git blob ID. So I think we don't need to use the file path?
Sure, if they are stable, that is fine. I just made the suggestion based on the current commit hash and file name as it seemed easy, but if we have another unique identifier (one that is guaranteed to be updated if the ref is updated and the file is touched in that update), that also works, as long as we can make sure to keep it encoding-aware.
I'm also not sure if the ETags Rails generates are already content encoding aware.
We would either need to check that, or force-disable content encoding. Because we do not specify no-transform, Cloudflare would compress it (and only cache the uncompressed version, instead of multiple differently compressed ones).
Perhaps we should make this issue about switching to Strong Validated ETags?
If that's the goal, could you provide a link to some CloudFlare documentation that we can read that will help us understand why strong ETags are preferable in this case? I ask because the cost of calculating strong ETags is higher than that of weak ones, so it's important to be able to quantify the upside.
Strong ETag headers ensure the resource in browser cache and on the web server are byte-for-byte identical. Domains on Enterprise plans enable strong ETag headers via a Respect Strong ETags Page Rule. Otherwise, strong ETag headers are converted to weak ETag headers. Also, set a strong ETag header in quotes (Etag: "example") or Cloudflare removes the ETag instead of converting it to a weak ETag.
... my (faint) understanding of this is that CloudFlare will handle the conversion from weak etags (offered by the upstream) to strong etags (calculated by CloudFlare).
I am not completely sure that this is something that would offer us any particular advantage though.
This means weak etags prevent caching when byte range requests are used, but strong etags mean range requests can still be cached.
I think that would be a small percentage of the requests we get for these endpoints? I think we don't even send the Accept-Ranges header for raw endpoints so we don't really get ranged requests?
I think that would be a small percentage of the requests we get for these endpoints? I think we don't even send the Accept-Ranges header for raw endpoints so we don't really get ranged requests?
@engwan But this is exactly the wrong thing IMO. There have been multiple cases where large files are served via the raw endpoints (and artifacts too for that matter). Making those cachable in Cloudflare and allowing ranged requests reduce the work our backend needs to do. Those are the heavy hitters we established the Cloudflare cache rule for in the first place.
... my (faint) understanding of this is that CloudFlare will handle the conversion from weak etags (offered by the upstream) to strong etags (calculated by CloudFlare).
@andrewn Not quite. If the Respect Strong ETags setting is enabled, strong ETags (if set) will be passed through. Otherwise, ETags of any sort provided by the origin will be converted to weak ETags. Those then are not cachable for ranged requests. If no ETags are present, none will be created.
Good point. In that case, weak "semantic" tags are much more likely to be cacheable, and therefore should be favoured in almost all cases.
As mentioned above: currently, the ETags are generated from the Git blob ID. So we already have an identifier that is strong/unique; we just need to use it as a strong ETag to gain the benefits.
If you are referring to git refs as semantic tags, I would discourage that: currently, a raw download from master would serve the latest content (as it would with strong ETags), whereas a cache keyed on the git ref might deliver stale resources.
The git blob ID also stays consistent across git refs unless the file itself changes. So the items become cachable under an identifier that only changes when the actual file changes, not when the ref in the URI path is updated.
But this is exactly the wrong thing IMO. There have been multiple cases where large files are served via the raw endpoints (and artifacts too for that matter).
Please can you quantify this.
Not quite. If the Respect Strong ETags setting is enabled, strong ETags (if set) will be passed through. Otherwise, ETags of any sort provided by the origin will be converted to weak ETags. Those then are not cachable for ranged requests. If no ETags are present, none will be created.
@T4cC0re please could you reference the CloudFlare documentation on this, it would help me understand better.
FWIW, I've just checked and the "normal" non-range-request is working as we expect it to:
```
$ curl -vi https://gitlab.com/gitlab-com/gl-infra/scalability/-/raw/master/README.md
HTTP/1.1 200 OK
Etag: W/"ef22cb5eec628a96afb97262c3b3cacc"
...
$ curl -vi -H 'If-None-Match: W/"ef22cb5eec628a96afb97262c3b3cacc"' https://gitlab.com/gitlab-com/gl-infra/scalability/-/raw/master/README.md
HTTP/1.1 304 Not Modified
CF-Cache-Status: HIT
```
I also checked the workhorse logs and confirmed that the conditional hit did not hit the GitLab servers.
Weak ETags are passed through. If the Respect Strong ETags setting is not enabled, all ETags we provide (even strong ones) would be converted to weak ETags.
The beauty of strong ETags is that you don't need to specify them to Cloudflare when requesting the file, and that ranged requests work. Ranged requests might not be the feature we need, but they can offload strain. For example, when somebody downloads a larger file on a slow or flaky connection, it would allow them to resume the download instead of starting over (and once again causing strain on gitaly).
In general, it would make us more resilient to situations where we are being taken advantage of as a CDN substitute. Those have caused a few incidents in the past. With improved caching, we could dodge those bullets. This would still mean we need to engage with the account hosting the file but can do so without any impact on other users on the same gitaly shard.
This of course requires us to be able to determine the strong ETag with little effort.
One example using one of my own deployments: The first request with a unique ETag (and strong ETags being respected by CF) would report as cf-cache-status: MISS.
Bonus: With stale-if-error we could even serve traffic when we are offline for whatever reason. stale-while-revalidate should not be used, as the underlying ref might change in our case.
It looks like this issue needs to be triaged and completed before we can tackle some outstanding security issues we have: #324408 (closed).
Would it be possible to get this issue triaged by the right team? I don't see any answer to #263246 (comment 441912926) so do we know the right team to take ownership of this?
@oregand - catching back up here - I think we need to resolve the discussion in this thread (above) #263246 (comment 471664666). It sounds like @andrewn and @engwan were not convinced about moving to strong ETags, and we need to decide, based on the docs linked and the numbers above, whether this is needed.
Would it be possible to get this issue triaged by the right team?
Since this is the raw endpoint, this may be best handled by the source code team.
After re-reading the threads above, I still don't understand fully the advantages we get from switching to strong etags. In #263246 (comment 471663985), it is mentioned that Cloudflare can cache them even without sending the If-None-Match header. But I think that's because of cache-control: public, max-age=14400 and not due to the strong etag.
@engwan This mainly helps us, when someone downloads from the raw endpoint. Because in those cases it is not very uncommon for the client to send a ranged request. With a weak etag, those will not be served from the cache.
This is especially important with any CDN-like abuse of the raw endpoints: when we have strong ETags and Cloudflare caches the request, repeated requests will not hit our infrastructure, relieving stress on gitaly and all parts in between.
@dsatcher - I think we need to reconcile the conversation/question that @engwan and @T4cC0re just had above about the need for strong ETags. Having the ETags does sound like something that would help prevent abusive situations before we were able to roll this out.
@sean_carroll I guess the best way to solve this would be to have strong ETags on everything resolving to git, but yes. Those raw endpoints and the archive endpoints (https://gitlab.com/gitlab-org/gitlab/-/archive/master/gitlab-master.zip, etc.).
Added note: we might also want to include the project's visibility in the ETag (such as "<Commithash>-public"). That way, if the visibility is changed to private, the ETag would change, invalidating the cache and thus addressing #324408 (closed)
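A hedged sketch of that idea (the helper name is illustrative, not existing GitLab code):

```ruby
# Hypothetical: include the project's visibility in the strong ETag so a
# visibility change (e.g. public -> private) invalidates cached copies.
def blob_etag(blob_id, visibility)
  %("#{blob_id}-#{visibility}")
end

public_tag  = blob_etag('ef22cb5e', 'public')
private_tag = blob_etag('ef22cb5e', 'private')
```

Because the two tags differ, a cache revalidating with the old "public" ETag would get fresh content rather than a 304 after the visibility flip.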
Hi @oregand, this issue was not on @sean_carroll's priority list for type::maintenance, and we picked from the top of that list when we created the plan for %15.3. It is, however, in the overall prio-list, but under bug, as I probably wrongly assumed it is a bug. But also there it is further down.
It may have fallen through the cracks. I am very sorry.
Since you are currently standing in for Sean, how would you see the priority of this issue in relation to the others? I personally see the current number 1 as higher than this issue here, and we target that for %15.4 (without really knowing if capacity will allow us to do so).
Based on #263246 (comment 1039111855), it would be ideal if we could schedule this for %15.3 or %15.4 but since we seem to have quite an overload of work already in flight @dsatcher is digging into things to see if we can request assistance here for this issue
@T4cC0re yes, it's a blocking issue, so we need to prioritize it
I've read through the description and the discussion; just to summarize: does it basically mean that we need to change this method to be something like:
```ruby
def cached_blob?(blob, project, allow_caching: false)
  stale = stale?(strong_etag: [blob.id, project.visibility])

  # The #stale? method sets cache headers.
  # Because we are opinionated we set the cache headers ourselves.
  response.cache_control[:public] = allow_caching
  response.cache_control[:max_age] = 0
  response.cache_control[:must_revalidate] = true

  !stale
end
```
Or, in the case of a request with accept-encoding: gzip that was responded to with content-encoding: gzip
could you please also clarify whether it's expected to be handled by GitLab Rails or whether it's an additional functionality provided by another layer (for example, by NGINX)?
@igor.drozdov If my naïve understanding of the code is correct, then yes.
Assuming that every response will have a strong ETag and that the blob.id will be unique for each version of a blob.
For response.cache_control[:public] I don't quite know what it does, but it should render as public if the resource is not confidential/private, and private if it points to something that is private. This will ensure that only content that is publicly accessible will be cached by a CDN (cache-control: public).
The main concern about max-age being 0 is that a resource could change from public to private. But this is not ideal caching-wise.
Considering we have must-revalidate set, we should probably also set stale-if-error=300, so that the CDN will serve the stale response for 5 minutes if the origin encounters a 5XX error during revalidation. If we can live with a small turnaround time, we could set s-maxage=60 and stale-while-revalidate=60 to further strengthen the effectiveness of the CDN, while keeping a moderate turnaround time.
The clients would always revalidate with the CDN (because of max-age=0), giving us the edge if we need to purge CDN caches, and the CDN would serve the same response for 60 seconds (because of s-maxage=60) without turning to the origin. A request 61 seconds after the initial retrieval (or last revalidation) would still be served the stale response, but would update the content from the origin in the background, so that subsequent requests are current again. If any of the revalidations fail due to 5XX errors, the stale response will be used for 5 minutes (stale-if-error=300). This should strike a neat trade-off between caching and content being up to date.
If we have any headers that must not be cached (such as x-request-id and other customer-specific headers), they should be added to the field Private (uppercased!) with comma separation. I am currently validating if and how this works exactly. I'll update my comment when I figure that out.
So ultimately a response might look like this for a public resource:
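Filling in that sketch with the directives discussed above (the ETag value is a placeholder, and the exact directive set is still up for discussion):

```
HTTP/1.1 200 OK
cache-control: public, max-age=0, s-maxage=60, must-revalidate, stale-while-revalidate=60, stale-if-error=300
etag: "<blob-id>-public"
```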
Oh, and another prerequisite is that if the CDN revalidates the request (sends a request with the if-none-match header and the current ETag), the origin MUST send back a 304 Not Modified instead of the content.
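A minimal, framework-free sketch of that revalidation handshake (the function is hypothetical; in Rails, `stale?`/`fresh_when` handle this for you):

```ruby
# Hypothetical origin-side check: compare the client's If-None-Match
# header against the resource's current strong ETag.
def respond_to_conditional_get(if_none_match, current_etag)
  if if_none_match == current_etag
    { status: 304, body: nil }                              # 304 Not Modified, no content
  else
    { status: 200, body: 'file contents', etag: current_etag }
  end
end

etag = %("abc123-public")
revalidation = respond_to_conditional_get(etag, etag)  # CDN presents current ETag
first_fetch  = respond_to_conditional_get(nil, etag)   # no ETag, full response
```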
@T4cC0re the !98110 (merged) has been merged and deployed, and the feature flag is enabled on staging and production. Could you please have a look at whether it works as expected and reopen this issue if any more work is required?