[Proposal] Better ETag Support for Resources

Resources

Description

This issue proposes a conditional HTTP caching scheme with very low-cost cache hits. It was originally discussed as part of https://gitlab.com/gitlab-org/gitlab-ce/issues/26396.

Using ETag Caching

Background

It is assumed that the reader is familiar with the following concepts.

Assumptions

1. ETag caching works best with RESTful resources: APIs that map resources using the standard RESTful pattern (GET /, GET /:id, POST /, etc.).
2. (Related to 1) ETag caching should ideally only be used with HTTP GET calls. Although the HTTP specification does allow responses to HTTP POST requests to be cached, this is not widely used and can be confusing.
3. (Related to 1 & 2) ETag caching should only be used with read-only operations.
4. Any two HTTP endpoints with the same URL refer to the same resource. This solution will not work for HTTP endpoints where multiple resources are represented by the same URL. For example, the GitLab API's Get Current User endpoint (GET /user) would not be able to use the proposed solution, since the endpoint implies the authenticated user and therefore refers to multiple user resources under a single HTTP endpoint.

Why use Conditional HTTP Requests?

Currently, GitLab uses HTTP polling to check for modifications to resources. While an ideal solution to this problem would use some sort of push technology (WebSockets, HTTP long polling, Bayeux, etc.) to avoid polling altogether, an interim solution would be to make cache hits (i.e., an HTTP poll where the server-side resource has NOT changed) very cheap and therefore scalable.

At present, cache hits and cache misses incur the same performance and server resource penalties.

Built-in Rails Support for ETag Caching

Rack and Rails have moderately good built-in support for caching via the Rack::ETag and Rack::ConditionalGet middlewares.
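One limitation of the built-in approach is worth spelling out: Rack::ETag derives the ETag from the response body, so the application still has to produce the full response before the request can be answered with a 304. The sketch below imitates that flow in plain Ruby without loading Rack. The class name `BodyDigestEndpoint`, the hard-coded JSON body and the choice of MD5 are all assumptions made for illustration, not Rack's actual code.

```ruby
require "digest/md5"

# Hypothetical endpoint that imitates body-digest ETags (not Rack itself).
class BodyDigestEndpoint
  attr_reader :render_count

  def initialize
    @render_count = 0
  end

  # Returns [status, headers, body], like a Rack response triple.
  def get(if_none_match = nil)
    body = render_body # the expensive part runs on every request
    etag = %(W/"#{Digest::MD5.hexdigest(body)}")
    return [304, { "ETag" => etag }, ""] if if_none_match == etag

    [200, { "ETag" => etag }, body]
  end

  private

  # Stand-in for database queries and serialization.
  def render_body
    @render_count += 1
    '[{"id":10,"status":"success"}]'
  end
end

endpoint = BodyDigestEndpoint.new
_status, headers, _body = endpoint.get
status, _headers, body = endpoint.get(headers["ETag"])
# status is 304 and body is empty, yet endpoint.render_count is 2:
# the cache hit saved bandwidth but not server-side work.
```

This is the cost the rest of the proposal tries to avoid by answering cache hits from a stored ETag instead of a freshly rendered body.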

First Iteration Proposal: A Rails Only Solution

As a first step, keeping the solution in a single package would allow us to move fast. Once we're confident that it works well, we could further improve it by utilising workhorse to check for cache hits, skipping Ruby altogether. But for now, let's focus on the first iteration.

For the rest of this proposal, I'll be using the GitLab Pipelines API in my examples. Refer to the documentation for more information on this API.

Until my Ruby, Rails and Grape knowledge improves, I'll use pseudocode rather than Ruby (sorry about this!). This solution could probably be more elegantly implemented using a mixin, middleware or some other mechanism. What I'm trying to focus on here is the process, not the implementation.

We'll use a class with several static methods:

ConditionalResources::is_none_match_valid_for_resource(request, response)
ConditionalResources::set_last_update_for_resource(request, response, last_updated)
ConditionalResources::invalidate(paths)

To add conditional caching to a route, we check whether the client has presented a valid If-None-Match header and, if so, return an HTTP 304 Not Modified early.

if ConditionalResources::is_none_match_valid_for_resource(request, response)
  # Cache hit: respond with 304 Not Modified and an empty body, then return early
  head :not_modified
  return
end

is_none_match_valid_for_resource does the following:

If the request does not include an If-None-Match header, return false immediately.
If an If-None-Match header has been presented, get the resource path from the request, e.g. /projects/5/pipelines, and use it to look up the Redis string for this path: etag:/projects/5/pipelines.
If and only if the Redis key exists AND its value matches the header, return true.
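The three steps above can be sketched in Ruby as follows. This is a minimal sketch, not a finished implementation: a plain Hash passed as `store:` stands in for Redis, and the `Request` struct is a hypothetical stand-in for the real Rails or Grape request object. The `response` argument from the proposed signature is accepted but unused here.

```ruby
# Hypothetical minimal request object: only the fields this check needs.
Request = Struct.new(:path, :headers)

module ConditionalResources
  # Returns true only when the client presented an If-None-Match value that
  # equals the ETag stored under "etag:<path>". `store` is a plain Hash
  # standing in for Redis (an assumption to keep the sketch runnable).
  def self.is_none_match_valid_for_resource(request, _response, store:)
    presented = request.headers["If-None-Match"]
    # Step 1: no header, so no lookup is needed at all.
    return false if presented.nil? || presented.empty?

    # Step 2: look up the stored ETag by resource path.
    stored = store["etag:#{request.path}"]

    # Step 3: hit only if the key exists AND matches the header.
    !stored.nil? && stored == presented
  end
end

store = { "etag:/projects/5/pipelines" => 'W/"abc123"' }
fresh = Request.new("/projects/5/pipelines", { "If-None-Match" => 'W/"abc123"' })
stale = Request.new("/projects/5/pipelines", { "If-None-Match" => 'W/"old"' })

ConditionalResources.is_none_match_valid_for_resource(fresh, nil, store: store) # => true
ConditionalResources.is_none_match_valid_for_resource(stale, nil, store: store) # => false
```

Note that a missing key and a mismatched value are treated identically: both fall through to the normal (cache miss) route.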

On cache miss, the route proceeds as normal, but before returning, a call to set_last_update_for_resource must be made:

ConditionalResources::set_last_update_for_resource(request, response, last_updated)

set_last_update_for_resource works by:

Generating an MD5 checksum from the provided last_updated date.
Setting the response ETag header to W/"#{md5}" (HTTP requires the ETag value itself to be quoted).
Setting the Redis key (etag:/projects/5/pipelines) to the ETag header value. To prevent Redis from filling up with etag:* keys, a suitable TTL would be set on the key.
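A minimal Ruby sketch of these steps, under stated assumptions: `FakeStore` is a hypothetical in-memory stand-in for Redis with expiry, `Request` and `Response` are hypothetical structs, the one-hour TTL is an arbitrary placeholder, and the timestamp is serialized as ISO 8601 in UTC before hashing (the proposal does not specify a serialization). The weak ETag is written with a quoted value, as HTTP entity-tag syntax requires.

```ruby
require "digest/md5"
require "time" # for Time#iso8601

Request  = Struct.new(:path)
Response = Struct.new(:headers)

# Hypothetical in-memory stand-in for Redis with per-key expiry.
class FakeStore
  def initialize
    @data = {}
  end

  def set(key, value, ttl:, now: Time.now)
    @data[key] = [value, now + ttl]
  end

  def get(key, now: Time.now)
    value, expires_at = @data[key]
    return nil if value.nil? || now >= expires_at

    value
  end
end

module ConditionalResources
  ETAG_TTL_SECONDS = 3600 # assumed value; tune for the real workload

  def self.set_last_update_for_resource(request, response, last_updated, store:)
    # Hash a stable, timezone-independent rendering of the timestamp.
    md5 = Digest::MD5.hexdigest(last_updated.getutc.iso8601(6))
    etag = %(W/"#{md5}")

    response.headers["ETag"] = etag
    store.set("etag:#{request.path}", etag, ttl: ETAG_TTL_SECONDS)
    etag
  end
end

store = FakeStore.new
response = Response.new({})
ConditionalResources.set_last_update_for_resource(
  Request.new("/projects/5/pipelines"), response, Time.utc(2017, 1, 10, 12, 0, 0), store: store
)
# response.headers["ETag"] and the stored value are now the same weak ETag.
```

With a real Redis client, the value and the TTL would ideally be written in a single command so the key can never exist without an expiry.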

Finally, paths should be invalidated on model changes. This is where the invalidate method is used.

When a pipeline model is changed, it would use the after_save callback to invalidate any associated paths.

ConditionalResources::invalidate([
  "/projects/#{project_id}/pipelines",
  "/projects/#{project_id}/pipelines/#{pipeline_id}"
])

invalidate works by deleting any associated etag keys in Redis. In this example, it would execute the following command in Redis:

DEL etag:/projects/5/pipelines etag:/projects/5/pipelines/10

The result is that any future calls to those endpoints are forced to be cache misses (since the ETag value has been removed from Redis).
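As a rough illustration, invalidate could look like the sketch below. A plain Hash passed as `store:` again stands in for Redis, and `pipeline_paths` is a hypothetical helper for the paths a pipeline's after_save callback would pass in. The return value mirrors Redis DEL, which reports how many keys were actually removed.

```ruby
module ConditionalResources
  # Deletes the etag:* key for every given path and returns the number of
  # keys that existed. `store` is a Hash standing in for Redis (assumption).
  def self.invalidate(paths, store:)
    keys = paths.map { |path| "etag:#{path}" }
    keys.count { |key| !store.delete(key).nil? }
  end
end

# Hypothetical helper: the paths affected when one pipeline changes.
def pipeline_paths(project_id, pipeline_id)
  [
    "/projects/#{project_id}/pipelines",
    "/projects/#{project_id}/pipelines/#{pipeline_id}"
  ]
end

store = {
  "etag:/projects/5/pipelines"    => 'W/"aaa"',
  "etag:/projects/5/pipelines/10" => 'W/"bbb"',
  "etag:/projects/6/pipelines"    => 'W/"ccc"'
}

ConditionalResources.invalidate(pipeline_paths(5, 10), store: store) # => 2
# Only "etag:/projects/6/pipelines" remains in the store.
```

Invalidating a path that has no stored ETag is harmless, so the callback does not need to know which endpoints have actually been requested.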

Whiteboard Sketches

Cache Miss

Cache Hit

Performance Costs

A cache hit will cost one Redis call. In the future, this check could easily be migrated to Workhorse for zero Ruby cost on a cache hit.
A cache miss will cost two additional Redis calls (one for the GET at the beginning and one for the SETNX at the end).
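To make the call counts concrete, here is a small end-to-end sketch that counts store operations. `CountingStore` is a hypothetical in-memory stand-in for Redis and `handle_get` is a hypothetical, simplified route handler. TTL handling and the real response body are left out, and hashing `last_updated.to_s` is only a placeholder serialization.

```ruby
require "digest/md5"

# Hypothetical stand-in for Redis that counts every call made to it.
class CountingStore
  attr_reader :calls

  def initialize
    @data = {}
    @calls = 0
  end

  def get(key)
    @calls += 1
    @data[key]
  end

  # Mimics SETNX: writes only when the key is absent.
  def setnx(key, value)
    @calls += 1
    return false if @data.key?(key)

    @data[key] = value
    true
  end
end

# Hypothetical simplified handler; returns [status, etag].
def handle_get(store, path, if_none_match, last_updated)
  key = "etag:#{path}"
  if if_none_match
    stored = store.get(key) # the single call a cache hit needs
    return [304, stored] if stored == if_none_match
  end

  # Cache miss: the route would build the full response here.
  etag = %(W/"#{Digest::MD5.hexdigest(last_updated.to_s)}")
  store.setnx(key, etag)
  [200, etag]
end

store = CountingStore.new
updated = Time.utc(2017, 1, 10, 12, 0, 0)

_status, etag = handle_get(store, "/projects/5/pipelines", 'W/"stale"', updated)
# Miss with a stale header: store.calls is 2 (GET, then SETNX).

status, = handle_get(store, "/projects/5/pipelines", etag, updated)
# Hit: status is 304 and store.calls has grown by exactly 1, to 3.
```

A request that carries no If-None-Match header at all skips the initial GET in this sketch, so it costs only the final write.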