
make sure we respect feed efficiency principles

Apart from paged feeds (RFC5005, see #33), we should generally check feed2exec for efficiency. We don't want to be hammering those poor sites too badly.

Normally, we're doing pretty well already, as we have an HTTP-level cache that checks headers and doesn't pull the feed if it's unchanged. But there are many more tricks documented in https://www.earth.org.uk/RSS-efficiency.html that we should look into. The TL;DR:

1. use Cache-Control max-age HTTP headers for a "do not poll again before" time: savings of 10x or much more are likely if the feed server is set up well; an unnecessary feed poll avoided entirely is the cheapest kind!
2. use a local cache and conditional GET (e.g. send If-Modified-Since and/or ETag HTTP headers): savings of 10x or more are likely
3. allow compression of the feed that you pull down (set Accept-Encoding HTTP headers) with at least gzip: savings of 2x to 10x are likely
4. avoid fetching the feed on skipHours (and/or skipDays) in an RSS feed: savings of 2x are plausible, and can be especially renewables/climate friendly
5. use the Retry-After header on error responses 429 ("Too Many Requests") and 503 ("Service Unavailable") as a "do not poll again before" time (like Cache-Control max-age above) when present: do NOT retry immediately/faster/repeatedly!
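To make items 2 and 3 concrete, here's roughly what a conditional, compressed fetch looks like with `requests`; the `state` dict is just a hypothetical stand-in for whatever our HTTP-level cache already persists (ETag / Last-Modified), not actual feed2exec code:

```python
import requests

def fetch_feed(url, state):
    """Conditional GET (item 2) with gzip compression (item 3).

    `state` is a hypothetical dict persisted between runs, holding the
    last seen ETag and Last-Modified values for this feed.
    """
    headers = {"Accept-Encoding": "gzip"}  # requests decompresses transparently
    if state.get("etag"):
        headers["If-None-Match"] = state["etag"]
    if state.get("last_modified"):
        headers["If-Modified-Since"] = state["last_modified"]

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None  # unchanged on the server, nothing to parse

    # remember the validators for the next poll
    state["etag"] = response.headers.get("ETag")
    state["last_modified"] = response.headers.get("Last-Modified")
    return response.text
```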

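Items 1 and 5 boil down to keeping a "do not poll again before" timestamp per feed. A sketch of how that could be computed from a requests-style response (the one-hour default interval is an arbitrary placeholder, not anything we currently define):

```python
import re
import time
from email.utils import parsedate_to_datetime

def next_poll_time(response, default_interval=3600):
    """Earliest time we may poll this feed again (items 1 and 5).

    Sketch only: assumes a requests-style response object; the default
    interval is an arbitrary placeholder.
    """
    now = time.time()

    # 429/503 with Retry-After wins: back off, never retry faster
    if response.status_code in (429, 503):
        retry_after = response.headers.get("Retry-After")
        if retry_after:
            if retry_after.isdigit():
                return now + int(retry_after)
            return parsedate_to_datetime(retry_after).timestamp()
        return now + default_interval

    # otherwise honour Cache-Control: max-age as "do not poll before"
    cache_control = response.headers.get("Cache-Control", "")
    match = re.search(r"max-age=(\d+)", cache_control)
    if match:
        return now + int(match.group(1))
    return now + default_interval
```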
We might already be doing some of this (items 2 and 3, for example), but I suspect our scheduling is not respecting the others at all, as we don't keep much state about the remote feeds.
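That missing state could be as small as a per-feed "next poll" timestamp plus the skipHours/skipDays we saw on the last fetch, consulted before we even open a connection. A rough sketch, where the shelve file, the `next_poll` key and the pre-extracted skip lists are all hypothetical:

```python
import shelve
import time
from datetime import datetime, timezone

def should_poll(url, skip_hours=(), skip_days=(), state_path="feed-state.db"):
    """Decide whether to poll a feed at all, based on state kept between runs.

    Sketch only: `feed-state.db` is a hypothetical shelve file holding a
    per-feed "next_poll" timestamp, and skip_hours/skip_days are assumed
    to have been extracted from the feed on a previous fetch (item 4).
    """
    now = datetime.now(timezone.utc)

    # item 4: avoid fetching during the feed's advertised skipHours/skipDays
    if now.hour in skip_hours or now.strftime("%A") in skip_days:
        return False

    # items 1 and 5: honour the "do not poll again before" time we stored
    with shelve.open(state_path) as state:
        next_poll = state.get(url, {}).get("next_poll", 0)
    return time.time() >= next_poll
```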
