Reduce impact of workhorse:notifications pubsub in Redis

As found in #1346 (closed), the Redis pubsub on the workhorse:notifications key (used to optimize long-polling requests for new jobs from Runners) accounts for a substantial portion of the current load on our persistent Redis (Shared State). The load is driven largely by the number of subscribers; the actual rate of publishes is fairly low (on the order of 300-400/s) compared to other Redis commands (e.g. get at 35K/s). Keeping it in that Redis partition/instance constrains the usefulness of moving that instance to Redis Cluster (a planned step in &618). It does not block that work, but because it is a single key (which hashes to one node only), it would skew load distribution across the cluster, meaning we would have to over-provision the cluster on an ongoing basis. And generally speaking, when we find such a discrete cause of high load, there's an opportunity to evaluate alternatives.
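To make the "one key hashes to one node only" point concrete, here is a minimal sketch of how Redis Cluster maps a key name to one of its 16384 hash slots (CRC16/XMODEM of the key, or of its hash tag if present, modulo 16384). The function name `hash_slot` is illustrative, not taken from any GitLab code. It only illustrates key-to-slot mapping; it does not model how pub/sub traffic is actually distributed by a given cluster version.

```python
import binascii

SLOT_COUNT = 16384  # fixed number of hash slots in Redis Cluster


def hash_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key (illustrative helper)."""
    data = key.encode("utf-8")
    # Hash tag rule: if the key contains "{...}" with a non-empty body,
    # only the text between the first "{" and the next "}" is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is CRC16/XMODEM.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


# A single key name always yields a single slot, so every operation on it
# lands on whichever node owns that slot.
print(hash_slot("workhorse:notifications"))
```

Because the key has no hash tag, the whole name is hashed, and no amount of added nodes spreads that one key across the cluster.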

Some initial options (without limitation; there are likely others):

  1. Partition to its own Redis instance, which will remain a traditional Sentinel-based non-Cluster Redis until it reaches saturation.
  2. Rather than having the entire API fleet subscribe, route the relevant web calls to a smaller dedicated API fleet so that fewer Workhorse instances are subscribed. We'd need to do the math on request rates and sizes to see whether that reduces the subscriber count enough to make the change worth it (left as an exercise for the reader). This has already been proposed in delivery#1766 for other reasons.
  3. Redis Streams may behave differently enough to be better, although that is not certain at first reading. It appears to have the same single-key behavior under Cluster, so we'd be looking for an absolute performance improvement up front and would still need to partition it. Reference: https://redis.io/topics/streams-intro
  4. Some different tech entirely; Redis pub/sub has worked well so far, but maybe it has run its course. This may be an option for later, after doing option 1 and/or 2 first.

Again: this is not an exhaustive list; there may be other approaches.
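As a starting point for "the math" in option 2, here is a rough sketch of the fan-out arithmetic: each publish is delivered once per subscriber, so delivered messages per second scale as publish rate times subscriber count. The publish rate of 350/s is the midpoint of the 300-400/s range above; the subscriber counts (1000 for the whole fleet, 100 for a dedicated fleet) are purely hypothetical placeholders, not measured values.

```python
def deliveries_per_second(publish_rate: float, subscribers: int) -> float:
    """Messages Redis must push out per second for one pub/sub channel."""
    return publish_rate * subscribers


PUBLISH_RATE = 350            # midpoint of the 300-400/s range quoted above
WHOLE_FLEET_SUBSCRIBERS = 1000  # hypothetical: every Workhorse subscribed
DEDICATED_SUBSCRIBERS = 100     # hypothetical: smaller dedicated API fleet

before = deliveries_per_second(PUBLISH_RATE, WHOLE_FLEET_SUBSCRIBERS)
after = deliveries_per_second(PUBLISH_RATE, DEDICATED_SUBSCRIBERS)

print(f"whole fleet:     {before:,.0f} deliveries/s")
print(f"dedicated fleet: {after:,.0f} deliveries/s")
print(f"reduction:       {before / after:.0f}x")
```

Plugging in real subscriber counts for the current and proposed fleets would show whether the reduction is large enough to be worth the routing change.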

Edited by Craig Miskell