Make Elasticsearch enabled check in Redis more efficient
## Problem
The scope of the problem as defined in the original issue at #209854 (closed) is that we need to answer a boolean question "is ES enabled for this project?" for every update to any resource in that project. The source of truth to answer this question is in the Postgres database with a complex join query through parent namespaces. The details aren't important except to note that it's not cheap.
We are answering this question a lot for many different projects.
It's also worth noting that the answer changes very infrequently (e.g. once a week/month) compared to how frequently it is answered (100 times per second). In addition, the number of projects for which the answer is "yes" is considerably smaller than those for which the answer is "no", because relatively few projects are indexed (today it's 45k out of millions on GitLab.com, though in the next few months it could easily be 500k-1M projects).
## Previously implemented cache does not match our scaling anymore
In the original MR (issue) we made the assumption that the cache size would be relatively small, but we didn't notice that `ElasticsearchIndexedNamespace.limited` and `ElasticsearchIndexedProject.limited` use the projects and namespaces hierarchy to generate IDs. Currently we have:
```
[ gprd ] production> ElasticsearchIndexedNamespace.limited.count
=> 4465
[ gprd ] production> ElasticsearchIndexedProject.limited.count
=> 45824
```
That's not scalable because we load the entire array into memory to do each lookup. Since we plan to add even more projects and namespaces here, we need to optimise our caching to prevent unnecessary load on our web servers and Redis.
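For illustration, here is a rough sketch of the kind of lookup pattern described above. The cache key name, the use of `Rails.cache`, and the `pluck(:project_id)` call are assumptions for the sake of the example and may not match the actual implementation:

```ruby
# Rough sketch (assumed names) of the current pattern: the full array of
# indexed project IDs is cached, and every check pulls the whole array back
# just to answer a single membership question.
def elasticsearch_enabled_for?(project_id)
  ids = Rails.cache.fetch('elasticsearch_indexed_project_ids', expires_in: 10.minutes) do
    ElasticsearchIndexedProject.limited.pluck(:project_id)
  end

  ids.include?(project_id) # O(n) scan over an array that keeps growing
end
```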
## Possible Solutions
### 1. Use a separate key per project
This seems like the most obvious style of cache. You look up the answer for a project and store it in Redis keyed under the project ID. This would definitely be a considerable advantage over what existed before we implemented the caching, and it does not suffer from the scaling problems of the current cache, but it isn't as efficient in some regards. Cache invalidation would be done by deleting all projects from the cache whenever the overall set of indexed namespaces/projects changes.
This is very easy to implement (a sketch follows the list below) but has some downsides:
- This cache is expensive since it uses a separate key per project, and it's not as compact as it could be, since the answer will be "no" for the vast majority of projects and we still store an entry for every one of them.
- We would of course use an expiration on each key, which means each project may exist in the cache for around 10 minutes. But during any 10 minute window we'll likely be looking up the answer for many different projects, which means we would still compute these answers with many database queries quite often.
- Cache invalidation may be more expensive since we'll need to delete many keys. It could be made more efficient with a Redis hash (a single key to delete), so using a Redis hash would probably make the most sense here. It would, however, require an extra call to set `EXPIRE` on the hash when we first write to it, and I'm not exactly sure of the simplest way to accomplish that. The expiry would also apply to all projects and not just the one we just cached.
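A minimal sketch of option 1, assuming a hypothetical `elastic_enabled:<project_id>` key scheme, a 10 minute TTL, and a hypothetical `project_in_indexed_namespace?` helper for the namespace part of the lookup:

```ruby
# Hypothetical per-project cache: one entry per project with a TTL.
# Both "yes" and "no" answers get cached, which is what makes this cache large.
EXPIRY = 10.minutes

def elasticsearch_enabled_for?(project_id)
  Rails.cache.fetch("elastic_enabled:#{project_id}", expires_in: EXPIRY) do
    # Fall back to the expensive Postgres lookup through parent namespaces.
    ElasticsearchIndexedProject.where(project_id: project_id).exists? ||
      project_in_indexed_namespace?(project_id) # hypothetical helper
  end
end
```

Swapping the individual keys for a single Redis hash keeps invalidation down to one `DEL`, at the cost of a shared expiry for the whole hash, which is the trade-off described in the last bullet above.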
### 2. Use an in-memory `Set`-based cache with Redis as a cache invalidation layer only
WIP Implementation: !29657 (closed)
This seems to be a more complicated solution but will likely be more efficient.
Basically the idea would be to store a similar cache of the full list of project IDs for which Elasticsearch is enabled, and to look project IDs up against that list. But instead of storing it in Redis, we store it in memory on each node, so we don't pay the expensive round trip to Redis for a large list. We also store it as an in-memory Ruby `Set` so that existence lookups are quicker than array lookups.

Since this cache exists on each node, invalidation will not be as easy. To handle it, each node keeps a timestamp of when it calculated its `Set` and compares that against a last-modified timestamp in Redis. If its cache is older, the node recomputes the cache and updates its timestamp. Cache invalidation then just involves updating the timestamp in Redis, so that every node can refresh its own cache.
From what I can tell, this is essentially a slightly less efficient implementation of Redis server-assisted client-side caching, but I don't actually know whether we can use that feature.
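A minimal sketch of this scheme, assuming a hypothetical `elastic:enabled_projects:last_updated` Redis key for the invalidation timestamp and `Gitlab::Redis::Cache` for the connection; class and method names are illustrative, and per-process sharing and thread safety are deliberately left out:

```ruby
require 'set'

# Hypothetical node-local cache: the full Set of enabled project IDs lives in
# process memory; Redis only holds a "last updated" timestamp used to decide
# when that Set must be rebuilt.
class ElasticEnabledProjectsCache
  TIMESTAMP_KEY = 'elastic:enabled_projects:last_updated'

  def initialize
    @ids = Set.new
    @loaded_at = 0.0 # never loaded
  end

  def include?(project_id)
    refresh! if stale?
    @ids.include?(project_id)
  end

  # Called whenever the set of indexed namespaces/projects changes.
  def self.invalidate!
    Gitlab::Redis::Cache.with { |redis| redis.set(TIMESTAMP_KEY, Time.now.to_f) }
  end

  private

  def stale?
    @loaded_at.zero? || @loaded_at < redis_last_updated
  end

  def redis_last_updated
    Gitlab::Redis::Cache.with { |redis| redis.get(TIMESTAMP_KEY) }.to_f
  end

  def refresh!
    # Record the time before querying so an invalidation that lands mid-refresh
    # still marks us stale on the next check.
    @loaded_at = Time.now.to_f
    @ids = ElasticsearchIndexedProject.limited.pluck(:project_id).to_set
  end
end
```

Note this sketch assumes roughly synchronised clocks between nodes; using `INCR` on a counter rather than a wall-clock timestamp would sidestep that.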
Some downsides of this implementation:
- The in-memory `Set` on each node will consume some memory. This should be very small, and considering this array is currently being loaded from Redis 100 times per second, it's likely that garbage collection under the existing cache implementation is already worse than just keeping the `Set` in memory all the time. See the calculations at #214440 (comment 324853649) for evidence that we likely won't exceed 33MB in this `Set` in the near future.
- This is slightly tricky to implement and we may get something wrong; as with all caching efforts, mistakes are inevitable.
## Solutions that don't seem to work
- `SISMEMBER`: this returns `0` when the set does not exist, so we can't tell whether something is not in the cache or whether the cache has been invalidated and thus does not exist. We can't ever rely on caches existing and need to be able to handle them being cleared for various reasons. Of course you could first check for the existence of the cache with `EXISTS`, but this would introduce a race condition where the set exists at first, gets deleted right after, and the next `SISMEMBER` lookup ends up being wrong. It's possible this problem could be mitigated by using Redis transactions (see the sketch below), but I'm not 100% sure of the performance implications, so I could use some advice on that.
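For reference, a minimal sketch of that transactional mitigation using redis-rb's `multi` block; the `elastic_enabled_projects` set key is hypothetical, and this assumes a redis-rb version that provides the boolean `exists?` helper:

```ruby
# Hypothetical sketch: run EXISTS and SISMEMBER atomically inside MULTI/EXEC so
# the set cannot be deleted between the two commands.
def cached_membership(project_id)
  set_exists, member = Gitlab::Redis::Cache.with do |redis|
    redis.multi do |transaction|
      transaction.exists?('elastic_enabled_projects')               # is the set there at all?
      transaction.sismember('elastic_enabled_projects', project_id) # is this project in it?
    end
  end

  return :cache_miss unless set_exists # set was invalidated/expired; recompute from Postgres

  member # true/false membership answer from the cached set
end
```

The transaction only guarantees that the two commands execute back to back; it does nothing about rebuilding the set, so every `:cache_miss` still falls through to the expensive database query.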