Since we want to mitigate risks during our large-scale rollout of Advanced global search on Elasticsearch on GitLab.com, we should have a separate Redis instance. This would allow us to be less risk-averse when making changes, because at the moment the scariest failure mode is overloading Redis and breaking features outside of advanced search.
As we roll out Elasticsearch to all paid customers on GitLab.com we should also expect massive increases in load on Redis, so separating the instance is probably something we would eventually want to do anyway.
This would, at a minimum, involve support from infra to get a separate Redis instance up and running. And since we share the same Redis client for all sidekiq work, we will likely need to build some middleware in there to isolate our workloads, which has been discussed previously in &2391 and #118820 . We will want to kick-start this, and it may be possible for us to just start by namespacing our redis keys with some prefix.
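For the key-prefix idea, a minimal sketch of what that could look like (the `advanced_search` prefix and the use of the `redis-namespace` gem here are illustrative, not a decided approach):

```ruby
require 'redis'
require 'redis-namespace'

# All advanced-search keys get a common prefix, so they can later be measured
# separately or migrated wholesale to a dedicated instance.
redis = Redis.new(url: ENV['REDIS_URL'])
search_redis = Redis::Namespace.new(:advanced_search, redis: redis)

# Stored as "advanced_search:incremental_updates" under the hood.
search_redis.zadd('incremental_updates', Time.now.to_f, 'Issue:123')
```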
It will also be important to have separate clear monitoring for this Redis instance so we can easily determine when we're nearing capacity.
The main load on redis from elasticsearch right now comes from Sidekiq jobs. Sidekiq does support pushing jobs to different redis servers depending on the worker (the pool option), so I think we can do this fairly easily in the application - the harder part is configuring it on GitLab.com so we actually have this separate redis infrastructure available. We'd need to stand up another (highly available) redis cluster, then configure both puma and sidekiq processes to use it.
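To make the application side concrete, here is a rough sketch of routing the elasticsearch workers to a separate Redis, assuming Sidekiq's sharding support (the per-worker `pool` option / `Sidekiq::Client.via`); the worker definition and `SEARCH_REDIS_URL` are illustrative, not the real GitLab code:

```ruby
require 'sidekiq'
require 'redis'
require 'connection_pool'

# A dedicated connection pool pointing at the (hypothetical) search Redis.
SEARCH_REDIS_POOL = ConnectionPool.new(size: 5) { Redis.new(url: ENV['SEARCH_REDIS_URL']) }

class ElasticIndexerWorker
  include Sidekiq::Worker
  # Jobs for this worker get pushed to the search Redis rather than the default one.
  sidekiq_options queue: :elastic_indexer, pool: SEARCH_REDIS_POOL
end

# The sidekiq processes consuming these queues would also need to be booted
# against SEARCH_REDIS_URL (Sidekiq.configure_server) - which is the infra part.
```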
@andrewn could I get an opinion from you on this in the context of GitLab.com ?
If we modify the elasticsearch sidekiq workers so they use a search pool for sidekiq (say), is it convenient / worthwhile to stand up separate redis infrastructure for gitlab-rails to push to, and for sidekiq workers handling the various elasticsearch queues to read from?
The intent is to isolate any negative effects of load generated by elasticsearch indexing from the rest of the application; we're also talking about changes that would result in us pushing significantly more "shared state"-style data to redis in non-sidekiq contexts.
How well does such an approach fit into your plans for improving redis scalability in general? Is GitLab.com improved if each feature_category can be assigned its own Redis infrastructure?
I think at the moment it would amount to setting up a parallel cluster of machines similar to the existing ones handling Gitlab::Redis::SharedState. Is that too much overhead / additional cost?
If the only load we’re adding to Redis is additional sidekiq jobs, my view is that we should run 1 sidekiq cluster excellently rather than 2 sidekiq clusters moderately.
Our redis-sidekiq host is currently relatively unburdened. By all means, design hooks in the app to allow us to split this off when necessary, but it’s probably not needed yet.
@andrewn at some point we're going to want our Elasticsearch implementation to store pending updates in Redis sorted sets rather than sidekiq queues, because that allows us to deduplicate them and apply updates to Elasticsearch in bulk rather than individually, which is very inefficient for Elasticsearch. At that point, updates to every kind of resource in GitLab will be added to a sorted set instead of sidekiq queues. I wonder if that changes this perspective at all? I don't think it would be good for us to switch to that architecture before we get a new Redis instance, because we could end up overloading our main Redis instance.
Could I ask this though: if Sidekiq (or some extension to Sidekiq) provided a good deduplication abstraction for idempotent jobs, would that be sufficient? Or is this more domain specific than that?
@andrewn Yes it would help and @nick.thomas pointed me to gitlab-com/gl-infra/scalability#42 (comment 280293173) but ultimately it's not quite enough because we're still processing each job individually and sidekiq is not giving us any way to pop 1000 jobs at a time from the queue and process them together so that we can do bulk updates to Elasticsearch.
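For reference, a very rough sketch of the sorted-set approach being described, to make the dedupe + bulk flow concrete (the key name, batch size and `bulk_index!` helper are placeholders, not real GitLab code):

```ruby
BACKLOG_KEY = 'elastic:incremental_updates'

# Writers: repeated updates to the same record collapse into a single entry.
def track_update(redis, record)
  redis.zadd(BACKLOG_KEY, Time.now.to_f, "#{record.class.name}:#{record.id}")
end

# A cron worker drains up to 1,000 references at a time and issues one
# Elasticsearch _bulk request for the whole batch.
def process_batch(redis, limit: 1_000)
  refs = redis.zrangebyscore(BACKLOG_KEY, '-inf', '+inf', limit: [0, limit])
  return if refs.empty?

  bulk_index!(refs)             # placeholder for the actual bulk ES update
  redis.zrem(BACKLOG_KEY, refs) # only remove what was actually processed
end
```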
I just got off the redis call, where I learned a few things.
First, sidekiq-redis is its own cluster now, rather than being part of the "shared state" cluster. So search's current use of sidekiq-as-backlog can "only" affect sidekiq - that has negative implications for the application as a whole, but far less severe than if we took down the shared-state redis.
Second, we're not particularly resource-constrained on this cluster at the moment. Peeking at the graphs: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1 - in current, non-error state, we're using ~360MiB, of a total 7.2GiB, and we can probably scale these machines up by another factor of ten before sidekiq job backlog becomes a serious concern. It's pretty cheap to have large numbers of jobs sat in storage, doing nothing.
The underlying issue we're worried about is that in the error condition, the elasticsearch queues grow without bound. This is similar to the Geo case where we might have a million or so Sidekiq jobs sat in the queue waiting to be processed - it's only a problem if storage is exhausted. Impact on CPU and network utilisation is basically nil, since jobs are still coming in and going out with the usual frequency.
So, I tend to agree with @andrewn that this angle isn't a priority. To be sure, I'm going to work out how much RAM a million elasticsearch sidekiq jobs equates to. This will give us a way to predict when the size of the elasticsearch sidekiq queues is likely to be a problem, and means we can be less conservative about pulling the plug when elasticsearch is in an error state.
Looking forward, the single best thing we can do is rework application code so that we have an upper bound on how much the backlog consumes. Right now, because we're storing jobs in sidekiq for every update and not deduplicating them, there is no upper bound - so even if it takes us a week, or a month, we definitely can exhaust whatever resources we give to the redis-sidekiq instances. How much of a practical problem that is, we'll know after some number crunching!
Sidekiq can't make use of "native" redis clustering. Its pool option is essentially application-level sharding to multiple independent redis instances. So the demands on infrastructure to stand new ones up are not insignificant, and since each one needs to be HA itself, each time we do, we're adding to the absolute number of incidents we can expect to see.
OK, a very crude approximation: I stopped rails-background-jobs and ran ElasticIndexerWorker.perform_async in a tight loop while monitoring RSS (actually used RAM) of the Redis process.
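(For anyone reproducing this, roughly the shape of the measurement - the worker arguments are placeholders since the real signature lives in the GitLab codebase, and `used_memory` stands in for watching RSS directly:)

```ruby
before = Sidekiq.redis { |c| c.info('memory')['used_memory'].to_i }

100_000.times do |i|
  # Placeholder arguments - only the per-job payload size matters here.
  ElasticIndexerWorker.perform_async(:update, 'Issue', i, "issue_#{i}")
end

after = Sidekiq.redis { |c| c.info('memory')['used_memory'].to_i }
puts "approx bytes per job: #{(after - before) / 100_000.0}"
```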
So, very pessimistically, 1,000,000 ElasticIndexerWorker jobs takes up 845MiB of RAM - around 11% of our current sidekiq-redis capacity, or 1% of a 72GiB machine, which is probably about where I'd stop being comfortable vertically scaling it. ElasticCommitIndexerWorker jobs are smaller.
@DylanGriffith do we know what the enqueue rate looks like, in terms of jobs per second? That would help us convert this into an estimate of how long we have to get elasticsearch back up and running before we'll start causing problems for other tenants.
Naively speaking, removing changed_fields would halve the size, and also open the way for us to deduplicate elasticsearch jobs, per gitlab-com/gl-infra/scalability#42 (comment 280293173), which would give us that upper limit on queue size.
One question: what makes the ElasticIndexerWorker.perform_async calls? If it is an emergency situation, could we stop (or slow) the rate at which we generate these jobs by throttling the upstream?
@andrewn they're created synchronously by Rails model hooks every time an UPDATE, INSERT or DELETE statement is performed on the database. Right now, they are the bookkeeping that tells us when we need to update elasticsearch
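Roughly the shape of it (simplified; the hook placement and argument names are illustrative rather than the exact GitLab implementation):

```ruby
class Issue < ApplicationRecord
  # Every INSERT/UPDATE commit enqueues a job recording that this document is stale...
  after_commit on: [:create, :update] do
    ElasticIndexerWorker.perform_async(:index, self.class.name, id, es_id)
  end

  # ...and DELETEs enqueue the corresponding removal from the index.
  after_commit on: :destroy do
    ElasticIndexerWorker.perform_async(:delete, self.class.name, id, es_id)
  end
end
```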
Work like #34086 (closed) allows us to store much less per item, and it'd be deduplicated by design, but it would probably operate in the same way - Rails model hooks - so again, no way to rate-limit them. It also moves the data from the sidekiq redis instance to the shared state one. I know you've spoken a little about that above; I've not investigated impact there yet.
If we lose the bookkeeping for any amount of time, then the whole elasticsearch index needs to be regenerated from scratch.
( #198282 (closed) is a radically different approach to the bookkeeping - instead of Rails model event hooks, it uses pg logical replication to learn when a document needs creating/updating/deleting in ES )
It's worth noting this is 1 of 2 queues we have for indexing stuff but this one is usually twice as long as the other queue.
So from this estimate I'd say we built up 35000 * 1.5 = 52,500 jobs in 78 minutes, which is about 673 jobs/minute, or roughly 40,000 jobs/hour.
In this circumstance it would have taken 25 hours to hit 1 million jobs at 11% of Redis memory saturation.
From that perspective I think if we had known this and had checked the Redis memory saturation we would have felt comfortable letting things grow while we waited for the Elasticsearch cluster to repair itself. Worth noting though we were thinking it would take hours for the cluster to repair itself.
With that said I think we'll probably be ok with the sidekiq redis instance we have now and will hopefully be careful to watch memory saturation before we decide to disable next time.
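Something along these lines could make that check routine rather than gut feel (the helper name and any threshold are made up):

```ruby
# Fraction of redis-sidekiq's configured maxmemory currently in use.
def sidekiq_redis_memory_saturation
  Sidekiq.redis do |conn|
    info = conn.info('memory')
    max = info['maxmemory'].to_f
    max.zero? ? nil : info['used_memory'].to_f / max
  end
end

# e.g. only disable indexing once this crosses an agreed threshold.
```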
However, this rate of ~40k jobs per hour is with Elasticsearch rolled out only to ourselves plus 1 large customer. Our goal is to roll out to thousands of paying GitLab.com customers, so the rate will grow massively over the next few months and we'll have less and less time to react if the Elasticsearch cluster crashes in future.
Thanks for the numbers @DylanGriffith . 11 jobs/second is a useful number to have around! That's prior to deduplication, of course.
With dedup, we can calculate how much space Redis needs to hold the full set of updates for all of GitLab.com - it's number of indexable rows * size per row.
We also have to bear in mind that a ZSET has a maximum sensible boundary on size, which I expect is likely to be the limiting factor rather than the amount of RAM. At some point, insertions and deletions simply become too slow. @andrewn has suggested 10ms as a rule of thumb for that.
I'll try to get numbers on that and translate this into some concrete recommendations for when to pull the plug on indexing, instead of letting the Redis set keep growing. I think we could have a circuit breaker that does this automatically - if we set a maximum bound on the size of the set, we can then gracefully, and automatically, prune projects from the indexed set instead of going above that, maintaining the consistency of the rest of the index. Essentially an enhancement of #201756 .
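A sketch of that circuit breaker, assuming the sorted-set backlog from the earlier sketch (the limit and `pause_indexing_for!` are made-up placeholders):

```ruby
BACKLOG_KEY = 'elastic:incremental_updates' # the sorted set from the earlier sketch
MAX_TRACKED_UPDATES = 10_000_000            # made-up bound, to be informed by the ZSET numbers

def enforce_backlog_limit(redis)
  overshoot = redis.zcard(BACKLOG_KEY) - MAX_TRACKED_UPDATES
  return if overshoot <= 0

  # Drop the oldest entries (ZPOPMIN needs Redis >= 5.0) and flag the affected
  # projects for a later full re-index, rather than growing without bound.
  redis.zpopmin(BACKLOG_KEY, overshoot).each do |member, _score|
    pause_indexing_for!(member)
  end
end
```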
Once we know how large a set is reasonable, we can make better decisions about the point of this issue - should it sit on the "shared application state" redis, or on a dedicated instance of its own?
25h is not a lot of time: taking the snapshot alone would have taken 12h ( https://gitlab.com/gitlab-com/gl-infra/production/issues/1591#note_277620669 ), and the cluster would still have to move shards and replay changes from the translog. Besides, the rate estimate was for only 4 namespaces. If we assume that the job rate is proportional to git repository size, then for all paying customers this would be: 40,000 jobs/hour * 8000GB (all paying customers) / 170GB (current size) ≈ 1.9 million jobs/hour, which is roughly 1.6GiB of RAM per hour.

My point here is not to argue in favour of using a dedicated redis cluster. I'm just trying to highlight the problem and help with the discussion so that we can come up with an appropriate plan. It might turn out that the best thing we can do for now is focus on the reliability of redis-sidekiq: #199951 (comment 282861847) . It might also be the case that we need to monitor this very carefully and scale redis/ES appropriately so that we always have at least a few hours of headroom.