Since we want to mitigate risks during our large-scale rollout of Advanced global search on Elasticsearch on GitLab.com, we should have a separate Redis instance. This would allow us to be less risk-averse when making changes, because at the moment the scariest failure mode is overloading Redis and breaking features outside of advanced search.
As we roll out Elasticsearch to all paid customers on GitLab.com we should also expect massive increases in load on Redis, so separating the instance is probably something we would eventually want to do anyway.
This would, at a minimum, involve support from infra to get a separate Redis instance up and running. And since we share the same Redis client for all sidekiq work, we will likely need to build some middleware in there to isolate our workloads, which has been discussed previously in &2391 and #118820 . We will want to kick-start this, and it may be possible for us to just start by namespacing our redis keys with some prefix.
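For the key-prefix idea, a minimal sketch of what that could look like (the `advanced_search` prefix and the use of the `redis-namespace` gem here are illustrative, not a decided approach):

```ruby
require 'redis'
require 'redis-namespace'

# All advanced-search keys get a common prefix, so they can later be measured
# separately or migrated wholesale to a dedicated instance.
redis = Redis.new(url: ENV['REDIS_URL'])
search_redis = Redis::Namespace.new(:advanced_search, redis: redis)

# Stored as "advanced_search:incremental_updates" under the hood.
search_redis.zadd('incremental_updates', Time.now.to_f, 'Issue:123')
```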
It will also be important to have separate clear monitoring for this Redis instance so we can easily determine when we're nearing capacity.
The main load on redis from elasticsearch right now comes from Sidekiq jobs. Sidekiq does support pushing jobs to different redis servers depending on the worker (the pool option), so I think we can do this fairly easily in the application - the harder part is configuring it on GitLab.com so we actually have this separate redis infrastructure available. We'd need to stand up another (highly available) redis cluster, then configure both puma and sidekiq processes to use it.
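To make the application side concrete, here is a rough sketch of routing the elasticsearch workers to a separate Redis, assuming Sidekiq's sharding support (the per-worker `pool` option / `Sidekiq::Client.via`); the worker definition and `SEARCH_REDIS_URL` are illustrative, not the real GitLab code:

```ruby
require 'sidekiq'
require 'redis'
require 'connection_pool'

# A dedicated connection pool pointing at the (hypothetical) search Redis.
SEARCH_REDIS_POOL = ConnectionPool.new(size: 5) { Redis.new(url: ENV['SEARCH_REDIS_URL']) }

class ElasticIndexerWorker
  include Sidekiq::Worker
  # Jobs for this worker get pushed to the search Redis rather than the default one.
  sidekiq_options queue: :elastic_indexer, pool: SEARCH_REDIS_POOL
end

# The sidekiq processes consuming these queues would also need to be booted
# against SEARCH_REDIS_URL (Sidekiq.configure_server) - which is the infra part.
```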
@andrewn could I get an opinion from you on this in the context of GitLab.com ?
If we modify the elasticsearch sidekiq workers so they use a search pool for sidekiq (say), is it convenient / worthwhile to stand up separate redis infrastructure for gitlab-rails to push to, and for sidekiq workers handling the various elasticsearch queues to read from?
The intent is to isolate any negative effects of load generated by elasticsearch indexing from the rest of the application; we're also talking about changes that would result in us pushing significantly more "shared state"-style data to redis in non-sidekiq contexts.
How well does such an approach fit into your plans for improving redis scalability in general? Is GitLab.com improved if each feature_category can be assigned its own Redis infrastructure?
I think at the moment it would amount to setting up a parallel cluster of machines similar to the existing ones handling Gitlab::Redis::SharedState. Is that too much overhead / additional cost?
If the only load we’re adding to Redis is additional sidekiq jobs, my view is that we should run 1 sidekiq cluster excellently rather than 2 sidekiq clusters moderately.
Our redis-sidekiq host is currently relatively unburdened. By all means, design hooks in the app to allow us to split this off when necessary, but it’s probably not needed yet.
@andrewn at some point we're going to want our Elasticsearch implementation to store pending updates in Redis sorted sets rather than sidekiq queues, because that allows us to deduplicate them and apply updates to Elasticsearch in bulk rather than individually, which is very inefficient for Elasticsearch. At that point, updates to every kind of resource in GitLab will be added to a sorted set instead of sidekiq queues. I wonder if that changes this perspective at all? I don't think it would be good for us to switch to that architecture before we get a new Redis instance, because we could end up overloading our main Redis instance.
Could I ask this though: if Sidekiq (or some extension to Sidekiq) provided a good deduplication abstraction for idempotent jobs, would that be sufficient? Or is this more domain specific than that?
@andrewn Yes it would help and @nick.thomas pointed me to gitlab-com/gl-infra/scalability#42 (comment 280293173) but ultimately it's not quite enough because we're still processing each job individually and sidekiq is not giving us any way to pop 1000 jobs at a time from the queue and process them together so that we can do bulk updates to Elasticsearch.
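For reference, a very rough sketch of the sorted-set approach being described, to make the dedupe + bulk flow concrete (the key name, batch size and `bulk_index!` helper are placeholders, not real GitLab code):

```ruby
BACKLOG_KEY = 'elastic:incremental_updates'

# Writers: repeated updates to the same record collapse into a single entry.
def track_update(redis, record)
  redis.zadd(BACKLOG_KEY, Time.now.to_f, "#{record.class.name}:#{record.id}")
end

# A cron worker drains up to 1,000 references at a time and issues one
# Elasticsearch _bulk request for the whole batch.
def process_batch(redis, limit: 1_000)
  refs = redis.zrangebyscore(BACKLOG_KEY, '-inf', '+inf', limit: [0, limit])
  return if refs.empty?

  bulk_index!(refs)             # placeholder for the actual bulk ES update
  redis.zrem(BACKLOG_KEY, refs) # only remove what was actually processed
end
```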
I just got off the redis call, where I learned a few things.
First, sidekiq-redis is its own cluster now, rather than being part of the "shared state" cluster. So search's current use of sidekiq-as-backlog can "only" affect sidekiq - that has negative implications for the application as a whole, but far less severe than if we took down the shared-state redis.
Second, we're not particularly resource-constrained on this cluster at the moment. Peeking at the graphs: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1 - in current, non-error state, we're using ~360MiB, of a total 7.2GiB, and we can probably scale these machines up by another factor of ten before sidekiq job backlog becomes a serious concern. It's pretty cheap to have large numbers of jobs sat in storage, doing nothing.
The underlying issue we're worried about is that in the error condition, the elasticsearch queues grow without bound. This is similar to the Geo case where we might have a million or so Sidekiq jobs sat in the queue waiting to be processed - it's only a problem if storage is exhausted. Impact on CPU and network utilisation is basically nil, since jobs are still coming in and going out with the usual frequency.
So, I tend to agree with @andrewn that this angle isn't a priority. To be sure, I'm going to work out how much RAM a million elasticsearch sidekiq jobs equates to. This will give us a way to predict when the size of the elasticsearch sidekiq queues is likely to be a problem, and means we can be less conservative about pulling the plug when elasticsearch is in an error state.
Looking forward, the single best thing we can do is rework application code so that we have an upper bound on how much the backlog consumes. Right now, because we're storing jobs in sidekiq for every update and not deduplicating them, there is no upper bound - so even if it takes us a week, or a month, we definitely can exhaust whatever resources we give to the redis-sidekiq instances. How much of a practical problem that is, we'll know after some number crunching!
Sidekiq can't make use of "native" redis clustering. Its pool option is essentially application-level sharding to multiple independent redis instances. So the demands on infrastructure to stand new ones up are not insignificant, and since each one needs to be HA itself, each time we do, we're adding to the absolute number of incidents we can expect to see.
OK, a very crude approximation: I stopped rails-background-jobs and ran ElasticIndexerWorker.perform_async in a tight loop while monitoring RSS (actually used RAM) of the Redis process.
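(For anyone reproducing this, roughly the shape of the measurement - the worker arguments are placeholders since the real signature lives in the GitLab codebase, and `used_memory` stands in for watching RSS directly:)

```ruby
before = Sidekiq.redis { |c| c.info('memory')['used_memory'].to_i }

100_000.times do |i|
  # Placeholder arguments - only the per-job payload size matters here.
  ElasticIndexerWorker.perform_async(:update, 'Issue', i, "issue_#{i}")
end

after = Sidekiq.redis { |c| c.info('memory')['used_memory'].to_i }
puts "approx bytes per job: #{(after - before) / 100_000.0}"
```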
So, very pessimistically, 1,000,000 ElasticIndexerWorker jobs takes up 845MiB of RAM - around 11% of our current sidekiq-redis capacity, or 1% of a 72GiB machine, which is probably about where I'd stop being comfortable vertically scaling it. ElasticCommitIndexerWorker jobs are smaller.
@DylanGriffith do we know what the enqueue rate looks like, in terms of jobs per second? That would help us convert this into an estimate of how long we have to get elasticsearch back up and running before we'll start causing problems for other tenants.
Naively speaking, removing changed_fields would halve the size, and also open the way for us to deduplicate elasticsearch jobs, per gitlab-com/gl-infra/scalability#42 (comment 280293173), which would give us that upper limit on queue size.
One question: what makes the ElasticIndexerWorker.perform_async calls? If it is an emergency situation, could we stop (or slow) the rate at which we generate these jobs by throttling the upstream?
@andrewn they're created synchronously by Rails model hooks every time an UPDATE, INSERT or DELETE statement is performed on the database. Right now, they are the bookkeeping that tells us when we need to update elasticsearch
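Roughly the shape of it (simplified; the hook placement and argument names are illustrative rather than the exact GitLab implementation):

```ruby
class Issue < ApplicationRecord
  # Every INSERT/UPDATE commit enqueues a job recording that this document is stale...
  after_commit on: [:create, :update] do
    ElasticIndexerWorker.perform_async(:index, self.class.name, id, es_id)
  end

  # ...and DELETEs enqueue the corresponding removal from the index.
  after_commit on: :destroy do
    ElasticIndexerWorker.perform_async(:delete, self.class.name, id, es_id)
  end
end
```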
Work like #34086 (closed) allows us to store much less per item, and it'd be deduplicated by design, but it would probably operate in the same way - Rails model hooks - so again, no way to rate-limit them. It also moves the data from the sidekiq redis instance to the shared state one. I know you've spoken a little about that above; I've not investigated impact there yet.
If we lose the bookkeeping for any amount of time, then the whole elasticsearch index needs to be regenerated from scratch.
( #198282 (closed) is a radically different approach to the bookkeeping - instead of Rails model event hooks, it uses pg logical replication to learn when a document needs creating/updating/deleting in ES )
It's worth noting this is 1 of 2 queues we have for indexing stuff but this one is usually twice as long as the other queue.
So from this estimate I'd say we built up 35000 * 1.5 = 52,500 jobs in 78 minutes, which is about 673 jobs/minute, or roughly 40,000 jobs/hour.
In this circumstance it would have taken 25 hours to hit 1 million jobs at 11% of Redis memory saturation.
From that perspective I think if we had known this and had checked the Redis memory saturation we would have felt comfortable letting things grow while we waited for the Elasticsearch cluster to repair itself. Worth noting though we were thinking it would take hours for the cluster to repair itself.
With that said I think we'll probably be ok with the sidekiq redis instance we have now and will hopefully be careful to watch memory saturation before we decide to disable next time.
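Something along these lines could make that check routine rather than gut feel (the helper name and any threshold are made up):

```ruby
# Fraction of redis-sidekiq's configured maxmemory currently in use.
def sidekiq_redis_memory_saturation
  Sidekiq.redis do |conn|
    info = conn.info('memory')
    max = info['maxmemory'].to_f
    max.zero? ? nil : info['used_memory'].to_f / max
  end
end

# e.g. only disable indexing once this crosses an agreed threshold.
```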
However, this rate of ~40k jobs per hour is with Elasticsearch rolled out only to ourselves plus 1 large customer. Our goal is to roll out to thousands of paying GitLab.com customers, so the rate will grow massively over the next few months and we'll have less and less time to react if the Elasticsearch cluster crashes in future.
Thanks for the numbers @DylanGriffith . 11 jobs/second is a useful number to have around! That's prior to deduplication, of course.
With dedup, we can calculate how much space Redis needs to hold the full set of updates for all of GitLab.com - it's number of indexable rows * size per row.
We also have to bear in mind that a ZSET has a maximum sensible boundary on size, which I expect is likely to be the limiting factor rather than the amount of RAM. At some point, insertions and deletions simply become too slow. @andrewn has suggested 10ms as a rule of thumb for that.
I'll try to get numbers on that and translate this into some concrete recommendations for when to pull the plug on indexing, instead of letting the Redis set keep growing. I think we could have a circuit breaker that does this automatically - if we set a maximum bound on the size of the set, we can then gracefully, and automatically, prune projects from the indexed set instead of going above that, maintaining the consistency of the rest of the index. Essentially an enhancement of #201756 .
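A sketch of that circuit breaker, assuming the sorted-set backlog from the earlier sketch (the limit and `pause_indexing_for!` are made-up placeholders):

```ruby
BACKLOG_KEY = 'elastic:incremental_updates' # the sorted set from the earlier sketch
MAX_TRACKED_UPDATES = 10_000_000            # made-up bound, to be informed by the ZSET numbers

def enforce_backlog_limit(redis)
  overshoot = redis.zcard(BACKLOG_KEY) - MAX_TRACKED_UPDATES
  return if overshoot <= 0

  # Drop the oldest entries (ZPOPMIN needs Redis >= 5.0) and flag the affected
  # projects for a later full re-index, rather than growing without bound.
  redis.zpopmin(BACKLOG_KEY, overshoot).each do |member, _score|
    pause_indexing_for!(member)
  end
end
```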
Once we know how large a set is reasonable, we can make better decisions about the point of this issue - should it sit on the "shared application state" redis, or on a dedicated instance of its own?
25h is not a lot of time: taking the snapshot alone would have taken 12h ( https://gitlab.com/gitlab-com/gl-infra/production/issues/1591#note_277620669 ), and the cluster would still have to move shards and replay changes from the translog. Besides, the rate estimate was for only 4 namespaces. If we assume that the job rate is proportional to git repository size, then for all paying customers this would be: 40,000 jobs/hour * 8000GB (all paying customers) / 170GB (current size) ≈ 1.9 million jobs/hour, which is roughly 1.6GiB of RAM per hour.

My point here is not to argue in favour of using a dedicated redis cluster. I'm just trying to highlight the problem and help with the discussion so that we can come up with an appropriate plan. It might turn out that the best thing we can do for now is focus on the reliability of redis-sidekiq: #199951 (comment 282861847) . It might also be the case that we need to monitor this very carefully and scale redis/ES appropriately so that we always have at least a few hours of headroom.