Change the repository indexing to use a sorted-set incremental update model as well. This would require incremental updates to add one job per file to the sorted set, as described in the last paragraph of #34086 (comment 230326472). We can leave this for later since the git indexing queues are not growing wildly yet, and we already gain some efficiency today from using the last updated SHA to only index what changed.
Also, as noted in !24298 (comment 286562360), one key benefit this adds is that all indexing work is roughly equally sized and fanned out as early as possible. Fanning out as early as possible will make it much easier for SREs or developers watching queues to know whether the system is operating in a healthy way: initial indexing will show an expected giant burst in the number of jobs in the queue, followed by a gradual drop over time as we catch up. This is in contrast to today, where a few very large jobs hold up the queue for hours during processing, so all the queues keep growing over several hours while we're catching up, which makes the system look unhealthy while we have no data to tell us it's not. This becomes more important as we scale out to 100s or 1000s of groups at a time, since these delays may go from a couple of hours to a couple of days to catch up.
When catching up takes a couple of days, we may also want to ensure initial indexing uses a different queue from the incremental updates so that incremental updates aren't hugely delayed. I'm not sure if this change will make that harder or easier.
Proposal
Every blob indexing job only ever deals with a single file, but we process them in batches of 1000. During initial indexing we add one job to the sorted set per file that needs to be indexed, and incremental updates only add jobs for the updated files. A cron worker picks these jobs up in batches of 1000.
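A minimal sketch of how this could look on the Rails side, assuming a plain Redis connection via `Gitlab::Redis::SharedState` and illustrative key names, member format, and helper names (none of this is final code):

```ruby
ZSET_KEY   = 'elastic:code:incremental_updates' # illustrative key names
SCORE_KEY  = "#{ZSET_KEY}:score"
BATCH_SIZE = 1000

# Enqueue one entry per changed file (from initial indexing or the post-receive path).
def track_file!(project_id, path)
  Gitlab::Redis::SharedState.with do |redis|
    redis.zadd(ZSET_KEY, redis.incr(SCORE_KEY), "#{project_id}:#{path}")
  end
end

# Cron worker body: take the oldest 1000 entries and hand them to the indexer,
# grouped by project so Gitaly access stays repository-oriented.
def process_batch!
  Gitlab::Redis::SharedState.with do |redis|
    members = redis.zrangebyscore(ZSET_KEY, '-inf', '+inf', limit: [0, BATCH_SIZE])
    next if members.empty?

    members.group_by { |m| m.split(':', 2).first }.each do |project_id, refs|
      # ... invoke the Go indexer for this project with only these files ...
    end
    # Removal of processed members is a separate step; see the score-range
    # delete pattern discussed later in this thread.
  end
end
```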
Performance considerations
We don't want to regress into doing lots of indexing work in Ruby, so we'd still like to leverage the Go indexer: it is much more CPU efficient at the marshalling necessary for indexing these blobs, and going back to doing that in Ruby would be a performance regression. As such, we may want to either push all the file names into the Go indexer and have it load them, or, possibly even more efficiently, delegate the Redis queue popping to the Go indexer as well, though that may be problematic if we end up duplicating lots of queue-handling logic in Go.
It also may be problematic that the batches of 1000 may contain files from different repos. If this ends up being less performant from a Gitaly perspective this approach may not be ideal.
@nick.thomas @dgruzd what do you think about this issue? I'm particularly interested in highlighting the fanout nature of this, which has come up in a couple of threads already. I'm curious whether you believe that going all in on fanouts in every part of indexing is a good idea for the observability of indexing progress. I think addressing the problem of understanding progress is going to be key to hitting our next scaling goals here.
@DylanGriffith I'm not sure that handling repository updates a file at a time is going to be an improvement. Gitaly's access is repository-oriented; we definitely get a lot of speedup out of batching these updates per-repository, as you highlight in the description.
To get the list of files we need to add or remove, we run a git diff between two SHAs. This could be a git diff --stat at scheduling time; I don't know how much cheaper than actually generating the diff that is, but it worries me that we'd need it. Making scheduling expensive seems like a retrograde step.
There is deduplication we can do while still scheduling in units of repositories and SHAs; we can also add progress information for ongoing jobs and ensure that at-most-one per repository is running in parallel ( #32648 (closed) ). Would that combination work to address the concerns that motivate this issue?
A final note - a lot of the work that used to be done in the Ruby indexer has now moved from the Go indexer to Gitaly (which didn't exist when it was written). I'd imagine we're still getting some performance improvements from what is left in gitlab-elasticsearch-indexer, but if we were to consider a decomposition this dramatic, I wouldn't discard the idea of doing all the gitlab-side processing in Ruby out of hand.
I think using SHAs is a better approach. We might even want to create a similar bulk processor for gitlab-elasticsearch-indexer with deduplication.
If we schedule a list of commits for each task, we can deduplicate it on the fly. When we see that an old indexing task is a subset of a new one we can easily drop the old one.
Example
a, b, c, d, e is a linear commit history
```
a..b..c       <- this could be dropped
a..b..c..d    <- this also could be dropped
a..b..c..d..e <- only this task should be processed
```
Maybe it can be simpler than this, actually. If we're already storing the last indexed SHA for a project in our DB, and we only ever index the master branch, then our queue doesn't even need to include the from_sha or to_sha and we can just de-duplicate on project id. I'm wondering if there are flaws with that approach. It's a smaller step but allows de-duplication. It doesn't really address my other concern about not having much visibility into progress, but it helps with de-duplicating work, which is likely important during long queue delays because people keep pushing to master and we keep redoing the same work over and over.
@nick.thomas is there any reason we need to keep the from_sha and to_sha in the payload? Couldn't we just always index up to master, given we already persist the last indexed SHA, so we're just diffing from the last indexed SHA to master?
@DylanGriffith no, I think we could remove them from the payload. Applying the sidekiq queue deduplication logic then gets us into a good place here, I think.
(Whatever we do, we have to ensure that a rebase on master does the right thing - but we don't need the SHAs in the payload to do that. We can just always index HEAD of master and check whether the last known commit is a parent of it or not)
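A rough sketch of that check, assuming the existing `index_status.last_commit` bookkeeping and `Repository#ancestor?`; the two helper methods are hypothetical names for illustration only:

```ruby
# Sketch only: decide between an incremental pass and a full purge + reindex.
head_sha = project.repository.commit(project.default_branch)&.sha
last_sha = project.index_status&.last_commit # assumed bookkeeping column

if last_sha.blank? || !project.repository.ancestor?(last_sha, head_sha)
  # First indexing, or master was rebased/force-pushed: the old SHA is not a
  # parent of HEAD, so fall back to a full pass for this repository.
  full_reindex!(project)
else
  # Normal case: only what changed between last_sha and head_sha needs work.
  incremental_index!(project, from: last_sha, to: head_sha)
end
```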
Another benefit of doing this: we may currently be sending small payloads too frequently to Elasticsearch, since we index every push, so batching up every minute could already be quite an efficiency improvement. Today we're already using the bulk API, but our bulks may be very small because a single push may be very small, while our cluster is quite big and can handle very large bulks efficiently.
Currently, roll-outs to new groups (i.e. initial indexing) are handled by the same queue as updates to repositories, which means that large roll-outs like https://gitlab.com/gitlab-org/gitlab/-/issues/211756 block updates to existing indexing for long periods of time (that one took about 12 hours). Working on this issue separates initial indexing from updates, so these delays are less problematic for projects already in the index, and it will make indexing faster as well, helping to reduce the time it takes to add more groups to the index.
EDIT: This is already in %12.10 so I think that's fine
@DylanGriffith @nick.thomas After reading through this and having a discussion with @dgruzd, it seems like the scope of this issue is now smaller than the initial proposal.
The current proposal is to stop wasting time re-indexing commits that either:
To do this, we want to leverage Sidekiq de-duplication on the task queue, such that there can only be a single task per repository (using project_id); a sketch follows the list below. Each task can then be seen as an indexing trigger event, ordered in time.
In order to achieve this, we need to make sure that:
We can remove sha_from, sha_to from the task parameter
Enable the de-duplication logic on the queue
Add logic in the worker to handle the rebase scenario
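Roughly what the slimmed-down worker could look like (the `idempotent!`/`deduplicate` worker attributes and the indexer call are approximations of the existing GitLab plumbing, not final code; the rebase handling is elided):

```ruby
# With sha_from/sha_to removed, project_id (and the wiki flag) is the whole
# payload, so duplicate pushes collapse into a single queued job.
class ElasticCommitIndexerWorker
  include ApplicationWorker

  idempotent!
  deduplicate :until_executed

  def perform(project_id, wiki = false)
    project = Project.find_by_id(project_id)
    return unless project

    # Always index from the last persisted SHA up to HEAD of the default branch.
    Gitlab::Elastic::Indexer.new(project, wiki: wiki).run
  end
end
```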
Please note that this work will not solve the observability issue, as the workload of each task will still be unknown to the worker and as such the relationship between the queue size and the total workload is non-linear.
@mbergeron what you are describing is valuable and certainly smaller in scope so we can do that.
I am personally inclined to additionally explore an iteration on that where updates are added to a sorted set and processed in bulk such that we aren't sending small updates to the cluster. In order to be maximally efficient and considering repo indexing is the bulk of our workloads then we need to eventually stop sending many small payloads to Elasticsearch.
If we keep processing one project at a time then, more often than not, our payloads aren't going to exceed 100k, since that's already a large merge to master. But we want to be sending payloads of 10M on average (or larger in future, but this is the default setting for bulk update size and what's used at the moment on .com). I think the only way to get the average bulk size to that point will be to process multiple project updates at the same time, and hence it seems logical it would follow a similar implementation pattern to the other one with sorted sets, etc. The only additional quirk with project updates is comparing the last indexed SHA, but I think that can still work with this architecture.
I'd still be keen to explore the idea of putting the file names in the sorted sets as well (like the original proposal), so long as we could find a way to efficiently query those files from Gitaly, which I think would really just rely on grouping by project before sending the requests to Gitaly for the files. But even without that we can use the sorted sets and buffering plus comparing to the last indexed SHA.
My recommendation if it's easy is to remove the extra args as you proposed and merge that since it's a small MR (perhaps create a separate issue for that). I think we get de-duplication for free so long as the worker is marked idempotent.
Then I'd propose we implement this issue as originally described (using the file name as key is optional, but batching projects together is the point) to ensure we index source code more efficiently, since those queues do backlog quite a bit during initial indexing periods and the de-duplication alone isn't likely to speed things up that much.
I am personally inclined to additionally explore an iteration on that where updates are added to a sorted set and processed in bulk such that we aren't sending small updates to the cluster. In order to be maximally efficient and considering repo indexing is the bulk of our workloads then we need to eventually stop sending many small payloads to Elasticsearch.
I understand the goal here, but IMO the only way you'll get a real improvement in either speed or observability is if there's a way for the scheduler to know the workload (i.e. the amount of work) whenever it runs a batch.
And this would require some logic whenever a task is enqueued, so that it can be weighted.
The proposal to go by project_id, file doesn't strike me as a large optimization, as the worker would process each project sequentially, so you'd basically only save the spawning of a new worker.
Then I'd propose we implement this issue as originally described (using the file name as key is optional, but batching projects together is the point) to ensure we index source code more efficiently, since those queues do backlog quite a bit during initial indexing periods and the de-duplication alone isn't likely to speed things up that much.
I'm still puzzled by how this would speed things up, in my mental model.
Great. As I said, removing some arguments and de-duplicating in Sidekiq is indeed an improvement on what we have today. I recommend you make this change since it is small and valuable. That said, it is not what this issue describes, so please create a separate issue; I think it's fine to prioritize it ahead of this issue because it gets us incrementally closer in some ways. Or don't bother creating a new issue and just do it. Either way, don't close this issue, as it still contains work we should do and we don't want to lose that context.
I understand the goal here, but IMO the only way you'll get a real improvement in either speed or observability is if there's a way for the scheduler to know the workload (i.e. the amount of work) whenever it runs a batch.
I don't follow. But basically the benefits are pretty much the same as the benefits we got from implementing #34086 (closed) . Elasticsearch behaves better when you send it few large bulk updates compared to many smaller updates. When we process updates to projects individually then we are likely processing many small updates. If we can process updates to many projects at the same time in a single request to Elasticsearch we improve the performance characteristics of Elasticsearch.
Much like !24298 (merged) we would only be sending requests to the cluster every minute and they would be a batch containing all updates that have happened for the last minute. Currently we could be sending requests to the cluster far more frequently and this will behave poorly as we add more groups (projects) to the index.
And this would require some logic whenever a task is enqueued, so that it can be weighted.
I don't think we need to know the workload ahead of time like you are describing. Just stick the update in the sorted set and pull them off in batches every minute and process that batch. A batch over the last minute will be larger than an individual update and thus Elasticsearch performs better.
The proposal to go by project_id, file doesn't strike me as a large optimization, as the worker would process each project sequentially, so you'd basically only save the spawning of a new worker.
I propose the worker would process the projects in a batch. Specifically, the worker would send bulk requests to Elasticsearch that contain updates to multiple different projects in the same request. This will obviously require some changes to the worker, but that is what this issue describes. Separating out the file name provides the additional benefit of fine-grained de-duplication down to what actually changed. It would also allow us to remove last-indexed state from our database, which means these indexers would never write anything to Postgres, and read-only processes help with performance. But again, I'd be fine with an implementation that didn't split out the file name in the first attempt, because it may have negative performance characteristics for our Gitaly calls, or at least it would likely require more Gitaly APIs or more complex Ruby code to optimize how we query Gitaly.
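For illustration only, the kind of combined bulk body this implies, mixing blobs from several projects in one request (index name, document ids, and routing values are made up; parent/child join details are omitted; `client` is a standard elasticsearch-ruby client):

```ruby
# One bulk request containing blob operations for several different projects.
operations = [
  { index: { _index: 'gitlab-production', _id: 'blob_1_app/models/user.rb', routing: 'project_1' } },
  { type: 'blob', project_id: 1, path: 'app/models/user.rb', content: '...' },
  { index: { _index: 'gitlab-production', _id: 'blob_2_README.md', routing: 'project_2' } },
  { type: 'blob', project_id: 2, path: 'README.md', content: '...' },
  { delete: { _index: 'gitlab-production', _id: 'blob_3_lib/old.rb', routing: 'project_3' } }
]

client.bulk(body: operations)
```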
I don't follow. But basically the benefits are pretty much the same as the benefits we got from implementing #34086 (closed) . Elasticsearch behaves better when you send it few large bulk updates compared to many smaller updates. When we process updates to projects individually then we are likely processing many small updates. If we can process updates to many projects at the same time in a single request to Elasticsearch we improve the performance characteristics of Elasticsearch.
Reading this comment made it much clearer in my mind: the bottleneck here is writing to ElasticSearch.
This was unclear in my mind, as I thought the main issue was reading the data from gitaly.
I don't think we need to know the workload ahead of time like you are describing. Just stick the update in the sorted set and pull them off in batches every minute and process that batch. A batch over the last minute will be larger than an individual update and thus Elasticsearch performs better.
Again, that ties in the false assumption that extracting the data was the issue. It makes more sense now.
@DylanGriffith thanks again for the clarifications, I'll dive into the first iteration of this and then we can see what we can move forward after.
Reading this comment made it much clearer in my mind: the bottleneck here is writing to ElasticSearch. This was unclear in my mind, as I thought the main issue was reading the data from gitaly.
@mbergeron yes that's correct with regards to this issue. This issue plans to address long term scalability from the Elasticsearch side. While it's not a bottle-neck today we will be massively increasing the load of this indexing writing in future as we add more customers. As such this issue is based on Elasticsearch recommendations to always try and batch up larger updates less frequently.
Gitaly performance is mentioned because it's important that the changes we make are not at the expense of Gitaly as it could reasonably become a bottle-neck too and the impacts are worse in that case because overloading Gitaly can affect many parts of GitLab outside of the scope of our team and we don't want that.
As it says, it should reverse the commits, but it doesn't.
I'm investigating, but I think it has to do with the fact that the index cleanup code is only triggered if the commit we are indexing from is no longer in the repository, instead of checking whether it is in the master branch.
For instance:
```shell
git commit -m "test 1"
git commit -m "test 2"
# index contains `test 2`
git reset --hard HEAD~1
# index still contains `test 2`
```
I think this is a side-effect of the caching in our ElasticSearch results.
@dgruzd is there a way to disable the result cache when running specs, as it makes everything harder. More generally, can you point me at the result cache code?
@mbergeron I don't remember that we have any results caching, maybe this is a problem with refresh_interval?
In specs we use refresh_index! to make sure all changes are committed and visible since elasticsearch writes are not reflected right away. By default we will see old results for approximately one second because of the refresh_interval setting.
Just a note that we'd actually prefer to invoke ensure_elasticsearch_index! consistently for all tests to avoid confusion since it does resolve 2 async processes.
After a discussion with @dgruzd, there was a problem with the test setup code where it wouldn't index the Project that owns the rest of the entities. As such, some queries (like the delete_index_…) that leverage the parent-child relationship would end up failing.
I've been looking at multiple avenues for this and I settled on implementing a new Submitter in the gitlab-elasticsearch-indexer, such that each index operation (either put or delete) is sent to a Redis ZSET (oplog) using the BlobID as key. This will ensure there can only be a single operation pending for a specific file.
I will then create a new cron worker that loads N operations from the queue, and submit them in bulk to Elastic by combining operations.
```mermaid
graph TD;
  ElasticSearch

  subgraph RoR
    Controller
  end

  subgraph Redis
    commit_indexer_queue(ElasticSearchCommitIndexer)
    oplog(ElasticSearchCommitIndexerOpLog)
  end

  subgraph Sidekiq
    commit_indexer_queue --> ElasticSearchCommitIndexer
    ElasticSearchCommitIndexer -- Blob Operation --> oplog
    oplog -. N Blob Operations .-> ElasticOpLogCronWorker
  end

  ElasticOpLogCronWorker --> ElasticSearch
  Controller -- project_id --> commit_indexer_queue
```
@DylanGriffith I'm curious about your thoughts on reporting failures that might happen when processing oplog.
Because this mode of operation is much more complex, it should be optional (i.e. under a command line flag).
implementing a new Submitter in the gitlab-elasticsearch-indexer, such that each index operation (either put or delete) is sent to a Redis ZSET (oplog) using the BlobID as key
@mbergeron I'm not sure I understand what this means. A Redis ZSET is a set and does not support key/value pairs, so the key is the only thing you can store, which means I don't think you will have anywhere to store the overall operation. This is why the original design involves storing just the ID of the thing being updated; we then need to look that thing up later to figure out what to do with it, which means either deleting it if it no longer exists or updating it if it does.
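A small sketch of that "look it up later" step, assuming a hypothetical `project_id:path` member format and the existing `Repository#blob_at`; the bulk action hashes are illustrative:

```ruby
# The sorted set member is only a string, e.g. "project_id:path" (format assumed).
project_id, path = member.split(':', 2)
project = Project.find_by_id(project_id)
blob = project && project.repository.blob_at(project.default_branch, path)

operation =
  if blob
    # File still exists on the default branch: (re)index its current content.
    [{ index: { _id: member } }, { project_id: project.id, path: path, content: blob.data }]
  else
    # Project or file is gone: remove the stale document.
    [{ delete: { _id: member } }]
  end
```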
Because this mode of operation is much more complex, it should be optional (i.e. under a command line flag).
I'm not really sure I follow but ideally we'll only support one way of indexing in future and not need to maintain 2 code paths. Initially we may well want a 2nd code path with a feature flag as a safety net if something isn't right but after it's working correctly it will make our lives much easier to only maintain one code path.
I'm not really sure I follow but ideally we'll only support one way of indexing in future and not need to maintain 2 code paths. Initially we may well want a 2nd code path with a feature flag as a safety net if something isn't right but after it's working correctly it will make our lives much easier to only maintain one code path.
Yeah, the idea was to use a feature flag to toggle the whole behavior in both Rails and gitlab-elasticsearch-indexer
Alright, so back to the drawing board — I had a nice chat with @dgruzd, and I think the following architecture would make sense, at least as a first step.
Proposal
Use a Redis ZSET as a project_id priority queue. The priority score should be calculated from:
Time (epoch) at which the job has been queued
Initial indexing penalty
As such, the score alone will help us alleviate the initial indexing problem by making sure incremental updates are treated with priority.
Using this strategy, only one queue is needed, which reduces the housekeeping that would be necessary with two different queues.
A cron worker will then pop N elements from the queue and process them in a batch with the gitlab-elasticsearch-indexer, which will be updated to support N repositories instead of a single repo. This will reduce the number of calls made to ElasticSearch by leveraging the buffering logic already present in the gitlab-elasticsearch-indexer.
Implementation path
Update the gitlab-elasticsearch-indexer to support multiple repositories
Move the indexation logic from ElasticCommitIndexerWorker to ElasticCommitIndexerCronWorker
Add the priority queue handling logic to ElasticCommitIndexerCronWorker
Update the ElasticCommitIndexerWorker job enqueuing to use the priority queue instead of Sidekiq
As such, the score alone will help us alleviate the initial indexing problem by making sure incremental updates are treated with priority.
@mbergeron this is a nice idea and I hope there is a way we can make it work, but one of the biggest challenges we ran into when implementing the batch queue the first time was figuring out how to safely remove items from the queue without risking losing any if the worker was killed halfway through execution. The way we did this was to separate "read N items from queue" from "delete N items from queue". The delete step happens at the end, and the only API to do this was the zremrangebyscore method, which takes the min and max score. We can rely on nothing being removed in the middle of this range in our current implementation because we use an atomically incrementing integer as the score when inserting. This guarantees that nothing is ever inserted between 2 scores in the set, which means that the order never changes and therefore the first 1000 keys are always the first 1000 keys.
If we went with the solution you proposed here we would lose that guarantee (I think), and therefore we may not be able to use the same safe approach of reading N items then deleting N items: if the order changes, we may delete the wrong items.
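For reference, a simplified sketch of the bookkeeping pattern being described (key names and the `process` step are illustrative, not the actual ProcessBookkeepingService code):

```ruby
ZSET  = 'elastic:bookkeeping:zset'  # illustrative key names
SCORE = 'elastic:bookkeeping:score'

# Enqueue: the score comes from an atomically incremented counter, so insertion
# order is total and the first N members never change while a batch is in flight.
Gitlab::Redis::SharedState.with do |redis|
  redis.zadd(ZSET, redis.incr(SCORE), item)
end

# Worker: read a batch, process it, then delete exactly that score range.
Gitlab::Redis::SharedState.with do |redis|
  specs = redis.zrangebyscore(ZSET, '-inf', '+inf', limit: [0, 1000], with_scores: true)
  next if specs.empty?

  process(specs.map(&:first)) # hypothetical processing step
  redis.zremrangebyscore(ZSET, specs.first.last, specs.last.last)
end
```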
You may be able to read more on !24298 (merged) to see if there are other ideas Nick explored to see if there is a way to accomplish the guarantee and also make use of priority queuing.
Other than that, though, the plan does make sense. If we can't use a priority queue we may just need to keep the ElasticCommitIndexerWorker around for a bit. Eventually we hope our integration reaches a steady state where we aren't doing large initial indexing anymore; once we've done that, I'd hope we can remove all the extra code paths related to it and operate under the assumption that a GitLab instance is either indexed or not, without maintaining extra workers just for this priority problem.
@DylanGriffith makes sense, my idea was to encode the epoch timestamp as the decimal portion of the score and use the integer part as the penalty.
0.1590409885 → incremental update
1.1590409885 → full update
Thus, we know that a score > 1 means a full update.
I liked this approach because it was simple encoding that is easy to grok and understand.
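That encoding as a tiny helper, just to make the idea concrete (names and the helper itself are made up):

```ruby
INITIAL_INDEXING_PENALTY = 1

# Integer part flags the job type, decimal part preserves enqueue order.
def score_for(initial_indexing:, now: Time.now.to_i)
  penalty = initial_indexing ? INITIAL_INDEXING_PENALTY : 0
  penalty + "0.#{now}".to_f
end

score_for(initial_indexing: false, now: 1590409885) # => 0.1590409885 (incremental update)
score_for(initial_indexing: true,  now: 1590409885) # => 1.1590409885 (full/initial update)
```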
If we went with the solution you proposed here we will lose that guarantee (I think) and therefore we may not be able to use the same safe approach of reading N items then deleting N items if the order changes we may delete the wrong items.
Well, this will actually cause a problem if you want to fan out the queue to multiple workers. Concurrent workers could read the same N elements and issue a double delete, which would then remove the first 2000 elements without having the second 1000 completed.
Single Worker
For the single-worker use case, I think it is safe to simply ZREM each job after it has been processed — only then would another update be enqueued for the same project_id, and there should be no duplicates (there should only be a single job per project_id).
We could also improve the worker to use concurrent processing, where the worker processes N repositories at a time using asynchronous/threaded execution.
N Workers
If we want to support multiple workers, I think we'll need move logic, where each worker owns a Bucket and fills it with N jobs from the main queue. At execution start, it first tries to process any jobs in its Bucket, then moves N jobs into it, then processes them. Any failures should be popped from the bucket. Optimistic locking on the project_id when processing would guard against multiple concurrent indexings of the same project.
This is much more complex, so I would refrain from going that route directly.
For the single-worker use case, I think it is safe to simply ZREM each job after it has been processed
I'm looking at the docs for ZREM and seeing that it doesn't support any argument for only deleting if the score hasn't changed. But when implementing !24298 (merged) one of the important race conditions we were trying to handle is:
Project A is updated
Bulk worker runs and loads Project A from the DB
Project A is updated again from somewhere else
Bulk worker sends out of date version of Project A to Elasticsearch
Bulk worker deletes Project A from redis sorted set
The way we avoid this race condition is to use the score, which moves Project A to the back of the queue again while processing is running, so it does not get deleted in the last step: we delete based on a score range and it won't be in that range anymore.
The other problem I can see, which is possibly not as big of an issue, is that you'll be issuing thousands of Redis requests compared to a single delete by range. We'd need to benchmark to understand if that's really an issue, though.
I wonder, though, whether you think there is any reason we can't use the exact same algorithm our ProcessBookkeepingService uses today, or are you just hoping to come up with a better one, or are there different constraints between these two problems that I'm missing? Ideally we'd be able to share much of the code in the end if it's practical; seeing as this is quite a complicated thing to get right, it would be good not to implement it multiple times.
Well, this will actually cause a problem if you want to fan out the queue to multiple workers
Agreed, worker concurrency is an issue we'll need to address at some point, and Nick and I have discussed this in the past a bit. Similar to your suggestion, I think the best way to accomplish this will be multiple queues. The way we discussed it before was to shard the queues based on the key, so that two equivalent elements are always in the same sorted set, but there can be multiple sorted sets and each worker processes one of them. We could also pre-shard so there are always, for example, 64 sorted sets, and then vary the number of workers with each picking up some subset of those shards.
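A few lines to make the pre-sharding idea concrete (shard count, key naming, and the surrounding `redis`/`project_id`/`path`/`score` variables are all assumptions for illustration):

```ruby
require 'zlib'

SHARDS = 64 # example pre-shard count

# Equivalent items always hash to the same shard, so per-shard de-duplication
# still works, while workers can each own a subset of the shards.
def shard_key(member)
  "elastic:incremental:zset:#{Zlib.crc32(member) % SHARDS}"
end

member = "#{project_id}:#{path}"
redis.zadd(shard_key(member), score, member)
```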
When we actually get to the point where we need to solve for this concurrency we can explore the different approaches and evaluate the moving jobs approach you suggest.
@DylanGriffith I'm currently working on the Ruby part of this, and I think I found a flaw in the current approach that would require moving some logic to the gitlab-elasticsearch-indexer
Whenever the default branch has been rebased/reset, we need to purge the associated repository from the index. This used to be done just prior to starting the indexing, which causes a small window where there would be no results for the repository.
As we now are running for N projects, this window might become way bigger, because all the purges are done before any indexation.
There are multiple ways to solve this:
We can keep this as-is and see, as the purge is an exceptional workflow.
We could move the purge logic to the gitlab-elasticsearch-indexer; then we could keep the same logic as before.
We could tag each document with an incremental indexation number; purging would then become idempotent and could be deferred until after the indexing. Simply delete all the documents where project_id = ? && index_seq < latest_indexation_seq
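If we went with that last option, the deferred purge would presumably be a delete-by-query along these lines (the index name and the `index_seq`/`latest_indexation_seq` fields are hypothetical; `client` is an elasticsearch-ruby client):

```ruby
client.delete_by_query(
  index: 'gitlab-production',        # illustrative index name
  routing: "project_#{project.id}",
  body: {
    query: {
      bool: {
        filter: [
          { term: { project_id: project.id } },
          # Drop every document left over from an earlier indexing pass.
          { range: { index_seq: { lt: latest_indexation_seq } } }
        ]
      }
    }
  }
)
```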
@mbergeron my preferred way to handle this would not be to purge at all but take a diff between the last indexed SHA and the new HEAD. I wrote about this in #32649 . I wonder if this is always possible or not. If it is possible to diff these SHAs and just figure out what has been created or deleted between these 2 SHAs (regardless if they share the same history) then I'd think that is the optimal way to handle it and it would make it non-exceptional for blobs. Only tricky part is we may also need to delete commits from the index as well by figuring out what commits were on the last branch but aren't anymore.
We could tag each document with an incremental indexation number; purging would then become idempotent and could be deferred until after the indexing. Simply delete all the documents where project_id = ? && index_seq < latest_indexation_seq
I don't fully understand this suggestion. Maybe you could elaborate a little on "incremental indexation number"? We don't want to update all documents every time we index a repo, right? Only the changed documents. But possibly I don't know where/how this number gets added.
@mbergeron my preferred way to handle this would not be to purge at all but take a diff between the last indexed SHA and the new HEAD. I wrote about this in #32649 . I wonder if this is always possible or not. If it is possible to diff these SHAs and just figure out what has been created or deleted between these 2 SHAs (regardless if they share the same history) then I'd think that is the optimal way to handle it and it would make it non-exceptional for blobs. Only tricky part is we may also need to delete commits from the index as well by figuring out what commits were on the last branch but aren't anymore.
This would be a pretty big shift in logic; IDK if we want to tackle this as part of this MR, as it will grow in size a lot.
With LAST_INDEX_HEAD being the SHA of the latest indexing and HEAD being the SHA to index.
The first thing to look out for is that the indexing could fail if git prune (git gc) runs after a rebase/reset but before the indexing, as one of the old commits would be deleted. We could solve that problem by using the Git ref logic[1] to track the LAST_INDEX_HEAD, as Git won't delete anything that has a ref pointing to it. This would ensure that LAST_INDEX_HEAD..HEAD is always a valid spec.
As for tracking the outstanding commits, I think we'd have to use some git-fu[2] to find a common ancestor between HEAD and LAST_INDEX_HEAD, then:
Purge all commits from ANCESTOR..LAST_INDEX_HEAD.
Index all commits from ANCESTOR..HEAD
As for the files indexing, I think we could keep the current logic.
All of this process could be done in the gitlab-elasticsearch-indexer if we want to.
```shell
# list all commits to index
git log --pretty=oneline "^LAST_INDEX_HEAD" HEAD

# list all commits to purge
git log --pretty=oneline LAST_INDEX_HEAD "^HEAD"

# alternatively, in a single call
git log --pretty=oneline --left-right LAST_INDEX_HEAD...HEAD
> … this commit should be indexed
< … this commit should be purged
```
@mbergeron that all sounds great. Thanks for figuring out the git capabilities to use for this. Do you think we'll tackle that in a separate MR like you suggest?
As we now are running for N projects, this window might become way bigger, because all the purges are done before any indexation.
If the described solution would make this MR too large, I actually think it's fine to live with a slight delay here for a rebased/force-pushed default branch, considering it is very rare and the impact will likely only be ~2 mins, so probably nobody will notice. WDYT?
If the described solution would make this MR too large, I actually think it's fine to live with a slight delay here for a rebased/force-pushed default branch, considering it is very rare and the impact will likely only be ~2 mins, so probably nobody will notice. WDYT?
That's exactly my thoughts, so I think we can do this in a separate MR.
As I said to @changzhengliu, using the git approach here is more about reducing the complexity than anything — I don't expect any performance improvement, except for the exceptional case where the LAST_INDEX_HEAD is no longer in the tree.
@mbergeron maybe I misunderstood you then. I thought the git approach you described above would allow us to figure out exactly which commits need to be deleted and added after a force push. If that's the case then I would have thought that is faster than our current algorithm which just purges everything and starts again.
I don't expect any performance improvement, except for the exceptional case where the LAST_INDEX_HEAD is no longer in the tree.
Let me rephrase that:
I don't expect a general performance improvement for indexing, because this optimization is targeted at a corner case. I don't know if we have any idea of the % of indexing runs that actually trigger this purge code, but I would suspect < 1%.
@mbergeron thanks for clarifying. Agreed that the frequency may be low to see a noticeable overall improvement. Though it is worth noting that this was a somewhat common behaviour for a specific customer on a large repo which kept backlogging their queues so in their case it would likely have been quite helpful. Which is what this issue is about #32649 so thanks for commenting there.
@mbergeron Since we've increased our concurrency in sidekiq substantially and observing what that means in gitlab-com/gl-infra/production#2280 (closed) I'm a little concerned that changing to using the batching cron process for initial updates will possibly slow down the backfills.
The reason being that during backfills our Elasticsearch sidekiq fleet autoscales to 8 pods with 2x concurrency which effectively means we can do 16 ElasticCommitIndexer jobs in parallel. This ends up consuming around 1000% CPU (ie. 10 cores). What I'm worried about is that our plan to move the backfills to sorted sets as well means we'll be limited to what a single core can do. Granted we expect better performance from the Elasticsearch side due to the fact that we're batching more efficiently but then again we may just need more CPU to actually keep up with this amount of work.
So I think we have a few options:
Do some benchmarking to get a sense of how throughput might be affected by this change (do you have any good ideas about how to do a realistic benchmark here without using production?)
Put the backfilling using our new sorted sets queue behind a feature flag so we can switch it off if performance regresses in production and then tackle the problem when we have real data
Actually implement the queue concurrency solution (@dgruzd had another idea that might be simpler which involves the cron worker only inspecting the min and max scores, dividing those into 10k ranges of min max scores and sending those min max score ranges to another sidekiq worker that actually picks, processes, and deletes those jobs based on the min and max score) which might actually be quite a bit simpler than implementing the sharding approach we've discussed before — a rough sketch of this option follows the list below
Leave the backfilling logic as is and only used the sorted sets for incremental indexing which is likely where we'll see the biggest benefit anyway and decide later how to tackle the backfilling logic and any duplicated code that comes with that
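A rough sketch of the range fan-out idea from the third option, assuming a hypothetical `ElasticIndexerRangeWorker` and the existing monotonically increasing scores (key name illustrative):

```ruby
ZSET = 'elastic:bookkeeping:zset' # illustrative key
RANGE_SIZE = 10_000               # each child job handles one score range

Gitlab::Redis::SharedState.with do |redis|
  min_entry = redis.zrange(ZSET, 0, 0, with_scores: true).first
  max_entry = redis.zrange(ZSET, -1, -1, with_scores: true).first
  next unless min_entry

  min_score = min_entry.last.to_i
  max_score = max_entry.last.to_i

  (min_score..max_score).step(RANGE_SIZE) do |from|
    to = [from + RANGE_SIZE - 1, max_score].min
    # Each child worker reads, processes, and deletes only members scored in [from, to].
    ElasticIndexerRangeWorker.perform_async(from, to) # hypothetical worker
  end
end
```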
@DylanGriffith thanks for the explanation. Just to be clear, I don't expect this implementation to be 16x faster for a single core, so we would most probably slow down.
Actually implement the queue concurrency solution (@dgruzd had another idea that might be simpler which involves the cron worker only inspecting the min and max scores, dividing those into 10k ranges of min max scores and sending those min max score ranges to another sidekiq worker that actually picks, processes, and deletes those jobs based on the min and max score) which might actually be quite a bit simpler than implementing the sharding approach we've discussed before
This MR is already too big so I will refrain from adding anything.
Leave the backfilling logic as is and only used the sorted sets for incremental indexing which is likely where we'll see the biggest benefit anyway and decide later how to tackle the backfilling logic and any duplicated code that comes with that
In this context, I think this is what we should do. I'll revert the changes to ElasticCommitIndexerWorker so it does the processing and make sure we are enqueuing jobs for it.
I'll also remove the Gitlab::Elastic::Indexer::InitialProcessor because we won't need it for now.
One thing we could do is leave the wiki in the new queue and see how it fares there, as it should have less load on it, WDYT?
After a thorough discussion with @DylanGriffith, we came to the conclusion that the current implementation would most likely decrease our indexing performance because it will be really hard to figure out the proper batch size of projects to process.
Also, after a review of !35036 (closed), we came to the conclusion that the current scheduling strategy is sub-optimal for the workload this MR would bring to the queue. I'll try to summarize our conclusions here, and hopefully we'll be able to figure out something.
Project as unit of work
This MR enqueues Project references to be indexed by the gitlab-elasticsearch-indexer, in bulk. Each bulk is run by a cron worker, which has some execution limits, such as time constraints:
Execution time should be less than 60 seconds, as the cron worker is scheduled every minute
Execution cannot be > 10 minutes or we lose the locking guarantees (undefined behavior)
Because of the opaque nature of the payload in the queue[1], it will be very hard to find a proper batch size that optimizes for:
Maximum bulk request size (i.e. bigger request, lower frequency)
Minimum execution time (i.e. avg < 60s, never > 10 min)
A value that is too low would lower the efficiency of the worker, as the cron job would sit waiting for the next schedule to continue processing.
A value that is too high would risk running into undefined behavior (losing the lock), as the variance between executions might be very high.
Possible solution
One solution here would be to use the same queuing paradigm as the BulkIndexer, which is to use a deterministic payload in the queue. @DylanGriffith suggested using a file-based approach, where each file would be enqueued (along with some metadata) and the gitlab-elasticsearch-indexer would support per-file processing (which goes against Gitaly's per-repository access model, but that could be figured out).
Single threaded queue
The problems above could mostly be alleviated if the queue could be processed concurrently, as multiple cron workers could be run to process the queue, removing the need for a lock.
But there's a catch: our current ZSET implementation doesn't support concurrency. I think this is a significant caveat of the current approach, as we are currently seeing (in this issue predominantly) that vertical scaling is much harder than horizontal scaling.
Possible solution
Use a Working Set approach where each worker is assigned a working set. The Working Set represents a transient space where the jobs to be processed are stored, to ensure the reliability of the queue. It should be filled atomically from the Processing Queue.
Each worker is responsible for the housekeeping of its Working Set, re-enqueuing failures to the Processing Queue. Alternatively, a Broker process could manage populating the Working Set and do the housekeeping.
Items in the Processing Queue should be unique, to keep with our current de-duplicating logic.
For instance, we could use the following Lua scripts to achieve that in Redis (these are simply POC and most probably need further tweaking):
LPUSHNX.lua
```lua
-- Push elements to a List only if they don't already exist.
-- Complexity O(N)
local items = {}

for i, item in ipairs(ARGV) do
  if redis.call("SISMEMBER", KEYS[1] .. ':set', item) == 0 then
    redis.call("SADD", KEYS[1] .. ':set', item)
    redis.call("RPUSH", KEYS[1], item)
    table.insert(items, item)
  end
end

return items
```
LRANGENX.lua
```lua
-- Moves a RANGE of elements to a working set `KEY:wset` and returns
-- the list of elements.
--
-- Complexity O(N*log(N))
local items = {}
local at = redis.call("TIME")

for i = 0, ARGV[1] do
  local item = redis.call("LPOP", KEYS[1])

  if not item then
    return items
  end

  redis.call("SREM", KEYS[1] .. ':set', item)
  redis.call("ZADD", KEYS[1] .. ':wset', i, item)
  table.insert(items, item)
end

return items
```
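A sketch of how a worker might drive those scripts from Ruby via EVAL (script paths, key names, and the processing step are assumptions, not part of the MR):

```ruby
# The scripts above, loaded from wherever they would live in the repo.
LPUSHNX  = File.read(Rails.root.join('lpushnx.lua'))
LRANGENX = File.read(Rails.root.join('lrangenx.lua'))

QUEUE = 'elastic:processing_queue' # the scripts derive :set / :wset from this key

Gitlab::Redis::SharedState.with do |redis|
  # Enqueue: only project ids that aren't already queued get pushed.
  redis.eval(LPUSHNX, keys: [QUEUE], argv: [project.id])

  # Worker: atomically move up to 1000 jobs into the working set, then process them.
  batch = redis.eval(LRANGENX, keys: [QUEUE], argv: [1000])

  batch.each do |project_id|
    index_project(project_id)                # hypothetical processing step
    redis.zrem("#{QUEUE}:wset", project_id)  # housekeeping once the job succeeds
  end
end
```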
State of this work
In the end, we are at a crossroads on whether this work is valuable and justified right now. The current solution, the ElasticCommitIndexerWorker, still does the job and supports concurrency out of the box via Sidekiq scheduling.
This is why we (@DylanGriffith and @mbergeron) are suggesting to take a hiatus on this work, and revisit later if we really need this.
This decision is not easy to make, as I've worked on this for a long time now, but taking a step back might be a good idea.
[1] It is impossible to evaluate the processing time for a Project ahead of time, as there are many factors that might affect it. This is in contrast with the way the BulkIndexer works, as it enqueues individual entities that will each result in a single document update.