Change the repository indexing to use a sorted-set incremental update model as well. This would require incremental updates to add one job per file to the sorted set, as described in the last paragraph of #34086 (comment 230326472). We can leave this for later since the git indexing queues are not growing wildly yet, and we already gain some efficiency today from using the last updated SHA to only index what changed.
Also, as noted in !24298 (comment 286562360), one key benefit this adds is that all indexing work is roughly equally sized and fanned out as early as possible. Fanning out as early as possible will make it much easier for SREs or developers watching queues to know whether the system is operating in a healthy way: initial indexing will show an expected giant burst in the number of jobs in the queue, followed by a gradual drop over time as we catch up. This is in contrast to today, where a few very large jobs hold up the queue for hours during processing, so all the queues keep growing over several hours while we're catching up, which makes the system look unhealthy while we have no data to tell us it's not. This becomes more important as we scale out to 100s or 1000s of groups at a time, since these delays may go from a couple of hours to a couple of days to catch up.
When catching up takes a couple of days, we may also want to ensure initial indexing uses a different queue from the incremental updates so that incremental updates aren't hugely delayed. I'm not sure if this change will make that harder or easier.
Proposal
Every blob indexing job only ever deals with a single file, but we process them in batches of 1000. During initial indexing we add one job to the sorted set per file that needs to be indexed, and incremental updates only add jobs for the updated files. A cron worker picks these jobs up in batches of 1000.
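A minimal sketch of how this could look on the Rails side, assuming a plain Redis connection via `Gitlab::Redis::SharedState` and illustrative key names, member format, and helper names (none of this is final code):

```ruby
ZSET_KEY   = 'elastic:code:incremental_updates' # illustrative key names
SCORE_KEY  = "#{ZSET_KEY}:score"
BATCH_SIZE = 1000

# Enqueue one entry per changed file (from initial indexing or the post-receive path).
def track_file!(project_id, path)
  Gitlab::Redis::SharedState.with do |redis|
    redis.zadd(ZSET_KEY, redis.incr(SCORE_KEY), "#{project_id}:#{path}")
  end
end

# Cron worker body: take the oldest 1000 entries and hand them to the indexer,
# grouped by project so Gitaly access stays repository-oriented.
def process_batch!
  Gitlab::Redis::SharedState.with do |redis|
    members = redis.zrangebyscore(ZSET_KEY, '-inf', '+inf', limit: [0, BATCH_SIZE])
    next if members.empty?

    members.group_by { |m| m.split(':', 2).first }.each do |project_id, refs|
      # ... invoke the Go indexer for this project with only these files ...
    end
    # Removal of processed members is a separate step; see the score-range
    # delete pattern discussed later in this thread.
  end
end
```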
Performance considerations
We don't want to regress into doing lots of indexing work in Ruby, so we'd still like to leverage the Go indexer: it is much more CPU efficient at the marshalling necessary for indexing these blobs, and going back to doing that in Ruby would be a performance regression. As such, we may want to either push all the file names into the Go indexer and have it load them, or, possibly even more efficiently, delegate the Redis queue popping to the Go indexer as well, though that may be problematic if we end up duplicating lots of queue-handling logic in Go.
It also may be problematic that the batches of 1000 may contain files from different repos. If this ends up being less performant from a Gitaly perspective this approach may not be ideal.
@nick.thomas @dgruzd what do you think about this issue? I'm particularly interested in highlighting the fanout nature of this, which has come up in a couple of threads already. I'm curious whether you believe that going all in on fanouts in every part of indexing is a good idea for the observability of indexing progress. I think addressing the problem of understanding progress is going to be key to hitting our next scaling goals here.
@DylanGriffith I'm not sure that handling repository updates a file at a time is going to be an improvement. Gitaly's access is repository-oriented; we definitely get a lot of speedup out of batching these updates per-repository, as you highlight in the description.
To get the list of files we need to add or remove, we run a git diff between two SHAs. This could be a git diff --stat at scheduling time; I don't know how much cheaper than actually generating the diff that is, but it worries me that we'd need it. Making scheduling expensive seems like a retrograde step.
There is deduplication we can do while still scheduling in units of repositories and SHAs; we can also add progress information for ongoing jobs and ensure that at-most-one per repository is running in parallel ( #32648 (closed) ). Would that combination work to address the concerns that motivate this issue?
A final note - a lot of the work that used to be done in the Ruby indexer has now moved from the Go indexer to Gitaly (which didn't exist when it was written). I'd imagine we're still getting some performance improvements from what is left in gitlab-elasticsearch-indexer, but if we were to consider a decomposition this dramatic, I wouldn't discard the idea of doing all the gitlab-side processing in Ruby out of hand.
I think using SHAs is a better approach. We might even want to create a similar bulk processor for gitlab-elasticsearch-indexer with deduplication.
If we schedule a list of commits for each task, we can deduplicate it on the fly. When we see that an old indexing task is a subset of a new one we can easily drop the old one.
Example
a, b, c, d, e is a linear commit history
```
a..b..c       <- this could be dropped
a..b..c..d    <- this also could be dropped
a..b..c..d..e <- only this task should be processed
```
Maybe it can be simpler than this, actually. If we're already storing the last indexed SHA for a project in our DB, and we only ever index the master branch, then our queue doesn't even need to include the from_sha or to_sha and we can just de-duplicate on project id. I'm wondering if there are flaws with that approach. It's a smaller step but allows de-duplication. It doesn't really address my other concern about not having much visibility into progress, but it helps with de-duplicating work, which is likely important during long queue delays because people keep pushing to master and we keep redoing the same work over and over.
@nick.thomas is there any reason we need to keep the from_sha and to_sha in the payload? Couldn't we just always index up to master, given we already persist the last indexed SHA, so we're just diffing from the last indexed SHA to master?
@DylanGriffith no, I think we could remove them from the payload. Applying the sidekiq queue deduplication logic then gets us into a good place here, I think.
(Whatever we do, we have to ensure that a rebase on master does the right thing - but we don't need the SHAs in the payload to do that. We can just always index HEAD of master and check whether the last known commit is a parent of it or not)
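A rough sketch of that check, assuming the existing `index_status.last_commit` bookkeeping and `Repository#ancestor?`; the two helper methods are hypothetical names for illustration only:

```ruby
# Sketch only: decide between an incremental pass and a full purge + reindex.
head_sha = project.repository.commit(project.default_branch)&.sha
last_sha = project.index_status&.last_commit # assumed bookkeeping column

if last_sha.blank? || !project.repository.ancestor?(last_sha, head_sha)
  # First indexing, or master was rebased/force-pushed: the old SHA is not a
  # parent of HEAD, so fall back to a full pass for this repository.
  full_reindex!(project)
else
  # Normal case: only what changed between last_sha and head_sha needs work.
  incremental_index!(project, from: last_sha, to: head_sha)
end
```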
Another benefit of doing this: we may currently be sending small payloads too frequently to Elasticsearch, since we index every push, so batching up every minute could already be quite an efficiency improvement. Today we're already using the bulk API, but our bulks may be very small because a single push may be very small, while our cluster is quite big and can handle very large bulks efficiently.
Currently, roll-outs to new groups (i.e. initial indexing) are handled by the same queue as updates to repositories, which means that large roll-outs like https://gitlab.com/gitlab-org/gitlab/-/issues/211756 block updates to existing indexing for long periods of time (that one took about 12 hours). Working on this issue separates initial indexing from updates, so these delays are less problematic for projects already in the index, and it will make indexing faster as well, helping to reduce the time it takes to add more groups to the index.
EDIT: This is already in %12.10 so I think that's fine
@DylanGriffith @nick.thomas After reading through this and having a discussion with @dgruzd, it seems like the scope of this issue is now smaller than the initial proposal.
The current proposal is to stop wasting time re-indexing commits that either:
To do this, we want to leverage Sidekiq de-duplication on the task queue, such that there can only be a single task per repository (using project_id); a sketch follows the list below. Each task can then be seen as an indexing trigger event, ordered in time.
In order to achieve this, we need to make sure that:
We can remove sha_from, sha_to from the task parameter
Enable the de-duplication logic on the queue
Add logic in the worker to handle the rebase scenario
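Roughly what the slimmed-down worker could look like (the `idempotent!`/`deduplicate` worker attributes and the indexer call are approximations of the existing GitLab plumbing, not final code; the rebase handling is elided):

```ruby
# With sha_from/sha_to removed, project_id (and the wiki flag) is the whole
# payload, so duplicate pushes collapse into a single queued job.
class ElasticCommitIndexerWorker
  include ApplicationWorker

  idempotent!
  deduplicate :until_executed

  def perform(project_id, wiki = false)
    project = Project.find_by_id(project_id)
    return unless project

    # Always index from the last persisted SHA up to HEAD of the default branch.
    Gitlab::Elastic::Indexer.new(project, wiki: wiki).run
  end
end
```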
Please note that this work will not solve the observability issue, as the workload of each task will still be unknown to the worker and as such the relationship between the queue size and the total workload is non-linear.
@mbergeron what you are describing is valuable and certainly smaller in scope so we can do that.
I am personally inclined to additionally explore an iteration on that where updates are added to a sorted set and processed in bulk such that we aren't sending small updates to the cluster. In order to be maximally efficient and considering repo indexing is the bulk of our workloads then we need to eventually stop sending many small payloads to Elasticsearch.
If we keep processing one project at a time then, more often than not, our payloads aren't going to exceed 100k, since that's already a large merge to master. But we want to be sending payloads of 10M on average (or larger in future, but this is the default setting for bulk update size and what's used at the moment on .com). I think the only way to get the average bulk size to that point will be to process multiple project updates at the same time, and hence it seems logical it would follow a similar implementation pattern to the other one with sorted sets, etc. The only additional quirk with project updates is comparing the last indexed SHA, but I think that can still work with this architecture.
I'd still be keen to explore the idea of putting the file names in the sorted sets as well (like the original proposal), so long as we could find a way to efficiently query those files from Gitaly, which I think would really just rely on grouping by project before sending the requests to Gitaly for the files. But even without that we can use the sorted sets and buffering plus comparing to the last indexed SHA.
My recommendation if it's easy is to remove the extra args as you proposed and merge that since it's a small MR (perhaps create a separate issue for that). I think we get de-duplication for free so long as the worker is marked idempotent.
Then I'd propose we implement this issue as originally described (using the file name as key is optional, but batching projects together is the point) to ensure we index source code more efficiently, since those queues do backlog quite a bit during initial indexing periods and the de-duplication alone isn't likely to speed things up that much.
I am personally inclined to additionally explore an iteration on that where updates are added to a sorted set and processed in bulk such that we aren't sending small updates to the cluster. In order to be maximally efficient and considering repo indexing is the bulk of our workloads then we need to eventually stop sending many small payloads to Elasticsearch.
I understand the goal here, but IMO the only way you'll get a real improvement in either speed or observability is if there's a way for the scheduler to know the workload (i.e. the amount of work) whenever it runs a batch.
And this would require some logic whenever a task is enqueued, so that it can be weighted.
The proposal to go by project_id, file doesn't strike me as a large optimization, as the worker would process each project sequentially, so you'd basically only save the spawning of a new worker.
Then I'd propose we implement this issue as originally described (using the file name as key is optional, but batching projects together is the point) to ensure we index source code more efficiently, since those queues do backlog quite a bit during initial indexing periods and the de-duplication alone isn't likely to speed things up that much.
I'm still puzzled by how this would speed things up, in my mental model.
Great. As I said, removing some arguments and de-duplicating in Sidekiq is indeed an improvement on what we have today. I recommend you make this change since it is small and valuable. That said, it is not what this issue describes, so please create a separate issue; I think it's fine to prioritize it ahead of this issue because it gets us incrementally closer in some ways. Or don't bother creating a new issue and just do it. Either way, don't close this issue, as it still contains work we should do and we don't want to lose that context.
I understand the goal here, but IMO the only way you'll get a real improvement in either speed or observability is if there's a way for the scheduler to know the workload (i.e. the amount of work) whenever it runs a batch.
I don't follow. But basically the benefits are pretty much the same as the benefits we got from implementing #34086 (closed) . Elasticsearch behaves better when you send it few large bulk updates compared to many smaller updates. When we process updates to projects individually then we are likely processing many small updates. If we can process updates to many projects at the same time in a single request to Elasticsearch we improve the performance characteristics of Elasticsearch.
Much like !24298 (merged) we would only be sending requests to the cluster every minute and they would be a batch containing all updates that have happened for the last minute. Currently we could be sending requests to the cluster far more frequently and this will behave poorly as we add more groups (projects) to the index.
And this would require some logic whenever a task is enqueued, so that it can be weighted.
I don't think we need to know the workload ahead of time like you are describing. Just stick the update in the sorted set and pull them off in batches every minute and process that batch. A batch over the last minute will be larger than an individual update and thus Elasticsearch performs better.
The proposal to go by project_id, file doesn't strike me as a large optimization, as the worker would process each project sequentially, so you'd basically only save the spawning of a new worker.
I propose the worker would process the projects in a batch. Specifically, the worker would send bulk requests to Elasticsearch that contain updates to multiple different projects in the same request. This will obviously require some changes to the worker, but that is what this issue describes. Separating out the file name provides the additional benefit of fine-grained de-duplication down to what actually changed. It would also allow us to remove last-indexed state from our database, which means these indexers would never write anything to Postgres, and read-only processes help with performance. But again, I'd be fine with an implementation that didn't split out the file name in the first attempt, because it may have negative performance characteristics for our Gitaly calls, or at least it would likely require more Gitaly APIs or more complex Ruby code to optimize how we query Gitaly.
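For illustration only, the kind of combined bulk body this implies, mixing blobs from several projects in one request (index name, document ids, and routing values are made up; parent/child join details are omitted; `client` is a standard elasticsearch-ruby client):

```ruby
# One bulk request containing blob operations for several different projects.
operations = [
  { index: { _index: 'gitlab-production', _id: 'blob_1_app/models/user.rb', routing: 'project_1' } },
  { type: 'blob', project_id: 1, path: 'app/models/user.rb', content: '...' },
  { index: { _index: 'gitlab-production', _id: 'blob_2_README.md', routing: 'project_2' } },
  { type: 'blob', project_id: 2, path: 'README.md', content: '...' },
  { delete: { _index: 'gitlab-production', _id: 'blob_3_lib/old.rb', routing: 'project_3' } }
]

client.bulk(body: operations)
```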
I don't follow. But basically the benefits are pretty much the same as the benefits we got from implementing #34086 (closed) . Elasticsearch behaves better when you send it few large bulk updates compared to many smaller updates. When we process updates to projects individually then we are likely processing many small updates. If we can process updates to many projects at the same time in a single request to Elasticsearch we improve the performance characteristics of Elasticsearch.
Reading this comment made it much clearer in my mind: the bottleneck here is writing to ElasticSearch.
This was unclear in my mind, as I thought the main issue was reading the data from gitaly.
I don't think we need to know the workload ahead of time like you are describing. Just stick the update in the sorted set and pull them off in batches every minute and process that batch. A batch over the last minute will be larger than an individual update and thus Elasticsearch performs better.
Again, that ties in the false assumption that extracting the data was the issue. It makes more sense now.
@DylanGriffith thanks again for the clarifications, I'll dive into the first iteration of this and then we can see what we can move forward after.
Reading this comment made it much clearer in my mind: the bottleneck here is writing to ElasticSearch. This was unclear in my mind, as I thought the main issue was reading the data from gitaly.
@mbergeron yes that's correct with regards to this issue. This issue plans to address long term scalability from the Elasticsearch side. While it's not a bottle-neck today we will be massively increasing the load of this indexing writing in future as we add more customers. As such this issue is based on Elasticsearch recommendations to always try and batch up larger updates less frequently.
Gitaly performance is mentioned because it's important that the changes we make are not at the expense of Gitaly as it could reasonably become a bottle-neck too and the impacts are worse in that case because overloading Gitaly can affect many parts of GitLab outside of the scope of our team and we don't want that.
As it says, it should reverse the commits, but it doesn't.
I'm investigating, but I think it has to do with the fact that the index cleanup code is only triggered if the commit we are indexing from is no longer in the repository, instead of checking whether it is in the master branch.
For instance:
```shell
git commit -m "test 1"
git commit -m "test 2"
# index contains `test 2`
git reset --hard HEAD~1
# index still contains `test 2`
```
I think this is a side-effect of the caching in our ElasticSearch results.
@dgruzd is there a way to disable the result cache when running specs, as it makes everything harder. More generally, can you point me at the result cache code?
@mbergeron I don't remember that we have any results caching, maybe this is a problem with refresh_interval?
In specs we use refresh_index! to make sure all changes are committed and visible since elasticsearch writes are not reflected right away. By default we will see old results for approximately one second because of the refresh_interval setting.
Just a note that we'd actually prefer to invoke ensure_elasticsearch_index! consistently for all tests to avoid confusion since it does resolve 2 async processes.
After a discussion with @dgruzd, there was a problem with the test setup code where it wouldn't index the Project that owns the rest of the entities. As such, some queries (like the delete_index_…) that leverage the parent-child relationship would end up failing.
I've been looking at multiple avenues for this and I settled on implementing a new Submitter in the gitlab-elasticsearch-indexer, such that each index operation (either put or delete) is sent to a Redis ZSET (oplog) using the BlobID as key. This will ensure there can only be a single operation pending for a specific file.
I will then create a new cron worker that loads N operations from the queue, and submit them in bulk to Elastic by combining operations.
```mermaid
graph TD;
  ElasticSearch

  subgraph RoR
    Controller
  end

  subgraph Redis
    commit_indexer_queue(ElasticSearchCommitIndexer)
    oplog(ElasticSearchCommitIndexerOpLog)
  end

  subgraph Sidekiq
    commit_indexer_queue --> ElasticSearchCommitIndexer
    ElasticSearchCommitIndexer -- Blob Operation --> oplog
    oplog -. N Blob Operations .-> ElasticOpLogCronWorker
  end

  ElasticOpLogCronWorker --> ElasticSearch
  Controller -- project_id --> commit_indexer_queue
```
@DylanGriffith I'm curious about your thoughts on reporting failures that might happen when processing oplog.
Because this mode of operation is much more complex, it should be optional (i.e. under a command line flag).
implementing a new Submitter in the gitlab-elasticsearch-indexer, such that each index operation (either put or delete) is sent to a Redis ZSET (oplog) using the BlobID as key
@mbergeron I'm not sure I understand what this means. A Redis ZSET is a set and does not support key/value pairs, so the key is the only thing you can store, which means I don't think you will have anywhere to store the overall operation. This is why the original design involves storing just the ID of the thing being updated; we then need to look that thing up later to figure out what to do with it, which means either deleting it if it no longer exists or updating it if it does.
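A small sketch of that "look it up later" step, assuming a hypothetical `project_id:path` member format and the existing `Repository#blob_at`; the bulk action hashes are illustrative:

```ruby
# The sorted set member is only a string, e.g. "project_id:path" (format assumed).
project_id, path = member.split(':', 2)
project = Project.find_by_id(project_id)
blob = project && project.repository.blob_at(project.default_branch, path)

operation =
  if blob
    # File still exists on the default branch: (re)index its current content.
    [{ index: { _id: member } }, { project_id: project.id, path: path, content: blob.data }]
  else
    # Project or file is gone: remove the stale document.
    [{ delete: { _id: member } }]
  end
```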
Because this mode of operation is much more complex, it should be optional (i.e. under a command line flag).
I'm not really sure I follow but ideally we'll only support one way of indexing in future and not need to maintain 2 code paths. Initially we may well want a 2nd code path with a feature flag as a safety net if something isn't right but after it's working correctly it will make our lives much easier to only maintain one code path.
I'm not really sure I follow but ideally we'll only support one way of indexing in future and not need to maintain 2 code paths. Initially we may well want a 2nd code path with a feature flag as a safety net if something isn't right but after it's working correctly it will make our lives much easier to only maintain one code path.
Yeah, the idea was to use a feature flag to toggle the whole behavior in both Rails and gitlab-elasticsearch-indexer
Alright, so back to the drawing board — I had a nice chat with @dgruzd, and I think the following architecture would make sense, at least as a first step.
Proposal
Use a Redis ZSET as a project_id priority queue. The priority score should be calculated from:
Time (epoch) at which the job has been queued
Initial indexing penalty
As such, the score alone will help us alleviate the initial indexing problem by making sure incremental updates are treated with priority.
Using this strategy, only one queue is needed, which reduces the housekeeping that would be necessary with two different queues.
A cron worker will then pop N elements from the queue and process them in a batch with the gitlab-elasticsearch-indexer, which will be updated to support N repositories instead of a single repo. This will reduce the number of calls made to ElasticSearch by leveraging the buffering logic already present in the gitlab-elasticsearch-indexer.
Implementation path
Update the gitlab-elasticsearch-indexer to support multiple repositories
Move the indexation logic from ElasticCommitIndexerWorker to ElasticCommitIndexerCronWorker
Add the priority queue handling logic to ElasticCommitIndexerCronWorker
Update the ElasticCommitIndexerWorker job enqueuing to use the priority queue instead of Sidekiq
As such, the score alone will help us alleviate the initial indexing problem by making sure incremental updates are treated with priority.
@mbergeron this is a nice idea and I hope there is a way we can make it work, but one of the biggest challenges we ran into when implementing the batch queue the first time was figuring out how to safely remove items from the queue without risking losing any if the worker was killed halfway through execution. The way we did this was to separate "read N items from queue" from "delete N items from queue". The delete step happens at the end, and the only API to do this was the zremrangebyscore method, which takes the min and max score. We can rely on nothing being removed in the middle of this range in our current implementation because we use an atomically incrementing integer as the score when inserting. This guarantees that nothing is ever inserted between 2 scores in the set, which means that the order never changes and therefore the first 1000 keys are always the first 1000 keys.
If we went with the solution you proposed here we would lose that guarantee (I think), and therefore we may not be able to use the same safe approach of reading N items then deleting N items: if the order changes, we may delete the wrong items.
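For reference, a simplified sketch of the bookkeeping pattern being described (key names and the `process` step are illustrative, not the actual ProcessBookkeepingService code):

```ruby
ZSET  = 'elastic:bookkeeping:zset'  # illustrative key names
SCORE = 'elastic:bookkeeping:score'

# Enqueue: the score comes from an atomically incremented counter, so insertion
# order is total and the first N members never change while a batch is in flight.
Gitlab::Redis::SharedState.with do |redis|
  redis.zadd(ZSET, redis.incr(SCORE), item)
end

# Worker: read a batch, process it, then delete exactly that score range.
Gitlab::Redis::SharedState.with do |redis|
  specs = redis.zrangebyscore(ZSET, '-inf', '+inf', limit: [0, 1000], with_scores: true)
  next if specs.empty?

  process(specs.map(&:first)) # hypothetical processing step
  redis.zremrangebyscore(ZSET, specs.first.last, specs.last.last)
end
```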
You may be able to read more on !24298 (merged) to see if there are other ideas Nick explored to see if there is a way to accomplish the guarantee and also make use of priority queuing.
Other than that, though, the plan does make sense. If we can't use a priority queue we may just need to keep the ElasticCommitIndexerWorker around for a bit. Eventually we hope our integration reaches a steady state where we aren't doing large initial indexing anymore; once we've done that, I'd hope we can remove all the extra code paths related to it and operate under the assumption that a GitLab instance is either indexed or not, without maintaining extra workers just for this priority problem.
@DylanGriffith makes sense, my idea was to encode the epoch timestamp as the decimal portion of the score and use the integer part as the penalty.
0.1590409885 → incremental update
1.1590409885 → full update
Thus, we know that a score > 1 means a full update.
I liked this approach because it was simple encoding that is easy to grok and understand.
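That encoding as a tiny helper, just to make the idea concrete (names and the helper itself are made up):

```ruby
INITIAL_INDEXING_PENALTY = 1

# Integer part flags the job type, decimal part preserves enqueue order.
def score_for(initial_indexing:, now: Time.now.to_i)
  penalty = initial_indexing ? INITIAL_INDEXING_PENALTY : 0
  penalty + "0.#{now}".to_f
end

score_for(initial_indexing: false, now: 1590409885) # => 0.1590409885 (incremental update)
score_for(initial_indexing: true,  now: 1590409885) # => 1.1590409885 (full/initial update)
```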
If we went with the solution you proposed here we will lose that guarantee (I think) and therefore we may not be able to use the same safe approach of reading N items then deleting N items if the order changes we may delete the wrong items.
Well, this will actually cause a problem if you want to fan out the queue to multiple workers. Concurrent workers could read the same N elements and issue a double delete, which would then remove the first 2000 elements without having the second 1000 completed.
Single Worker
For the single-worker use case, I think it is safe to simply ZREM each job after it has been processed — only then would another update be enqueued for the same project_id, and there should be no duplicates (there should only be a single job per project_id).
We could also improve the worker to use concurrent processing, where the worker processes N repositories at a time using asynchronous/threaded execution.
N Workers
If we want to support multiple workers, I think we'll need move logic, where each worker owns a Bucket and fills it with N jobs from the main queue. At execution start, it first tries to process any jobs in its Bucket, then moves N jobs into it, then processes them. Any failures should be popped from the bucket. Optimistic locking on the project_id when processing would guard against multiple concurrent indexings of the same project.
This is much more complex, so I would refrain from going that route directly.
For the single-worker use case, I think it is safe to simply ZREM each job after it has been processed
I'm looking at the docs for ZREM and seeing that it doesn't support any argument for only deleting if the score hasn't changed. But when implementing !24298 (merged) one of the important race conditions we were trying to handle is:
Project A is updated
Bulk worker runs and loads Project A from the DB
Project A is updated again from somewhere else
Bulk worker sends out of date version of Project A to Elasticsearch
Bulk worker deletes Project A from redis sorted set
The way we avoid this race condition is to use the score, which moves Project A to the back of the queue again while processing is running, so it does not get deleted in the last step: we delete based on a score range and it won't be in that range anymore.
The other problem I can see, which is possibly not as big of an issue, is that you'll be issuing thousands of Redis requests compared to a single delete by range. We'd need to benchmark to understand if that's really an issue, though.
I wonder, though, whether you think there is any reason we can't use the exact same algorithm our ProcessBookkeepingService uses today, or are you just hoping to come up with a better one, or are there different constraints between these two problems that I'm missing? Ideally we'd be able to share much of the code in the end if it's practical; seeing as this is quite a complicated thing to get right, it would be good not to implement it multiple times.
Well, this will actually cause a problem if you want to fan out the queue to multiple workers
Agreed, worker concurrency is an issue we'll need to address at some point, and Nick and I have discussed this in the past a bit. Similar to your suggestion, I think the best way to accomplish this will be multiple queues. The way we discussed it before was to shard the queues based on the key, so that two equivalent elements are always in the same sorted set, but there can be multiple sorted sets and each worker processes one of them. We could also pre-shard so there are always, for example, 64 sorted sets, and then vary the number of workers with each picking up some subset of those shards.
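A few lines to make the pre-sharding idea concrete (shard count, key naming, and the surrounding `redis`/`project_id`/`path`/`score` variables are all assumptions for illustration):

```ruby
require 'zlib'

SHARDS = 64 # example pre-shard count

# Equivalent items always hash to the same shard, so per-shard de-duplication
# still works, while workers can each own a subset of the shards.
def shard_key(member)
  "elastic:incremental:zset:#{Zlib.crc32(member) % SHARDS}"
end

member = "#{project_id}:#{path}"
redis.zadd(shard_key(member), score, member)
```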
When we actually get to the point where we need to solve for this concurrency we can explore the different approaches and evaluate the moving jobs approach you suggest.
@DylanGriffith I'm currently working on the Ruby part of this, and I think I found a flaw in the current approach that would require moving some logic to the gitlab-elasticsearch-indexer
Whenever the default branch has been rebased/reset, we need to purge the associated repository from the index. This used to be done just prior to starting the indexing, which causes a small window where there would be no results for the repository.
As we now are running for N projects, this window might become way bigger, because all the purges are done before any indexation.
There are multiple ways to solve this:
We can keep this as-is and see, as the purge is an exceptional workflow.
We could move the purge logic to the gitlab-elasticsearch-indexer; then we could keep the same logic as before.
We could tag each document with an incremental indexation number; purging would then become idempotent and could be deferred until after the indexing. Simply delete all the documents where project_id = ? && index_seq < latest_indexation_seq
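If we went with that last option, the deferred purge would presumably be a delete-by-query along these lines (the index name and the `index_seq`/`latest_indexation_seq` fields are hypothetical; `client` is an elasticsearch-ruby client):

```ruby
client.delete_by_query(
  index: 'gitlab-production',        # illustrative index name
  routing: "project_#{project.id}",
  body: {
    query: {
      bool: {
        filter: [
          { term: { project_id: project.id } },
          # Drop every document left over from an earlier indexing pass.
          { range: { index_seq: { lt: latest_indexation_seq } } }
        ]
      }
    }
  }
)
```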
@mbergeron my preferred way to handle this would not be to purge at all but take a diff between the last indexed SHA and the new HEAD. I wrote about this in #32649 . I wonder if this is always possible or not. If it is possible to diff these SHAs and just figure out what has been created or deleted between these 2 SHAs (regardless if they share the same history) then I'd think that is the optimal way to handle it and it would make it non-exceptional for blobs. Only tricky part is we may also need to delete commits from the index as well by figuring out what commits were on the last branch but aren't anymore.
We could tag each document with an incremental indexation number; purging would then become idempotent and could be deferred until after the indexing. Simply delete all the documents where project_id = ? && index_seq < latest_indexation_seq
I don't fully understand this suggestion. Maybe you could elaborate a little on "incremental indexation number"? We don't want to update all documents every time we index a repo, right? Only the changed documents. But possibly I don't know where/how this number gets added.
@mbergeron my preferred way to handle this would not be to purge at all but take a diff between the last indexed SHA and the new HEAD. I wrote about this in #32649 . I wonder if this is always possible or not. If it is possible to diff these SHAs and just figure out what has been created or deleted between these 2 SHAs (regardless if they share the same history) then I'd think that is the optimal way to handle it and it would make it non-exceptional for blobs. Only tricky part is we may also need to delete commits from the index as well by figuring out what commits were on the last branch but aren't anymore.
This would be a pretty big shift in logic; IDK if we want to tackle this as part of this MR, as it will grow in size a lot.
With LAST_INDEX_HEAD being the SHA of the latest indexing and HEAD being the SHA to index.
The first thing to look out for is that the indexing could fail if git prune (git gc) runs after a rebase/reset but before the indexing, as one of the old commits would be deleted. We could solve that problem by using the Git ref logic[1] to track the LAST_INDEX_HEAD, as Git won't delete anything that has a ref pointing to it. This would ensure that LAST_INDEX_HEAD..HEAD is always a valid spec.
As for tracking the outstanding commits, I think we'd have to use some git-fu[2] to find a common ancestor between HEAD and LAST_INDEX_HEAD, then:
Purge all commits from ANCESTOR..LAST_INDEX_HEAD.
Index all commits from ANCESTOR..HEAD
As for the files indexing, I think we could keep the current logic.
All of this process could be done in the gitlab-elasticsearch-indexer if we want to.
```shell
# list all commits to index
git log --pretty=oneline "^LAST_INDEX_HEAD" HEAD

# list all commits to purge
git log --pretty=oneline LAST_INDEX_HEAD "^HEAD"

# alternatively, in a single call
git log --pretty=oneline --left-right LAST_INDEX_HEAD...HEAD
> … this commit should be indexed
< … this commit should be purged
```
@mbergeron that all sounds great. Thanks for figuring out the git capabilities to use for this. Do you think we'll tackle that in a separate MR like you suggest?
As we now are running for N projects, this window might become way bigger, because all the purges are done before any indexation.
If the described solution would make this MR too large, I actually think it's fine to live with a slight delay here for a rebased/force-pushed default branch, considering it is very rare and the impact will likely only be ~2 mins, so probably nobody will notice. WDYT?
If the described solution would make this MR too large, I actually think it's fine to live with a slight delay here for a rebased/force-pushed default branch, considering it is very rare and the impact will likely only be ~2 mins, so probably nobody will notice. WDYT?
That's exactly my thoughts, so I think we can do this in a separate MR.
As I said to @changzhengliu, using the git approach here is more about reducing the complexity than anything — I don't expect any performance improvement, except for the exceptional case where the LAST_INDEX_HEAD is no longer in the tree.
@mbergeron maybe I misunderstood you then. I thought the git approach you described above would allow us to figure out exactly which commits need to be deleted and added after a force push. If that's the case then I would have thought that is faster than our current algorithm which just purges everything and starts again.
I don't expect any performance improvement, except for the exceptional case where the LAST_INDEX_HEAD is no longer in the tree.
Let me rephrase that:
I don't expect a general performance improvement for indexing, because this optimization is targeted at a corner case. I don't know if we have any idea of the % of indexing runs that actually trigger this purge code, but I would suspect < 1%.
@mbergeron thanks for clarifying. Agreed that the frequency may be low to see a noticeable overall improvement. Though it is worth noting that this was a somewhat common behaviour for a specific customer on a large repo which kept backlogging their queues so in their case it would likely have been quite helpful. Which is what this issue is about #32649 so thanks for commenting there.
@mbergeron Since we've increased our concurrency in sidekiq substantially and observing what that means in gitlab-com/gl-infra/production#2280 (closed) I'm a little concerned that changing to using the batching cron process for initial updates will possibly slow down the backfills.
The reason being that during backfills our Elasticsearch sidekiq fleet autoscales to 8 pods with 2x concurrency which effectively means we can do 16 ElasticCommitIndexer jobs in parallel. This ends up consuming around 1000% CPU (ie. 10 cores). What I'm worried about is that our plan to move the backfills to sorted sets as well means we'll be limited to what a single core can do. Granted we expect better performance from the Elasticsearch side due to the fact that we're batching more efficiently but then again we may just need more CPU to actually keep up with this amount of work.
So I think we have a few options:
Do some benchmarking to get a sense of how throughput might be affected by this change (do you have any good ideas about how to do a realistic benchmark here without using production?)
Put the backfilling using our new sorted sets queue behind a feature flag so we can switch it off if performance regresses in production and then tackle the problem when we have real data
Actually implement the queue concurrency solution (@dgruzd had another idea that might be simpler which involves the cron worker only inspecting the min and max scores, dividing those into 10k ranges of min max scores and sending those min max score ranges to another sidekiq worker that actually picks, processes, and deletes those jobs based on the min and max score) which might actually be quite a bit simpler than implementing the sharding approach we've discussed before — a rough sketch of this option follows the list below
Leave the backfilling logic as is and only used the sorted sets for incremental indexing which is likely where we'll see the biggest benefit anyway and decide later how to tackle the backfilling logic and any duplicated code that comes with that
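A rough sketch of the range fan-out idea from the third option, assuming a hypothetical `ElasticIndexerRangeWorker` and the existing monotonically increasing scores (key name illustrative):

```ruby
ZSET = 'elastic:bookkeeping:zset' # illustrative key
RANGE_SIZE = 10_000               # each child job handles one score range

Gitlab::Redis::SharedState.with do |redis|
  min_entry = redis.zrange(ZSET, 0, 0, with_scores: true).first
  max_entry = redis.zrange(ZSET, -1, -1, with_scores: true).first
  next unless min_entry

  min_score = min_entry.last.to_i
  max_score = max_entry.last.to_i

  (min_score..max_score).step(RANGE_SIZE) do |from|
    to = [from + RANGE_SIZE - 1, max_score].min
    # Each child worker reads, processes, and deletes only members scored in [from, to].
    ElasticIndexerRangeWorker.perform_async(from, to) # hypothetical worker
  end
end
```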
@DylanGriffith thanks for the explanation. Just to be clear, I don't expect this implementation to be 16x faster for a single core, so we would most probably slow down.
Actually implement the queue concurrency solution (@dgruzd had another idea that might be simpler which involves the cron worker only inspecting the min and max scores, dividing those into 10k ranges of min max scores and sending those min max score ranges to another sidekiq worker that actually picks, processes, and deletes those jobs based on the min and max score) which might actually be quite a bit simpler than implementing the sharding approach we've discussed before
This MR is already too big so I will refrain from adding anything.
Leave the backfilling logic as is and only used the sorted sets for incremental indexing which is likely where we'll see the biggest benefit anyway and decide later how to tackle the backfilling logic and any duplicated code that comes with that
In this context, I think this is what we should do. I'll revert the changes to ElasticCommitIndexerWorker so it does the processing and make sure we are enqueuing jobs for it.
I'll also remove the Gitlab::Elastic::Indexer::InitialProcessor because we won't need it for now.
One thing we could do is leave the wiki in the new queue and see how it fares there, as it should have less load on it, WDYT?
After a thorough discussion with @DylanGriffith, we came to the conclusion that the current implementation would most likely decrease our indexing performance because it will be really hard to figure out the proper batch size of projects to process.
Also, after a review of !35036 (closed), we came to the conclusion that the current scheduling strategy is sub-optimal for the workload this MR would bring to the queue. I'll try to summarize our conclusions here, and hopefully we'll be able to figure out something.
Project as unit of work
This MR enqueues Project references to be indexed by the gitlab-elasticsearch-indexer, in bulk. Each bulk is run by a cron worker, which has some execution limits, such as time constraints:
Execution time should be less than 60 seconds, as the cron worker is scheduled every minute
Execution cannot be > 10 minutes or we lose the locking guarantees (undefined behavior)
Because of the opaque nature of the payload in the queue[1], it will be very hard to find a proper batch size that optimizes for:
Maximum bulk request size (i.e. bigger request, lower frequency)
Minimum execution time (i.e. avg < 60s, never > 10 min)
A value that is too low would lower the efficiency of the worker, as the cron job would sit waiting for the next schedule to continue processing.
A value that is too high would risk running into undefined behavior (losing the lock), as the variance between executions might be very high.
Possible solution
One solution here would be to use the same queuing paradigm as the BulkIndexer, which is to use a deterministic payload in the queue. @DylanGriffith suggested using a file-based approach, where each file would be enqueued (along with some metadata) and the gitlab-elasticsearch-indexer would support per-file processing (which goes against Gitaly's per-repository access model, but that could be figured out).
Single threaded queue
The problems above could mostly be alleviated if the queue could be processed concurrently, as multiple cron workers could be run to process the queue, removing the need for a lock.
But there's a catch: our current ZSET implementation doesn't support concurrency. I think this is a significant caveat of the current approach, as we are currently seeing (in this issue predominantly) that vertical scaling is much harder than horizontal scaling.
Possible solution
Use a Working Set approach where each worker is assigned a working set. The Working Set represents a transient space where the jobs to be processed are stored, to ensure the reliability of the queue. It should be filled atomically from the Processing Queue.
Each worker is responsible for the housekeeping of its Working Set, re-enqueuing failures to the Processing Queue. Alternatively, a Broker process could manage populating the Working Set and do the housekeeping.
Items in the Processing Queue should be unique, to keep with our current de-duplicating logic.
For instance, we could use the following Lua scripts to achieve that in Redis (these are simply POC and most probably need further tweaking):
LPUSHNX.lua
```lua
-- Push elements to a List only if they don't already exist.
-- Complexity O(N)
local items = {}

for i, item in ipairs(ARGV) do
  if redis.call("SISMEMBER", KEYS[1] .. ':set', item) == 0 then
    redis.call("SADD", KEYS[1] .. ':set', item)
    redis.call("RPUSH", KEYS[1], item)
    table.insert(items, item)
  end
end

return items
```
LRANGENX.lua
```lua
-- Moves a RANGE of elements to a working set `KEY:wset` and returns
-- the list of elements.
--
-- Complexity O(N*log(N))
local items = {}
local at = redis.call("TIME")

for i = 0, ARGV[1] do
  local item = redis.call("LPOP", KEYS[1])

  if not item then
    return items
  end

  redis.call("SREM", KEYS[1] .. ':set', item)
  redis.call("ZADD", KEYS[1] .. ':wset', i, item)
  table.insert(items, item)
end

return items
```
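A sketch of how a worker might drive those scripts from Ruby via EVAL (script paths, key names, and the processing step are assumptions, not part of the MR):

```ruby
# The scripts above, loaded from wherever they would live in the repo.
LPUSHNX  = File.read(Rails.root.join('lpushnx.lua'))
LRANGENX = File.read(Rails.root.join('lrangenx.lua'))

QUEUE = 'elastic:processing_queue' # the scripts derive :set / :wset from this key

Gitlab::Redis::SharedState.with do |redis|
  # Enqueue: only project ids that aren't already queued get pushed.
  redis.eval(LPUSHNX, keys: [QUEUE], argv: [project.id])

  # Worker: atomically move up to 1000 jobs into the working set, then process them.
  batch = redis.eval(LRANGENX, keys: [QUEUE], argv: [1000])

  batch.each do |project_id|
    index_project(project_id)                # hypothetical processing step
    redis.zrem("#{QUEUE}:wset", project_id)  # housekeeping once the job succeeds
  end
end
```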
State of this work
In the end, we are at a crossroads on whether this work is valuable and justified right now. The current solution, the ElasticCommitIndexerWorker, still does the job and supports concurrency out of the box via Sidekiq scheduling.
This is why we (@DylanGriffith and @mbergeron) are suggesting to take a hiatus on this work, and revisit later if we really need this.
This decision is not easy to make, as I've worked on this for a long time now, but taking a step back might be a good idea.
[1] It is impossible to evaluate the processing time for a Project ahead of time, as there are many factors that might affect it. This is in contrast with the way the BulkIndexer works, as it enqueues individual entities that will each result in a single document update.