Improve indexing using FindChangedPaths and ListBlobs
What does this MR do and why?
Added a new parameter `OptimizedPerformance` to the `taskRequestResponse` struct. Rails sends the `optimized_performance` attribute to the indexer depending on the value of the feature flag `zoekt_optimized_performance_indexing`. If `optimized_performance` is `true`, the indexer uses the `EachFileChangeOptimizedPerformance` function to index.
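A minimal sketch of how such a flag could be decoded and acted on. The `taskRequest` type, the `chooseStrategy` helper and the `"existing path"` label are hypothetical names used only for illustration; only `optimized_performance` and `EachFileChangeOptimizedPerformance` come from this MR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskRequest is a hypothetical, trimmed-down stand-in for the indexer's
// task payload; only the flag discussed above is modelled.
type taskRequest struct {
	OptimizedPerformance bool `json:"optimized_performance"`
}

// chooseStrategy decodes the payload and reports which indexing path
// would be taken. A missing attribute decodes to false.
func chooseStrategy(payload []byte) (string, error) {
	var req taskRequest
	if err := json.Unmarshal(payload, &req); err != nil {
		return "", err
	}
	if req.OptimizedPerformance {
		return "EachFileChangeOptimizedPerformance", nil
	}
	return "existing path", nil
}

func main() {
	for _, payload := range []string{`{"optimized_performance": true}`, `{}`} {
		strategy, err := chooseStrategy([]byte(payload))
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Println(payload, "->", strategy)
	}
}
```

Because the zero value of the field is `false`, a payload without the attribute falls back to the existing behaviour in this sketch.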
There are two main improvements in `EachFileChangeOptimizedPerformance`.
- Using `FindChangedPathsRequest` instead of `GetRawChangesRequest`. With `GetRawChangesRequest` we discard the diffs and use only the paths. `FindChangedPathsRequest` returns just the paths, which removes the overhead of calculating diffs.
- Collect all the `blob_id` values from `FindChangedPathsRequest`. Create a hashmap with `blob_id` as the key and the slice of `paths` as the value. Call `ListBlobs` using the `blob_id` values as the `revisions` in the `ListBlobsRequest`. Call `put` for each `path` that corresponds to the revision.
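The second improvement can be sketched roughly as follows. This is an illustration under assumptions, not the code in this MR: `changedPath`, `groupPathsByBlobID`, `blobIDs` and `putAll` are hypothetical names, and the Gitaly calls are replaced by plain in-memory data.

```go
package main

import (
	"fmt"
	"sort"
)

// changedPath is a hypothetical stand-in for one entry returned by
// FindChangedPaths: a file path plus the ID of the blob it points to.
type changedPath struct {
	Path   string
	BlobID string
}

// groupPathsByBlobID builds the blob_id -> paths map described above.
// Several paths can share one blob ID when their contents are identical.
func groupPathsByBlobID(changes []changedPath) map[string][]string {
	byBlob := make(map[string][]string, len(changes))
	for _, c := range changes {
		byBlob[c.BlobID] = append(byBlob[c.BlobID], c.Path)
	}
	return byBlob
}

// blobIDs returns the map keys in sorted order; these are the values that
// would be sent as the revisions of a ListBlobs request.
func blobIDs(byBlob map[string][]string) []string {
	ids := make([]string, 0, len(byBlob))
	for id := range byBlob {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// putAll calls put once for every path that maps to the given blob ID.
func putAll(byBlob map[string][]string, blobID string, content []byte, put func(path string, content []byte)) {
	for _, p := range byBlob[blobID] {
		put(p, content)
	}
}

func main() {
	changes := []changedPath{
		{Path: "README.md", BlobID: "aaa"},
		{Path: "docs/README.md", BlobID: "aaa"}, // same content, same blob
		{Path: "main.go", BlobID: "bbb"},
	}
	byBlob := groupPathsByBlobID(changes)
	for _, id := range blobIDs(byBlob) {
		// In the real flow the content would come from the ListBlobs response.
		putAll(byBlob, id, []byte("content of "+id), func(path string, content []byte) {
			fmt.Printf("put %s (%d bytes)\n", path, len(content))
		})
	}
}
```

Keeping a slice of paths per blob ID means each blob only needs to be fetched once, even when several files share the same content.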
The `IndexBatchSize` is set to `10000`. With a much higher batch size like `30000`, I was getting an error like this: `argument list too long, stderr: \"\""}`. The `10000` batch size should be pretty safe in all cases. I have verified that with the `10000` batch size the size in bytes is much less than `4MB`. With `SHA1` it runs fine with a `batch_size` of `20000`, so I am assuming that a `batch_size` of `10000` will be fine with `SHA256`.
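A rough back-of-the-envelope check of that assumption, plus a sketch of splitting IDs into batches. It assumes the IDs are sent as hex strings (40 characters for SHA-1, 64 for SHA-256) and ignores any per-item framing or separators; `revisionBytes` and `batches` are hypothetical helpers, not functions from the indexer.

```go
package main

import "fmt"

// Hex-encoded object ID lengths: SHA-1 is 20 bytes (40 hex characters),
// SHA-256 is 32 bytes (64 hex characters).
const (
	sha1HexLen   = 40
	sha256HexLen = 64
)

// revisionBytes estimates the bytes taken by the IDs alone.
func revisionBytes(batchSize, hexLen int) int {
	return batchSize * hexLen
}

// batches splits ids into consecutive chunks of at most size elements.
func batches(ids []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		out = append(out, ids[start:end])
	}
	return out
}

func main() {
	fmt.Println(revisionBytes(20000, sha1HexLen))   // 800000
	fmt.Println(revisionBytes(10000, sha256HexLen)) // 640000

	ids := make([]string, 25000)
	for _, b := range batches(ids, 10000) {
		fmt.Println(len(b)) // 10000, 10000, 5000
	}
}
```

Under these assumptions, 10000 SHA-256 IDs (640000 bytes) take less space than the 20000 SHA-1 IDs (800000 bytes) that already ran fine.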
After reindexing with optimized performance, I did some spot-searching, and the results were the same, which confirms that the indexing completed successfully.
Screenshots
| Before | After |
|---|---|
| ![]() | ![]() |
How to set up and validate locally
- Make sure your computer has more than `80%` of free storage. If not, temporarily set this constant to `1`: https://gitlab.com/gitlab-org/gitlab/-/blob/a17e4fe062a507c2892bd06f3b6287bdd31a3e89/ee/app/models/search/zoekt/node.rb#L14
- Create a repository locally with at least 100 files.
- Turn on the FF: `Feature.enable(:zoekt_optimized_performance_indexing)`
- Tail the log `gitlab-zoekt-indexer-development-1` or maybe `gitlab-zoekt-indexer-development-2`. I am not sure which one, but you can tail both logs and see where the output appears.
- Perform an indexing in the rails console (replace `project_id`): `Search::Zoekt::IndexingTaskService.execute project_id, :force_index_repo`
- Observe the `indexTime` in the log.
- Turn off the FF `zoekt_optimized_performance_indexing`. Wait for about `~30s`, because the FF is cached.
- Perform the same process.
- Observe that the `indexTime` is much higher in the second case.
Related: gitlab#487328 (closed)