Zoekt: deduplicate tasks
Background
Currently, the queue holds a lot of duplicate tasks: roughly 326,000 pending tasks cover only about 20,700 distinct projects.
[ gprd ] production> Search::Zoekt::Task.pending.for_processing.count
=> 326198
[ gprd ] production> Search::Zoekt::Task.pending.for_processing.pluck(:project_identifier).to_set.count
=> 20664
Proposal
I believe that we need to deduplicate tasks at creation time. We haven't pursued that before because of one corner case: if a task has already been sent to the node, we still need to keep the next task for the same repository so that later updates get indexed.
We could add a new processing state to the Task and update the IndexingTaskService to create a new task only when no pending task exists for the same project identifier and task_type.
For example, in https://gitlab.com/gitlab-org/gitlab/-/blob/2e5860e05b1fb33db1c9c6bf98e409c3b7c87b2f/ee/app/models/search/zoekt/repository.rb#L33-39
return if item.tasks.pending.exists?(zoekt_node_id: zoekt_index.zoekt_node_id, task_type: task_type)
item.tasks.create!(zoekt_node_id: zoekt_index.zoekt_node_id, task_type: task_type, perform_at: perform_at)
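The corner case above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not the actual ActiveRecord code: the Task struct, the TaskQueue class, the send_to_node method, and the :index_repo value are hypothetical stand-ins. It shows a duplicate being skipped while the first task is still pending, and a new task being accepted once the first one has moved to processing.

```ruby
# In-memory sketch of create-time deduplication; NOT the real model.
# Task, TaskQueue and send_to_node are hypothetical names for illustration.
Task = Struct.new(:project_identifier, :zoekt_node_id, :task_type, :state)

class TaskQueue
  attr_reader :tasks

  def initialize
    @tasks = []
  end

  # Create a task unless an identical one is still pending.
  # Returns the new task, or nil when the request was deduplicated.
  def enqueue(project_identifier:, zoekt_node_id:, task_type:)
    duplicate = @tasks.any? do |t|
      t.state == :pending &&
        t.project_identifier == project_identifier &&
        t.zoekt_node_id == zoekt_node_id &&
        t.task_type == task_type
    end
    return nil if duplicate

    task = Task.new(project_identifier, zoekt_node_id, task_type, :pending)
    @tasks << task
    task
  end

  # Simulate handing pending tasks to the node: they become processing,
  # so they no longer block the creation of a follow-up task.
  def send_to_node
    @tasks.select { |t| t.state == :pending }.each { |t| t.state = :processing }
  end
end

queue = TaskQueue.new
first  = queue.enqueue(project_identifier: 1, zoekt_node_id: 7, task_type: :index_repo)
second = queue.enqueue(project_identifier: 1, zoekt_node_id: 7, task_type: :index_repo) # skipped
queue.send_to_node
third  = queue.enqueue(project_identifier: 1, zoekt_node_id: 7, task_type: :index_repo) # kept
```

Because only pending tasks count as duplicates, an update that arrives while the node is already working on the repository still gets its own task.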
Note: We also need to update:
- partitioned_by in Search::Zoekt::Task
- https://gitlab.com/gitlab-org/gitlab/-/blob/3bcad7cec1dccf668c8b5842850122c6baa29ddc/ee/app/models/search/zoekt/task.rb#L64 to also send processing statuses. We might want to add a new scope for where(state: [:pending, :processing])
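As a rough illustration of the combined filter, here is a plain-Ruby stand-in for such a scope. The names TaskRecord, ACTIVE_STATES, and pending_or_processing are hypothetical, and the :done and :failed states are only example values. In the real model this would presumably be an ActiveRecord scope rather than a method over an array.

```ruby
# Plain-Ruby stand-in for the suggested scope; all names are hypothetical.
TaskRecord = Struct.new(:id, :state)

# States that should still be reported for a node.
ACTIVE_STATES = %i[pending processing].freeze

# Equivalent in spirit to where(state: [:pending, :processing]).
def pending_or_processing(tasks)
  tasks.select { |t| ACTIVE_STATES.include?(t.state) }
end

tasks = [
  TaskRecord.new(1, :pending),
  TaskRecord.new(2, :processing),
  TaskRecord.new(3, :done),
  TaskRecord.new(4, :failed)
]
active_ids = pending_or_processing(tasks).map(&:id)
```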
Edited by Dmitry Gruzd