Zoekt: deduplicate tasks

Background

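Currently, the queue contains many duplicate tasks. In production, 326198 pending tasks cover only 20664 distinct project identifiers, which is about 16 pending tasks per project on average: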

[ gprd ] production> Search::Zoekt::Task.pending.for_processing.count
=> 326198
[ gprd ] production> Search::Zoekt::Task.pending.for_processing.pluck(:project_identifier).to_set.count
=> 20664

Proposal

I believe we need to deduplicate tasks at creation time. We haven't pursued this before because of one corner case: if a task has already been sent to the node, we must still keep the next task for the same repository so that updates made after that point get indexed.

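We could add a new state, processing, to Task and update the IndexingTaskService so that it creates a new task only when there is no pending task for the same project identifier and task_type. A task that has already moved to processing would not block creation, which covers the corner case above.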

For example, in https://gitlab.com/gitlab-org/gitlab/-/blob/2e5860e05b1fb33db1c9c6bf98e409c3b7c87b2f/ee/app/models/search/zoekt/repository.rb#L33-39

          return if item.tasks.pending.exists?(zoekt_node_id: zoekt_index.zoekt_node_id, task_type: task_type)
          item.tasks.create!(zoekt_node_id: zoekt_index.zoekt_node_id, task_type: task_type, perform_at: perform_at)
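The intended behavior, including the corner case, can be sketched without Rails as a small in-memory model. This is only an illustration under assumptions: TaskQueue, enqueue, and dispatch are hypothetical names, not part of the GitLab codebase, and the duplicate check here keys on project identifier and task type rather than on zoekt_node_id as in the snippet above.

```ruby
# In-memory sketch of the proposed deduplication; not the actual GitLab models.
# TaskQueue, enqueue and dispatch are hypothetical names used for illustration.
Task = Struct.new(:project_identifier, :task_type, :state)

class TaskQueue
  attr_reader :tasks

  def initialize
    @tasks = []
  end

  # Create a task unless a pending one already exists for the same
  # project identifier and task type. Tasks in :processing do not block
  # creation, so updates that arrive after dispatch still get indexed.
  def enqueue(project_identifier, task_type)
    duplicate = @tasks.any? do |t|
      t.state == :pending &&
        t.project_identifier == project_identifier &&
        t.task_type == task_type
    end
    return nil if duplicate

    task = Task.new(project_identifier, task_type, :pending)
    @tasks << task
    task
  end

  # Mark every pending task as sent to the node and return those tasks.
  def dispatch
    pending = @tasks.select { |t| t.state == :pending }
    pending.each { |t| t.state = :processing }
    pending
  end
end

queue = TaskQueue.new
queue.enqueue(1, :index_repo) # created
queue.enqueue(1, :index_repo) # skipped: a pending task already exists
queue.dispatch                # the first task moves to :processing
queue.enqueue(1, :index_repo) # created: the earlier task is already with the node
```

In the real models the check and the insert would run against the database, so two concurrent callers could both pass the exists? check. Whether that needs a lock or a unique index is left open here.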

Note: We also need to update:

Edited by Dmitry Gruzd