Add knowledge graph tasks (!192114) · Merge requests · GitLab.org / GitLab

What does this MR do and why?

Adds KnowledgeGrap::Task model which will be used for executing tasks specific to knowledge graph.

This model is similar to Zoekt::Task model (used for executing Zoekt-related tasks), so most of shared logic is moved to a shared module. Another approach we considered was re-use of Zoekt::Task also for knowledge graph tasks, but usage of separate model has some benefits:

better isolation of both "services"
cleaner DB implementation - zoekt tasks are currently coupled with zoekt repository and project (project is used as a sharding key). Knowledge graph tasks are associated with knowledge graph replicas and project's namespace.
better control when scheduling batch of tasks to be executed on a node - we can combine both types of tasks in whatever ratio

classDiagram
    namespace ZoektModels {
    class Node
    class Index
    class Repository
    class ZoektTask
    class EnabledNamespace
    class Replica
    }
    namespace KnowledgeGraphModels {
    class KnowledgeGraphEnabledRepository
    class KnowledgeGraphReplica
    class KnowledgeGraphTask
    }

    Node "1" --> "*" ZoektTask : has_many tasks
    Node "1" --> "*" KnowledgeGraphTask : has_many graph tasks
    Node "1" --> "*" Index : has_many indices

    EnabledNamespace "1" --> "*" Replica : has_many replicas
    Replica "1" --> "*" Index : has_many indices
    Index "1" --> "*" Repository : has_many repositories
    Repository "1" --> "*" ZoektTask : has_many tasks

    KnowledgeGraphEnabledRepository "1" --> "*" KnowledgeGraphReplica : has_many replicas
    KnowledgeGraphReplica "1" --> "*" KnowledgeGraphTask : has_many tasks
    Node "1" --> "*" KnowledgeGraphReplica : has_many graph replicas

Content of a related ephemeral slack thread

jprovaznik
  Tuesday at 10:31
@dgruzd
 hi :wave:, before investing more time to either of solutions discussed in https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/13104#note_2481463254, just wanted to double check with you: you would still rather prefer to re-use Zoekt::Task also for knowledge graph tasks, correct?



NEW

35 replies


dgruzd
  Tuesday at 10:37
:wave: That would be my recommendation. I think the queue is generic enough for us to re-use for the knowledge graph. We could add a JSONB column, let's say zoekt_tasks.payload to store everything that's needed to build the task json for the indexer.
Another important thing is that indexer has a concurrency setting, and it only requests N tasks from the monolith. We don't want to cause CPU starvation by requesting more tasks than that.
The only potential issue is that we're using zoekt_tasks.project_identifier as our sharding key. We could switch to namespace_identifier though
10:38
We could also discuss having a separate queue if you'd like. We'd need to figure out how to merge these before sending it to the indexer though


jprovaznik
  Tuesday at 10:41
I don't have strong preference, my only concern with re-using Zoekt:Task is that existing logic is coupled with zoekt_repository.
We could also discuss having a separate queue if you'd like. We'd need to figure out how to merge these before sending it to the indexer though
well, this particular task would be pretty simple - we can just merge both lists as in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/189941/diffs#22ed62ac3f59582111714856936cb41715a4df69_26_27
10:42
But I can try re-use Zoekt:Task and see how far I get with this


dgruzd
  Tuesday at 10:43
I don't have strong preference, my only concern with re-using Zoekt:Task is that existing logic is coupled with zoekt_repository.
We could move zoekt_repository_id to the new payload JSONB column to remove this limitation. I think the lack of namespace_identifier might be more important (edited) 
10:45
But I can try re-use Zoekt:Task and see how far I get with this
I'm ok with both approaches. Whatever you prefer. You'll have more control if it's your own queue, but we'd need to have a bit of duplication.
A plus would be that the only shared DB model would be Search::Zoekt::Node
10:45
well, this particular task would be pretty simple - we can just merge both lists as in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/189941/diffs#22ed62ac3f59582111714856936cb41715a4df69_26_27
We'd need to respect the concurrency limit. With the current implementation we have concurrency_limit * 2
10:46
We can schedule a brainstorming session to discuss this synchronously if you'd like


jprovaznik
  Tuesday at 10:56
 I think the lack of namespace_identifier might be more important
well, for now I could still get project_identifier for knowledge graph through project namespace, but that would be rather a temporary soltuion


dgruzd
  Tuesday at 10:57
Another potential benefit is that we could control queues separately. Let's say only disable Zoekt or only disable the knowledge graph
:point-up:
1

10:57
I'm starting to lean towards a separate queue :slightly_smiling_face:
10:58
We have Pause indexing setting, which stops sending tasks completely. Knowledge Graph could have its own setting


jprovaznik
  Tuesday at 10:59
We'd need to respect the concurrency limit. With the current implementation we have concurrency_limit * 2
how is this limit comupted? it's just an estimation how many repos we can index at the same time on the same machine? Because knowledge graph tasks may have different demands (both for creating and for copying existing DB from other node)


dgruzd
  Tuesday at 11:00
We have Indexing CPU to tasks multiplier (default 1.0). We multiply the number of CPUs by that multiplier and get the concurrency
:ack:
1



jprovaznik
  Tuesday at 11:00
I guess for indexing a knowledge graph, it might be roughly same. Though copying a graph would be rather I/O operation


dgruzd
  Tuesday at 11:01
I'm thinking maybe we could do a zip merge. One task from Zoekt, one task from Knowledge graph, etc..
11:01
If Zoekt is disabled, we only take the knowledge graph ones


jprovaznik
  Tuesday at 11:02
yea. Another benefit of separate queues would be we can better control how many of each we schedule
:nod2:
1

11:04
ok, cool, I think a separate queue might a little bit better, so I will give it a try first


dgruzd
  Tuesday at 11:04
You'd need to have your own callback logic for a separate queue though
11:04
The question is whether we want to re-use the Zoekt callback API
11:05
Probably :slightly_smiling_face:


jprovaznik
  Tuesday at 11:05
yes - for callbacks, task's paylod will have to contain what type of task it is - something like this: ee/app/services/search/zoekt/callback_service.rb
11:05
I meant ee/app/services/search/zoekt/callback_service.rb
11:05
https://gitlab.com/gitlab-org/gitlab/-/merge_requests/189941/diffs#ed26e805a5b4168cc46b603ff42bf2a6f1b0de7a_31_31


dgruzd
  Tuesday at 11:06
That makes sense. Maybe we can refactor it a bit to call more specific callback services to keep the top-level class as lean as possible


jprovaznik
  Tuesday at 11:07
yes
:handshake:
1



dgruzd
  Tuesday at 11:08
Maybe we could add another top-level field in the heartbeat API for the knowledge graph tasks https://gitlab.com/gitlab-org/gitlab/-/blob/20830b0744880eec8aa180ae83801d56f3aa30f8/ee/lib/api/internal/search/zoekt.rb#L97 (edited) 
11:09
We'd calculate the concurrency and send the appropriate number of tasks to both engines
11:10
That might make the indexer implementation a bit more straightforward
11:12
One note: I think that knowledge graph should respect node watermark levels (edited) 


jprovaznik
  Tuesday at 11:14
yes, watermark is tracked as part of zoekt nodes, right?


dgruzd
  Tuesday at 11:14
On the Zoekt side we have node watermarks as well as index watermarks. The first one uses actual disk usage, the latter uses reserved storage. We'll need to think how knowledge graph fits into this
11:14
yes, watermark is tracked as part of zoekt nodes, right?
That's correct. It's per-node watermark


jprovaznik
  Tuesday at 11:14
for nodes usage we just sum up both

Related to #540785 (closed)

References

DB queries

update_all for batch of tasks: https://explain.depesz.com/s/6MhO

How to set up and validate locally

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited May 27, 2025 by Vitali Tatarintev

Add knowledge graph tasks

What does this MR do and why?

References

DB queries

How to set up and validate locally

MR acceptance checklist

Merge request reports