Add knowledge graph tasks
What does this MR do and why?
Adds KnowledgeGrap::Task model which will be used for executing tasks specific to knowledge graph.
This model is similar to Zoekt::Task model (used for executing Zoekt-related tasks), so most of shared logic is moved to a shared module. Another approach we considered was re-use of Zoekt::Task also for knowledge graph tasks, but usage of separate model has some benefits:
- better isolation of both "services"
- cleaner DB implementation - zoekt tasks are currently coupled with zoekt repository and project (project is used as a sharding key). Knowledge graph tasks are associated with knowledge graph replicas and project's namespace.
- better control when scheduling batch of tasks to be executed on a node - we can combine both types of tasks in whatever ratio
classDiagram
namespace ZoektModels {
class Node
class Index
class Repository
class ZoektTask
class EnabledNamespace
class Replica
}
namespace KnowledgeGraphModels {
class KnowledgeGraphEnabledRepository
class KnowledgeGraphReplica
class KnowledgeGraphTask
}
Node "1" --> "*" ZoektTask : has_many tasks
Node "1" --> "*" KnowledgeGraphTask : has_many graph tasks
Node "1" --> "*" Index : has_many indices
EnabledNamespace "1" --> "*" Replica : has_many replicas
Replica "1" --> "*" Index : has_many indices
Index "1" --> "*" Repository : has_many repositories
Repository "1" --> "*" ZoektTask : has_many tasks
KnowledgeGraphEnabledRepository "1" --> "*" KnowledgeGraphReplica : has_many replicas
KnowledgeGraphReplica "1" --> "*" KnowledgeGraphTask : has_many tasks
Node "1" --> "*" KnowledgeGraphReplica : has_many graph replicas
Content of a related ephemeral slack thread
jprovaznik
Tuesday at 10:31
@dgruzd
hi :wave:, before investing more time to either of solutions discussed in https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/13104#note_2481463254, just wanted to double check with you: you would still rather prefer to re-use Zoekt::Task also for knowledge graph tasks, correct?
NEW
35 replies
dgruzd
Tuesday at 10:37
:wave: That would be my recommendation. I think the queue is generic enough for us to re-use for the knowledge graph. We could add a JSONB column, let's say zoekt_tasks.payload to store everything that's needed to build the task json for the indexer.
Another important thing is that indexer has a concurrency setting, and it only requests N tasks from the monolith. We don't want to cause CPU starvation by requesting more tasks than that.
The only potential issue is that we're using zoekt_tasks.project_identifier as our sharding key. We could switch to namespace_identifier though
10:38
We could also discuss having a separate queue if you'd like. We'd need to figure out how to merge these before sending it to the indexer though
jprovaznik
Tuesday at 10:41
I don't have strong preference, my only concern with re-using Zoekt:Task is that existing logic is coupled with zoekt_repository.
We could also discuss having a separate queue if you'd like. We'd need to figure out how to merge these before sending it to the indexer though
well, this particular task would be pretty simple - we can just merge both lists as in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/189941/diffs#22ed62ac3f59582111714856936cb41715a4df69_26_27
10:42
But I can try re-use Zoekt:Task and see how far I get with this
dgruzd
Tuesday at 10:43
I don't have strong preference, my only concern with re-using Zoekt:Task is that existing logic is coupled with zoekt_repository.
We could move zoekt_repository_id to the new payload JSONB column to remove this limitation. I think the lack of namespace_identifier might be more important (edited)
10:45
But I can try re-use Zoekt:Task and see how far I get with this
I'm ok with both approaches. Whatever you prefer. You'll have more control if it's your own queue, but we'd need to have a bit of duplication.
A plus would be that the only shared DB model would be Search::Zoekt::Node
10:45
well, this particular task would be pretty simple - we can just merge both lists as in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/189941/diffs#22ed62ac3f59582111714856936cb41715a4df69_26_27
We'd need to respect the concurrency limit. With the current implementation we have concurrency_limit * 2
10:46
We can schedule a brainstorming session to discuss this synchronously if you'd like
jprovaznik
Tuesday at 10:56
I think the lack of namespace_identifier might be more important
well, for now I could still get project_identifier for knowledge graph through project namespace, but that would be rather a temporary soltuion
dgruzd
Tuesday at 10:57
Another potential benefit is that we could control queues separately. Let's say only disable Zoekt or only disable the knowledge graph
:point-up:
1
10:57
I'm starting to lean towards a separate queue :slightly_smiling_face:
10:58
We have Pause indexing setting, which stops sending tasks completely. Knowledge Graph could have its own setting
jprovaznik
Tuesday at 10:59
We'd need to respect the concurrency limit. With the current implementation we have concurrency_limit * 2
how is this limit comupted? it's just an estimation how many repos we can index at the same time on the same machine? Because knowledge graph tasks may have different demands (both for creating and for copying existing DB from other node)
dgruzd
Tuesday at 11:00
We have Indexing CPU to tasks multiplier (default 1.0). We multiply the number of CPUs by that multiplier and get the concurrency
:ack:
1
jprovaznik
Tuesday at 11:00
I guess for indexing a knowledge graph, it might be roughly same. Though copying a graph would be rather I/O operation
dgruzd
Tuesday at 11:01
I'm thinking maybe we could do a zip merge. One task from Zoekt, one task from Knowledge graph, etc..
11:01
If Zoekt is disabled, we only take the knowledge graph ones
jprovaznik
Tuesday at 11:02
yea. Another benefit of separate queues would be we can better control how many of each we schedule
:nod2:
1
11:04
ok, cool, I think a separate queue might a little bit better, so I will give it a try first
dgruzd
Tuesday at 11:04
You'd need to have your own callback logic for a separate queue though
11:04
The question is whether we want to re-use the Zoekt callback API
11:05
Probably :slightly_smiling_face:
jprovaznik
Tuesday at 11:05
yes - for callbacks, task's paylod will have to contain what type of task it is - something like this: ee/app/services/search/zoekt/callback_service.rb
11:05
I meant ee/app/services/search/zoekt/callback_service.rb
11:05
https://gitlab.com/gitlab-org/gitlab/-/merge_requests/189941/diffs#ed26e805a5b4168cc46b603ff42bf2a6f1b0de7a_31_31
dgruzd
Tuesday at 11:06
That makes sense. Maybe we can refactor it a bit to call more specific callback services to keep the top-level class as lean as possible
jprovaznik
Tuesday at 11:07
yes
:handshake:
1
dgruzd
Tuesday at 11:08
Maybe we could add another top-level field in the heartbeat API for the knowledge graph tasks https://gitlab.com/gitlab-org/gitlab/-/blob/20830b0744880eec8aa180ae83801d56f3aa30f8/ee/lib/api/internal/search/zoekt.rb#L97 (edited)
11:09
We'd calculate the concurrency and send the appropriate number of tasks to both engines
11:10
That might make the indexer implementation a bit more straightforward
11:12
One note: I think that knowledge graph should respect node watermark levels (edited)
jprovaznik
Tuesday at 11:14
yes, watermark is tracked as part of zoekt nodes, right?
dgruzd
Tuesday at 11:14
On the Zoekt side we have node watermarks as well as index watermarks. The first one uses actual disk usage, the latter uses reserved storage. We'll need to think how knowledge graph fits into this
11:14
yes, watermark is tracked as part of zoekt nodes, right?
That's correct. It's per-node watermark
jprovaznik
Tuesday at 11:14
for nodes usage we just sum up both
Related to #540785 (closed)
References
DB queries
update_all for batch of tasks: https://explain.depesz.com/s/6MhO
How to set up and validate locally
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Edited by Vitali Tatarintev