ActiveContext code: indexing service and indexer: streamed reading

The following discussion from !195308 (merged) should be addressed:


@dgruzd:

issue: The original plan was to stream the IDs. With the current implementation we load the entire output in memory. It could be 80K+ IDs for the initial indexing of gitlab-org/gitlab. My hope was that we'd process these IDs as we receive (stream) them.

Note: With the current implementation of the streamer in chunk mode, nothing prevents a section from being repeated. I think we should support that too. We could add named sections if we want to make them easier to detect.

Here's an example of the output that's possible in theory:

--section-start--
id
hash123
hash456
--section-start--
version,build_time
v5.6.0-16-gb587744-dev,2025-06-24-0800 UTC
--section-start--
id
hash789
hash910

@maddievn:

@dgruzd I'm leaning towards making the streaming a separate MR. Our current process helpers don't support streaming (at least from what I can tell), so we would have to introduce it. IMO this is a risky thing to do and should go through thorough reviews.

I implemented something in

https://gitlab.com/gitlab-org/gitlab/-/compare/545939-repository-index-worker...545939-streaming?from_project_id=278964

And it works well. I updated the indexer to stream ids with a delay:

Success: [image: success]

When a timeout is reached: [image: timeout]

That being said, I have concerns with the I/O and threads so would like to properly implement tests. What do you think?


@dgruzd:

@maddievn I think it's ok if we implement it in a follow-up 🤝



JTBD

Change ee/app/services/ai/active_context/code/indexer.rb from reading all output from the elasticsearch-indexer at once to using streaming as @dgruzd is suggesting in the thread above.
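To make the target behaviour concrete, here is a minimal sketch of how the sectioned output from the thread above could be consumed one line at a time, including repeated sections. The method name `each_section_row` and the constant `SECTION_MARKER` are hypothetical and do not exist in the codebase; the format is assumed to be exactly what the example shows (a marker line, then a comma-separated header line, then data rows).

```ruby
require 'stringio'

# Hypothetical helper, not part of the GitLab codebase.
SECTION_MARKER = '--section-start--'

# Yields [header_fields, row_fields] for every data row as soon as the
# line is read. `io` is anything that responds to #each_line.
def each_section_row(io)
  return enum_for(:each_section_row, io) unless block_given?

  header = nil
  expect_header = false

  io.each_line do |line|
    line = line.chomp
    next if line.empty?

    if line == SECTION_MARKER
      # The next non-empty line names the columns of the new section.
      expect_header = true
    elsif expect_header
      header = line.split(',')
      expect_header = false
    elsif header
      yield header, line.split(',')
    end
  end
end

# Usage with the example output from the thread (the `id` section repeats).
example = StringIO.new(<<~OUT)
  --section-start--
  id
  hash123
  hash456
  --section-start--
  version,build_time
  v5.6.0-16-gb587744-dev,2025-06-24-0800 UTC
  --section-start--
  id
  hash789
  hash910
OUT

ids = each_section_row(example).filter_map do |header, row|
  row.first if header == ['id']
end
```

Because rows are yielded as they are read, the caller can process or batch IDs without holding the full output in memory.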

In practice, this means switching from Gitlab::Popen.popen(command, nil, environment_variables) to a streamed I/O reader, which doesn't exist in GitLab yet.
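As a rough illustration of what such a reader could look like, the sketch below uses only the Ruby standard library (`Open3.popen2e` and `IO.select`). It is not the implementation from the commit referenced in this issue; `stream_lines`, `StreamTimeoutError`, and the `timeout:` keyword are hypothetical names, and the timeout is assumed to be a single overall deadline for the command.

```ruby
require 'open3'

# Hypothetical error class for this sketch.
StreamTimeoutError = Class.new(StandardError)

# Runs `command` (an array of strings) and yields each output line as soon
# as it is complete. stdout and stderr are merged (popen2e) so that an
# unread stderr pipe cannot block the child. Returns the Process::Status.
def stream_lines(command, env: {}, timeout: 60)
  Open3.popen2e(env, *command) do |stdin, output, wait_thr|
    stdin.close
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    buffer = String.new # binary buffer holding a partial line

    loop do
      remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
      ready = remaining.positive? && IO.select([output], nil, nil, remaining)
      unless ready
        begin
          Process.kill('KILL', wait_thr.pid)
        rescue Errno::ESRCH
          # The process has already exited.
        end
        raise StreamTimeoutError, "command did not finish within #{timeout}s"
      end

      chunk = output.read_nonblock(4096, exception: false)
      next if chunk == :wait_readable
      break if chunk.nil? # EOF

      buffer << chunk.b
      while (idx = buffer.index("\n"))
        yield buffer.slice!(0..idx).chomp.force_encoding(Encoding::UTF_8)
      end
    end

    # Flush a trailing line that had no newline terminator.
    yield buffer.force_encoding(Encoding::UTF_8) unless buffer.empty?
    wait_thr.value
  end
end

# Usage: lines arrive one at a time instead of as a single string.
received = []
status = stream_lines(['sh', '-c', 'echo hash123; echo hash456']) do |line|
  received << line
end
```

A real implementation would also need to decide how to treat stderr, whether to send TERM before KILL, and how to cover the I/O and thread behaviour with tests, which is the concern raised in the thread.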

I did an implementation of a streamed I/O reader in 1b472750 and this can be used as a base. It needs tests and a review from someone like @stanhu.

How to test the indexer

Follow these instructions: #550418 (comment 2610944159)

OR manually create EnabledNamespace and Repository records:

Ai::ActiveContext::Code::EnabledNamespace.create!(namespace: Group.last, connection_id: Ai::ActiveContext::Connection.active.id)

Ai::ActiveContext::Code::Repository.create!(project: Project.first, active_context_connection: Ai::ActiveContext::Connection.active, enabled_namespace: Ai::ActiveContext::Connection.active.enabled_namespaces.first)

Run the indexer:

repository = Ai::ActiveContext::Connection.active.repositories.first
Ai::ActiveContext::Code::IndexingService.execute(repository)