feat(indexer): stream archive downloads to disk instead of buffering in memory (!656)

What does this MR do and why?

download_archive used to buffer the entire archive in memory (response.bytes().await?.to_vec()) and then clone it again for spawn_blocking. It now streams HTTP chunks through SyncIoBridge into GzDecoder + tar::Archive, so the archive is never held in memory as a whole.

Heap usage during extraction of a 50 MB archive:

Implementation	Peak heap	Relative to archive size
Old (buffered)	100.1 MB	~2.0x
New (streaming)	0.1 MB	~0.002x

A 1 GB repository would have needed ~2-3 GB of heap before. Now it stays under 1 MB regardless of archive size.

download_archive now returns a ByteStream instead of Vec<u8>, matching changed_paths and list_blobs. Inside spawn_blocking, GzDecoder + tar::Archive read the stream through SyncIoBridge, so HTTP chunks flow to disk one at a time and the full archive is never buffered.
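
The pattern can be sketched with the standard library alone. This is not the MR's code: ChunkReader and stream_to_sink are hypothetical names, a plain iterator of chunks stands in for the async ByteStream, and io::copy into a writer stands in for GzDecoder + tar::Archive unpacking to disk. The point it illustrates is that the reader only ever holds one chunk at a time.

```rust
use std::io::{self, Read, Write};

/// Hypothetical adapter: exposes an iterator of byte chunks as a blocking `Read`.
/// It plays the role SyncIoBridge plays over an async stream in the MR.
struct ChunkReader<I> {
    chunks: I,
    current: Vec<u8>, // the only buffered data: one chunk
    pos: usize,       // read offset into `current`
}

impl<I: Iterator<Item = io::Result<Vec<u8>>>> ChunkReader<I> {
    fn new(chunks: I) -> Self {
        Self { chunks, current: Vec::new(), pos: 0 }
    }
}

impl<I: Iterator<Item = io::Result<Vec<u8>>>> Read for ChunkReader<I> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Once the current chunk is drained, pull the next non-empty one.
        while self.pos == self.current.len() {
            match self.chunks.next() {
                Some(chunk) => {
                    self.current = chunk?; // propagate transport errors
                    self.pos = 0;
                }
                None => return Ok(0), // end of stream
            }
        }
        let n = buf.len().min(self.current.len() - self.pos);
        buf[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

/// Hypothetical consumer: copies the chunk stream into `sink` and returns
/// the number of bytes written. A decompressor or tar reader could wrap
/// the `ChunkReader` in the same way, since it only needs `Read`.
fn stream_to_sink<I, W>(chunks: I, sink: &mut W) -> io::Result<u64>
where
    I: Iterator<Item = io::Result<Vec<u8>>>,
    W: Write,
{
    let mut reader = ChunkReader::new(chunks);
    io::copy(&mut reader, sink)
}
```

Because the consumer depends only on Read, memory use in this sketch is bounded by the largest chunk plus the copy buffer, not by the total number of bytes streamed.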

Closes #327 (closed)

Testing

All 175 unit tests pass. The test mocks wrap their Vec<u8> archives in single-chunk streams, so they exercise the full SyncIoBridge → GzDecoder → unpack_tar path.

Performance analysis

This merge request does not introduce any performance regression. If a performance regression is expected, explain why.

Memory usage for the download path drops from O(archive_size), roughly 2-3x the archive in practice, to O(chunk_size). No CPU regression: tar extraction was already streaming.

Edited Mar 22, 2026 by Jean-Gabriel Doyon
