feat(indexer): stream archive downloads to disk instead of buffering in memory

What does this MR do and why?

download_archive used to buffer the entire archive in memory (response.bytes().await?.to_vec()) and then clone it again for spawn_blocking. It now streams HTTP chunks through SyncIoBridge into GzDecoder + tar::Archive.

Heap usage during extraction of a 50 MB archive:

Variant           Peak heap   Relative to archive size
Old (buffered)    100.1 MB    ~2.0x
New (streaming)   0.1 MB      ~0.0x

A 1 GB repository would have needed ~2-3 GB of heap before. Now it stays under 1 MB regardless of archive size.

download_archive returns a ByteStream instead of Vec<u8>, matching changed_paths and list_blobs. The stream goes through SyncIoBridge into GzDecoder + tar::Archive inside spawn_blocking, so HTTP chunks flow straight to disk with no intermediate buffer.
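
The bounded-memory property described above can be illustrated with a std-only sketch. It is not the MR's code: the real path uses SyncIoBridge with GzDecoder and tar::Archive inside spawn_blocking, whereas this sketch assumes a plain iterator of chunks in place of the async ByteStream and a generic Write sink in place of the tar extractor. ChunkReader and stream_to_sink are hypothetical names introduced for illustration.

```rust
use std::io::{self, Read, Write};

/// Hypothetical std-only stand-in for the async-to-sync bridge:
/// exposes an iterator of byte chunks as a blocking `Read`.
pub struct ChunkReader<I> {
    chunks: I,
    current: Vec<u8>, // the only chunk held in memory at any time
    pos: usize,       // read offset into `current`
    peak_held: usize, // largest chunk ever held, tracked for illustration
}

impl<I> ChunkReader<I>
where
    I: Iterator<Item = io::Result<Vec<u8>>>,
{
    pub fn new(chunks: I) -> Self {
        ChunkReader { chunks, current: Vec::new(), pos: 0, peak_held: 0 }
    }

    pub fn peak_held(&self) -> usize {
        self.peak_held
    }
}

impl<I> Read for ChunkReader<I>
where
    I: Iterator<Item = io::Result<Vec<u8>>>,
{
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Pull the next chunk only once the current one is fully consumed,
        // replacing (and so freeing) the previous chunk.
        while self.pos == self.current.len() {
            match self.chunks.next() {
                Some(chunk) => {
                    self.current = chunk?;
                    self.pos = 0;
                    self.peak_held = self.peak_held.max(self.current.len());
                }
                None => return Ok(0), // end of stream
            }
        }
        let n = buf.len().min(self.current.len() - self.pos);
        buf[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

/// Drains `chunks` into `sink` and returns (bytes written, largest chunk held).
/// In the MR the sink role is played by the gzip decoder and tar extractor.
pub fn stream_to_sink<I, W>(chunks: I, sink: &mut W) -> io::Result<(u64, usize)>
where
    I: Iterator<Item = io::Result<Vec<u8>>>,
    W: Write,
{
    let mut reader = ChunkReader::new(chunks);
    let written = io::copy(&mut reader, sink)?;
    Ok((written, reader.peak_held()))
}
```

Under these assumptions the reader never holds more than one chunk, so its footprint depends on chunk size rather than on the total number of bytes streamed.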

Closes #327

Testing

All 175 unit tests pass. The test mocks wrap their Vec<u8> archives in single-chunk streams, so they exercise the full SyncIoBridge → GzDecoder → unpack_tar path.
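
The mock shape described above might look roughly like the following. This is a std-only sketch that assumes iterators in place of the async ByteStream used by the real mocks, and both helper names are hypothetical. The second helper is an optional extra, not something this MR adds, for cases where reads that span chunk boundaries should also be covered.

```rust
use std::io;

/// Hypothetical mock helper: the whole archive as a stream of exactly one chunk.
pub fn single_chunk_stream(archive: Vec<u8>) -> impl Iterator<Item = io::Result<Vec<u8>>> {
    std::iter::once(Ok(archive))
}

/// Hypothetical variant: the same bytes split into `chunk_size`-byte pieces
/// (the last piece may be shorter), to exercise reads across chunk boundaries.
pub fn chunked_stream(
    archive: Vec<u8>,
    chunk_size: usize,
) -> impl Iterator<Item = io::Result<Vec<u8>>> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    // Collect owned chunks up front; acceptable for small test fixtures.
    let chunks: Vec<Vec<u8>> = archive.chunks(chunk_size).map(|c| c.to_vec()).collect();
    chunks.into_iter().map(Ok)
}
```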

Performance analysis

  • This merge request does not introduce any performance regression. If a performance regression is expected, explain why.

Memory usage for the download path drops from O(archive_size), roughly 2-3x the archive in practice, to O(chunk_size). No CPU regression: tar extraction was already streaming.

Edited by Jean-Gabriel Doyon
