Stream repository archive downloads to avoid OOM on large repositories
Problem
The code indexer's `full_download` path buffers the entire repository archive in memory before extracting it to disk. `GitlabClient::download_archive` calls `response.bytes().await?.to_vec()`, loading the full archive into a `Vec<u8>`. Then `LocalRepositoryCache::extract_archive` clones those bytes again (`archive_bytes.to_vec()`) to move them into a `spawn_blocking` task.
For a 1-2 GB compressed repository, this means up to 3 copies of the archive in memory at once: the reqwest response buffer, the `Vec<u8>` return value, and the clone for the blocking task. A single download can use 3-4.5 GB of memory, enough to OOM the indexer.
The tar extraction itself already streams: `GzDecoder` + `tar::Archive` iterate entries directly to disk. The problem is only in the download-to-extraction handoff.
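For concreteness, a minimal sketch of the buffered path (the signatures, error handling, and surrounding context are assumptions, not the actual code):

```rust
use std::path::PathBuf;

use flate2::read::GzDecoder;

// Sketch of the current buffered handoff; names follow the issue text.
async fn download_archive(client: &reqwest::Client, url: &str) -> reqwest::Result<Vec<u8>> {
    let response = client.get(url).send().await?;
    // Copy 1: reqwest's internal response buffer. Copy 2: `.to_vec()`.
    Ok(response.bytes().await?.to_vec())
}

async fn extract_archive(archive_bytes: &[u8], dest: PathBuf) -> std::io::Result<()> {
    // Copy 3: clone the slice so it can move into the blocking task.
    let owned = archive_bytes.to_vec();
    tokio::task::spawn_blocking(move || {
        // From here on everything streams: gzip decode + tar unpack to disk.
        tar::Archive::new(GzDecoder::new(owned.as_slice())).unpack(dest)
    })
    .await
    .expect("blocking extraction task panicked")
}
```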
Proposed solution
Change `download_archive` to return a `ByteStream` (the same streaming type already used by `changed_paths` and `list_blobs`) instead of `Vec<u8>`. Pipe that stream into the tar extractor using `StreamReader` + `SyncIoBridge` inside `spawn_blocking`, so HTTP response chunks flow through gzip decompression and tar extraction to disk with no intermediate buffering.
Specifically:
- `GitlabClient::download_archive` should use the existing `into_byte_stream()` helper instead of `response.bytes().await?`
- `RepositoryService::download_archive` trait method should return `ByteStream`
- `RepositoryCache::extract_archive` should accept a `ByteStream` and bridge async-to-sync I/O with `tokio_util::io::SyncIoBridge` for the `GzDecoder`/`tar` extraction (same pattern as `clickhouse-client`'s `arrow_client.rs`); see the sketch below
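A minimal sketch of the streaming handoff, assuming `ByteStream` is a boxed stream of `Bytes` chunks (the real alias may differ), that `into_byte_stream()` wraps something like reqwest's `bytes_stream()`, and that `tokio_util` is built with its `io` and `io-util` features:

```rust
use std::path::PathBuf;
use std::pin::Pin;

use bytes::Bytes;
use flate2::read::GzDecoder;
use futures_util::{Stream, TryStreamExt};
use tokio_util::io::{StreamReader, SyncIoBridge};

// Assumed shape of the existing `ByteStream` alias.
type ByteStream = Pin<Box<dyn Stream<Item = std::io::Result<Bytes>> + Send>>;

// Download side: hand back the body as a stream of chunks instead of
// collecting it. Requires reqwest's `stream` feature.
async fn download_archive(client: &reqwest::Client, url: &str) -> reqwest::Result<ByteStream> {
    let response = client.get(url).send().await?.error_for_status()?;
    // Map reqwest errors into io::Error so StreamReader can consume the stream.
    Ok(Box::pin(response.bytes_stream().map_err(std::io::Error::other)))
}

// Extraction side: HTTP chunks -> AsyncRead -> blocking Read -> gzip -> tar -> disk.
async fn extract_archive(stream: ByteStream, dest: PathBuf) -> std::io::Result<()> {
    // StreamReader adapts the chunk stream into an AsyncRead.
    let reader = StreamReader::new(stream);
    // SyncIoBridge exposes it as a blocking std::io::Read; it must be driven
    // from a blocking thread, hence spawn_blocking.
    let bridge = SyncIoBridge::new(reader);
    tokio::task::spawn_blocking(move || tar::Archive::new(GzDecoder::new(bridge)).unpack(dest))
        .await
        .expect("blocking extraction task panicked")
}
```

Because `SyncIoBridge` blocks its thread while waiting for the next chunk, the whole gzip + tar pipeline has to live inside `spawn_blocking`; peak memory is then bounded by the decoder's working buffers plus the in-flight HTTP chunks rather than the full archive.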