Stream repository archive downloads to avoid OOM on large repositories
## Problem
The code indexer's `full_download` path buffers the entire repository archive in memory before extracting it to disk. `GitlabClient::download_archive` calls `response.bytes().await?.to_vec()`, loading the full archive into a `Vec<u8>`. Then `LocalRepositoryCache::extract_archive` clones those bytes again (`archive_bytes.to_vec()`) to move them into a `spawn_blocking` task.
For a 1-2 GB compressed repository, this means up to three copies of the archive in memory at once: the reqwest response buffer, the `Vec<u8>` return value, and the clone for the blocking task. A single download can use 3-6 GB of memory, enough to OOM the indexer.
The tar extraction itself already streams — `GzDecoder` + `tar::Archive` iterate entries directly to disk. The problem is only in the download-to-extraction handoff.
## Proposed solution
Change `download_archive` to return a `ByteStream` (the same streaming type already used by `changed_paths` and `list_blobs`) instead of `Vec<u8>`. Pipe that stream into the tar extractor using `StreamReader` + `SyncIoBridge` inside `spawn_blocking`, so HTTP response chunks flow through gzip decompression and tar extraction to disk with no intermediate buffering.
Specifically:
- `GitlabClient::download_archive` should use the existing `into_byte_stream()` helper instead of `response.bytes().await?`
- `RepositoryService::download_archive` trait method should return `ByteStream`
- `RepositoryCache::extract_archive` should accept a `ByteStream` and bridge async-to-sync I/O with `tokio_util::io::SyncIoBridge` for the `GzDecoder`/`tar` extraction (same pattern as `clickhouse-client`'s `arrow_client.rs`)