Stream repository archive downloads to avoid OOM on large repositories

Problem

The code indexer's full_download path buffers the entire repository archive in memory before extracting it to disk. GitlabClient::download_archive calls response.bytes().await?.to_vec(), loading the full archive into a Vec<u8>. Then LocalRepositoryCache::extract_archive clones those bytes again (archive_bytes.to_vec()) to move them into a spawn_blocking task.

For a 1-2 GB compressed repository, this means up to 3 copies of the archive in memory at once: the reqwest response buffer, the Vec<u8> return value, and the clone for the blocking task. A single download can use 3-4.5 GB of memory, enough to OOM the indexer.
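For reference, the buffering handoff has roughly the shape sketched below. Only the two to_vec() calls are taken from the code as described above; the signatures and error handling are assumptions.

    use std::path::PathBuf;

    // Sketch of the current buffering path; signatures are assumed, the two
    // to_vec() copies are the ones described above.
    async fn download_archive(response: reqwest::Response) -> reqwest::Result<Vec<u8>> {
        // Copy 1: reqwest collects the full body in memory; copy 2: to_vec()
        // duplicates it into the returned Vec<u8>.
        Ok(response.bytes().await?.to_vec())
    }

    async fn extract_archive(archive_bytes: &[u8], dest: PathBuf) -> std::io::Result<()> {
        // Copy 3: cloned so the bytes can be moved into the blocking task.
        let owned = archive_bytes.to_vec();
        tokio::task::spawn_blocking(move || {
            let decoder = flate2::read::GzDecoder::new(owned.as_slice());
            let mut archive = tar::Archive::new(decoder);
            archive.unpack(&dest)
        })
        .await
        .expect("blocking extraction task panicked")
    }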

The tar extraction itself already streams — GzDecoder + tar::Archive iterate entries directly to disk. The problem is only in the download-to-extraction handoff.
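A minimal sketch of that extraction side, assuming the flate2 and tar crates already in use:

    use std::io::Read;

    // tar::Archive pulls bytes lazily from any Read impl, so each entry is
    // decompressed and written to disk without the whole archive ever being
    // resident in memory.
    fn extract_tar_gz<R: Read>(reader: R, dest: &std::path::Path) -> std::io::Result<()> {
        let mut archive = tar::Archive::new(flate2::read::GzDecoder::new(reader));
        archive.unpack(dest)
    }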

Proposed solution

Change download_archive to return a ByteStream (the same streaming type already used by changed_paths and list_blobs) instead of Vec<u8>. Pipe that stream into the tar extractor using StreamReader + SyncIoBridge inside spawn_blocking, so HTTP response chunks flow through gzip decompression and tar extraction to disk with no intermediate buffering.

Specifically:

  • GitlabClient::download_archive should use the existing into_byte_stream() helper instead of response.bytes().await?
  • RepositoryService::download_archive trait method should return ByteStream
  • RepositoryCache::extract_archive should accept a ByteStream and bridge async-to-sync I/O with tokio_util::io::SyncIoBridge for the GzDecoder/tar extraction (same pattern as clickhouse-client's arrow_client.rs); a sketch of the combined pipeline follows below
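Put together, the handoff would look roughly like the sketch below. It assumes ByteStream is, or can be mapped into, a stream of Result<Bytes, std::io::Error> chunks; the function name and parameters here are illustrative, not the actual trait surface.

    use std::path::PathBuf;

    use bytes::Bytes;
    use futures_util::Stream;
    use tokio_util::io::{StreamReader, SyncIoBridge};

    // Sketch of the proposed streaming handoff (illustrative names).
    async fn extract_archive_streaming(
        stream: impl Stream<Item = std::io::Result<Bytes>> + Send + Unpin + 'static,
        dest: PathBuf,
    ) -> std::io::Result<()> {
        // StreamReader adapts the chunk stream into an AsyncRead; SyncIoBridge
        // exposes that as a blocking Read, which must only be driven from a
        // blocking thread, hence spawn_blocking.
        let reader = SyncIoBridge::new(StreamReader::new(stream));
        tokio::task::spawn_blocking(move || {
            // Gzip decompression and tar extraction pull from the bridge, so
            // each HTTP chunk flows straight to disk; peak memory is bounded
            // by the decoder's buffers rather than the archive size.
            let mut archive = tar::Archive::new(flate2::read::GzDecoder::new(reader));
            archive.unpack(&dest)
        })
        .await
        .expect("blocking extraction task panicked")
    }

One detail to verify: StreamReader requires the stream's error type to convert into std::io::Error, so if ByteStream yields reqwest::Error (as reqwest's bytes_stream() does), a map_err is needed somewhere, either inside into_byte_stream() or at this call site.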