perf(indexer): skip non-parsable files during archive extraction
What does this MR do and why?
When the indexer downloads a repository, it currently extracts every file from the tar archive to disk and only filters non-source files later, when the parser walks the directory. With several repositories indexing concurrently, this fills up disk with images, lockfiles, generated assets, vendored dependencies, and oversized blobs that the parser will silently throw away seconds later.
This MR moves the "is this file worth keeping" decision upstream, into the archive extractor. Each tar entry is now checked against the same filter the parser uses. Files the parser would skip never touch disk in the first place.
How it works
A new helper is_parsable(path) lives next to the language registry in code-graph. It returns true only when:
- the path has an extension a registered language claims (Ruby, Python, JS/TS, Go, Rust, Java, Kotlin, C#, plus JS-family extras like `.vue`/`.graphql`/`.json`),
- the path does not match any per-language exclude suffix (today: `*.min.js`, `*_test.go`).
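
The two conditions above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: `language_for_extension` and `exclude_suffixes` are hypothetical stand-ins for the real language registry, and the extension list is abbreviated to one extension per language plus the extras named above.

```rust
use std::path::Path;

/// Hypothetical stand-in for the language registry: maps a file
/// extension to the language that claims it (abbreviated list).
fn language_for_extension(ext: &str) -> Option<&'static str> {
    match ext {
        "rb" => Some("ruby"),
        "py" => Some("python"),
        "js" | "ts" | "vue" | "graphql" | "json" => Some("javascript"),
        "go" => Some("go"),
        "rs" => Some("rust"),
        "java" => Some("java"),
        "kt" => Some("kotlin"),
        "cs" => Some("csharp"),
        _ => None,
    }
}

/// Hypothetical per-language exclude suffixes; only the two
/// suffixes named in this description are listed.
fn exclude_suffixes(language: &str) -> &'static [&'static str] {
    match language {
        "javascript" => &[".min.js"],
        "go" => &["_test.go"],
        _ => &[],
    }
}

/// Returns true only if a language claims the extension and the
/// file name matches none of that language's exclude suffixes.
pub fn is_parsable(path: &Path) -> bool {
    let Some(name) = path.file_name().and_then(|n| n.to_str()) else {
        return false;
    };
    let Some(language) = path
        .extension()
        .and_then(|e| e.to_str())
        .and_then(|ext| language_for_extension(ext))
    else {
        // No extension, or no registered language claims it.
        return false;
    };
    !exclude_suffixes(language)
        .iter()
        .any(|suffix| name.ends_with(*suffix))
}
```

In this sketch a path with no extension (for example `Makefile`) is rejected by the first lookup, so the exclude-suffix scan only runs for files a language actually claims.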
The same helper is now consulted in two places:
- The repository cache, before each tar entry is unpacked. It also rejects entries whose declared size exceeds the configured `max_file_size_bytes`.
- The parser's directory walk. The two stages were already meant to make the same decision, but the logic was duplicated; now there is a single source of truth, so they can never drift.
Symlinks and directories still extract unconditionally — symlinks cost nothing, and the directory tree is built lazily anyway.
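
As a rough sketch of the per-entry decision in the repository cache, assuming a simplified entry model: `EntryKind` and `should_extract` are hypothetical names used only for illustration, and the filter is passed in as a closure so the example stays self-contained.

```rust
use std::path::Path;

/// Hypothetical, simplified view of a tar entry's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntryKind {
    Regular,
    Symlink,
    Directory,
}

/// Decides whether an archive entry should be written to disk.
/// `is_parsable` stands in for the shared filter described above.
pub fn should_extract(
    kind: EntryKind,
    path: &Path,
    declared_size: u64,
    max_file_size_bytes: u64,
    is_parsable: impl Fn(&Path) -> bool,
) -> bool {
    match kind {
        // Symlinks and directories are extracted unconditionally.
        EntryKind::Symlink | EntryKind::Directory => true,
        // Regular files must fit the size cap and pass the filter.
        // An entry exactly at the limit is kept; only larger ones are rejected.
        EntryKind::Regular => declared_size <= max_file_size_bytes && is_parsable(path),
    }
}
```

Checking the declared size first means an oversized entry is rejected without consulting the filter at all.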
Behavior matrix
| Archive entry | Before | After |
|---|---|---|
| `src/app.rs` | extracted, parsed | extracted, parsed |
| `assets/logo.png` | extracted, ignored on walk | skipped at extraction |
| `Cargo.lock`, `yarn.lock` | extracted, ignored on walk | skipped at extraction |
| `README.md`, `docs/*.md` | extracted, ignored on walk | skipped at extraction |
| `vendor/jquery.min.js` | extracted, ignored on walk | skipped at extraction |
| `pkg/server_test.go` | extracted, ignored on walk | skipped at extraction |
| `huge-generated.sql` (above size limit) | extracted, ignored on walk | skipped at extraction |
| symlink → `src/lib.rs` | extracted | extracted |
Expected impact
Most repositories carry far more non-source bytes than source bytes — generated artifacts, fixtures, vendored dependencies, binary assets. The on-disk footprint per concurrently indexed repository should drop substantially, easing the disk pressure observed when several indexers run in parallel. Parser behavior is unchanged: the parser walks a smaller tree and produces the same graph it did before.
Testing
Added unit coverage for the new behavior:
- `is_parsable` recognises supported extensions, rejects unsupported ones, respects exclude suffixes, and handles paths with no extension.
- The archive extractor consults the filter for regular files, honours the size cap, leaves symlinks alone, and never writes filtered files to disk.
- The repository cache wires the parser filter and size limit end-to-end (PNG, lockfile, README, oversize, and `*.min.js` entries are all dropped before they hit disk).
Existing extraction tests (path traversal protection, symlink containment, archive-root stripping, empty/truncated body classification) still pass unchanged — the filter sits after those checks.
Full workspace cargo clippy and cargo test are green.
Related Issues
Performance Analysis
- This merge request does not introduce any performance regression. The new filter costs two lookups per archive entry (an extension → language hash lookup and an exclude-suffix scan) and, by skipping disk writes for filtered files, performs strictly less I/O than before.