[code] Implement incremental file fetches
### Problem to Solve
Currently, code indexing requests re-fetch full repositories every time. We need to perform incremental file fetches to avoid this.
Initially, this was planned to use a [direct connection to Gitaly](https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/work_items/232), but that is not possible because GKG cannot connect to Gitaly directly.
A previous merge request put that logic in Rails, but that was rightly called out as a mistake because of the memory load it could put on Rails.
The ideal solution raised was that such requests should go through Workhorse so we can stream the result from Gitaly to GKG without putting memory pressure on any application.
The idea was brought up to build a full Gitaly proxy in Workhorse, which could be used by multiple services and would remove the need to keep extending Workhorse. However, this is not the right time to implement that solution: it would risk scope creep, since the current team has limited Gitaly, Rails and Workhorse knowledge.
### Proposed Solution
In order to meet production deadlines, we propose implementing two requests in Workhorse:
* `FindChangedPaths`, which we would use to return the files deleted and renamed between two refs.
* `ListBlobs`, which would stream the content of blobs added or changed between two refs.
This seems to be the best solution because it avoids putting memory pressure on Rails by streaming results from Gitaly to GKG through Workhorse, while keeping the scope manageable for the current team instead of building a full Gitaly proxy.
### Why not use the existing public endpoints
Since this is service-to-service communication, we cannot use the public endpoints: the request is not made in the scope of a user, but of a service which has its own shared secret. The requests made to these two endpoints will be service controlled and made for projects in a Knowledge Graph enabled namespace, which is a paid offering.
### Agent summary of the proposed flow
#### Request 1: Get deleted and renamed files between two refs
**RPC:** `DiffService.FindChangedPaths`
**Request:**
```protobuf
FindChangedPathsRequest {
repository: Repository { ... },
requests: [
Request {
tree_request: TreeRequest {
left_tree_revision: "old-commit-sha",
right_tree_revision: "new-commit-sha",
}
}
],
find_renames: true,
diff_filters: [DIFF_STATUS_DELETED, DIFF_STATUS_RENAMED],
}
```
**Response** (streamed, multiple messages):
```protobuf
FindChangedPathsResponse {
paths: [
ChangedPaths {
path: "src/removed_file.rb", // bytes
status: DELETED,
old_mode: 0o100644,
new_mode: 0,
old_blob_id: "abc123...", // OID of the deleted blob
new_blob_id: "",
},
ChangedPaths {
path: "new/path/file.txt", // the new path
old_path: "old/path/file.txt", // the original path
status: RENAMED, // 5
old_mode: 0o100644,
new_mode: 0o100644,
old_blob_id: "abc123...",
new_blob_id: "abc123...",
score: 100, // 100 = identical content
}
// ...
]
}
```
No blob content — just metadata about what was deleted and renamed.
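On the consumer side, the response above could be turned into index updates roughly as follows. This is a minimal sketch, not the actual GKG code: `ChangedPath`, `Status`, `IndexAction` and `plan_index_updates` are hypothetical names, and the struct is a hand-written simplification of the `ChangedPaths` message rather than generated protobuf code. The issue only says that deleted and renamed paths are removed from the index; treating a rename as a move is an assumption of this sketch, made because a rename with identical content keeps the same blob OID, so that blob would typically not appear in the `ListBlobs` range of Request 2.

```rust
/// The `ChangedPaths.Status` values this flow filters for; anything else is `Other`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Deleted,
    Renamed,
    Other,
}

/// Hypothetical, simplified mirror of one `ChangedPaths` entry.
/// Paths are raw bytes, as in the response above.
#[derive(Debug, Clone)]
struct ChangedPath {
    path: Vec<u8>,
    old_path: Vec<u8>,
    status: Status,
}

/// What the consumer should do to its index for one changed path (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
enum IndexAction {
    /// Drop the entry stored under this path.
    Remove(Vec<u8>),
    /// Re-key the entry from the old path to the new one.
    Move { from: Vec<u8>, to: Vec<u8> },
}

/// Maps deleted and renamed entries to index actions and skips everything else.
fn plan_index_updates(changes: &[ChangedPath]) -> Vec<IndexAction> {
    changes
        .iter()
        .filter_map(|change| match change.status {
            Status::Deleted => Some(IndexAction::Remove(change.path.clone())),
            Status::Renamed => Some(IndexAction::Move {
                from: change.old_path.clone(),
                to: change.path.clone(),
            }),
            Status::Other => None,
        })
        .collect()
}
```

A consumer that prefers the simpler "remove and re-fetch" behaviour could map `Renamed` to `Remove(old_path)` instead, provided the content for the new path is obtained some other way.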
---
#### Request 2: Get blob content for added/modified files between two refs
**RPC:** `BlobService.ListBlobs`
**Request:**
```protobuf
ListBlobsRequest {
repository: Repository { ... },
revisions: ["new-commit-sha", "--not", "old-commit-sha"],
limit: 0, // unlimited
bytes_limit: 1048576, // 1MB cap per blob, or -1 for no limit
with_paths: true,
}
```
The revision range `["new-commit-sha", "--not", "old-commit-sha"]` gives you only blobs that are **new or changed** — blobs reachable from the new ref but not from the old one.
**Response** (streamed, multiple messages, blobs can span messages):
```protobuf
ListBlobsResponse {
blobs: [
Blob {
oid: "def456...", // set only on first chunk of each blob
size: 4096, // set only on first chunk
path: "src/new_file.rb", // set only on first chunk
data: <bytes>, // content (may be truncated by bytes_limit)
},
Blob {
oid: "", // empty = continuation of previous blob
size: 0,
path: "",
data: <more bytes>,
},
Blob {
oid: "789abc...", // new OID = next blob starts here
size: 230,
path: "src/changed_file.rb",
data: <bytes>,
},
]
}
```
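The chunking rule in the response above (a non-empty `oid` starts a blob, an empty `oid` continues the previous one) can be sketched as a small reassembler. This is an illustration under assumptions: `BlobChunk`, `AssembledBlob`, `BlobAssembler` and `assemble` are hypothetical names, and the structs are simplified stand-ins for the generated protobuf types. A continuation chunk that arrives with no blob in progress is silently ignored here; a real consumer would more likely treat it as a protocol error.

```rust
/// Hypothetical, simplified mirror of one `Blob` entry in a `ListBlobsResponse`.
#[derive(Debug, Clone, Default)]
struct BlobChunk {
    oid: String,
    size: i64,
    path: Vec<u8>,
    data: Vec<u8>,
}

/// A blob whose chunks have been concatenated.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AssembledBlob {
    oid: String,
    size: i64,
    path: Vec<u8>,
    data: Vec<u8>,
}

impl AssembledBlob {
    /// True when fewer bytes arrived than `size` reports, e.g. because of `bytes_limit`.
    fn is_truncated(&self) -> bool {
        (self.data.len() as i64) < self.size
    }
}

/// Holds the blob currently being received.
#[derive(Default)]
struct BlobAssembler {
    current: Option<AssembledBlob>,
}

impl BlobAssembler {
    /// Feeds one chunk. Returns the previously buffered blob when this chunk starts a new one.
    fn push(&mut self, chunk: BlobChunk) -> Option<AssembledBlob> {
        if chunk.oid.is_empty() {
            // Empty OID: continuation of the blob in progress.
            if let Some(current) = self.current.as_mut() {
                current.data.extend_from_slice(&chunk.data);
            }
            None
        } else {
            // Non-empty OID: a new blob starts, so the previous one is complete.
            self.current.replace(AssembledBlob {
                oid: chunk.oid,
                size: chunk.size,
                path: chunk.path,
                data: chunk.data,
            })
        }
    }

    /// Call once the stream ends to flush the last blob.
    fn finish(&mut self) -> Option<AssembledBlob> {
        self.current.take()
    }
}

/// Convenience wrapper that reassembles an already collected list of chunks.
fn assemble(chunks: Vec<BlobChunk>) -> Vec<AssembledBlob> {
    let mut assembler = BlobAssembler::default();
    let mut blobs = Vec::new();
    for chunk in chunks {
        if let Some(done) = assembler.push(chunk) {
            blobs.push(done);
        }
    }
    if let Some(last) = assembler.finish() {
        blobs.push(last);
    }
    blobs
}
```

Because `push` only ever buffers one blob, a streaming consumer can upsert each returned blob immediately and keep memory bounded by the largest single blob, which `bytes_limit` caps.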
---
### Putting it together
From the Rust consumer's perspective:
1. Call FindChangedPaths → collect deleted/renamed paths → remove from index
2. Call ListBlobs with the `--not` range → reassemble the streamed blobs → upsert into index
Both responses come through Workhorse as length-prefixed protobuf frames; Rails never touches the data.
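Reading length-prefixed frames on the Rust side could look like the sketch below. The issue does not specify the framing, so the 4-byte big-endian length prefix, the `MAX_FRAME_LEN` cap and the `read_frame` name are all assumptions made for illustration. Each returned payload would then be decoded as one `FindChangedPathsResponse` or `ListBlobsResponse` message by whatever protobuf library the consumer uses; that step is left out here.

```rust
use std::io::{self, Read};

/// Assumed upper bound on a single frame, to avoid allocating on a corrupt length.
const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Reads one frame: an assumed 4-byte big-endian length followed by that many payload bytes.
/// Returns `Ok(None)` when the stream ends cleanly on a frame boundary.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    let mut filled = 0;
    while filled < len_buf.len() {
        match reader.read(&mut len_buf[filled..]) {
            // End of stream before any prefix byte: no more frames.
            Ok(0) if filled == 0 => return Ok(None),
            // End of stream in the middle of the prefix: the stream was cut off.
            Ok(0) => {
                return Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "truncated length prefix",
                ))
            }
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }

    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "frame exceeds MAX_FRAME_LEN",
        ));
    }

    // `read_exact` fails with `UnexpectedEof` if the payload is cut short.
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(Some(payload))
}
```

Calling `read_frame` in a loop until it returns `Ok(None)` yields one protobuf message per iteration, so the consumer never needs to hold the whole response in memory.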