Create a scalable deterministic node id for indexed code
Problem
The code indexer generates deterministic node IDs by hashing component strings (path, name, etc.) using FxHasher and casting to i64:
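    use std::hash::{Hash, Hasher};
    use rustc_hash::FxHasher; // assumed source of FxHasher (rustc-hash crate)

    // Hash all components into one 64-bit value and reinterpret it as i64.
    fn compute_id(components: &[&str]) -> i64 {
        let mut hasher = FxHasher::default();
        components.hash(&mut hasher);
        hasher.finish() as i64
    }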
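FxHasher produces 64-bit hashes. By the birthday bound, and assuming uniformly distributed hashes, the probability of at least one collision reaches 50% at roughly 5 billion nodes (about 1.18 × 2^32) and is already around 2.7% at 1 billion. For large monorepos with millions of definitions across many branches, this becomes a real concern.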
The same limit applies to edges, which reference these node IDs as source_id and target_id.
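The birthday-bound figures can be checked with a few lines of Rust. This is a minimal sketch that assumes hashes are uniformly distributed over the ID space; the function names `collision_probability` and `fifty_percent_threshold` are illustrative and do not exist in the indexer.

```rust
/// Approximate probability of at least one collision among `n` IDs drawn
/// uniformly from a space of 2^bits values: p ≈ 1 - exp(-n² / 2^(bits+1)).
fn collision_probability(n: f64, bits: i32) -> f64 {
    let space = 2f64.powi(bits);
    // -expm1(-x) equals 1 - e^(-x) and stays accurate when x is tiny.
    -(-(n * n) / (2.0 * space)).exp_m1()
}

/// Number of IDs at which the collision probability reaches 50%:
/// n ≈ sqrt(2 · ln 2 · 2^bits).
fn fifty_percent_threshold(bits: i32) -> f64 {
    (2.0 * std::f64::consts::LN_2 * 2f64.powi(bits)).sqrt()
}
```

Under these assumptions, `fifty_percent_threshold(64)` comes out at about 5.06 × 10^9 and `fifty_percent_threshold(128)` at about 2.17 × 10^19, while `collision_probability(1e8, 64)` is roughly 0.03%.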
Proposed Solution
Investigate switching to Int128 for node and edge IDs. Combined with a 128-bit hash, this would push the 50% collision threshold to roughly 2 × 10^19 nodes.
Key areas that need changes:
- ID generation - Update compute_id() to return i128, possibly using a 128-bit hash like xxHash3-128 or SipHash-128
- Schema types - Add Int128 support to Arrow type mappings and ClickHouse DDL generation
- Query engine - Update InputNode::node_ids, InputIdRange, and ID filter functions to use i128
- JSON serialization - JSON doesn't have native Int128; need to serialize as strings to avoid precision loss
- Result formatting - Add ColumnValue::Int128 variant and handle in response formatting
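As a rough illustration of the ID generation item, the sketch below shows what a 128-bit `compute_id` could look like. To stay within the standard library it uses FNV-1a 128 as a stand-in hash; that is an assumption made for this example only, not the xxHash3-128 or SipHash-128 options named above, and FNV-1a is a weaker hash than either. The names `compute_id_128` and `fnv1a_128` are hypothetical.

```rust
// FNV-1a 128-bit parameters: offset basis and prime (2^88 + 0x13b).
const FNV128_OFFSET: u128 = 0x6c62272e07bb014262b821756295c58d;
const FNV128_PRIME: u128 = (1 << 88) | 0x13b;

/// Folds `bytes` into a running FNV-1a 128 state.
fn fnv1a_128(state: u128, bytes: &[u8]) -> u128 {
    bytes
        .iter()
        .fold(state, |h, &b| (h ^ b as u128).wrapping_mul(FNV128_PRIME))
}

/// Hypothetical 128-bit counterpart of `compute_id`.
fn compute_id_128(components: &[&str]) -> i128 {
    let mut h = FNV128_OFFSET;
    for c in components {
        // Length-prefix each component so ["ab", "c"] and ["a", "bc"]
        // feed different byte streams into the hash.
        h = fnv1a_128(h, &(c.len() as u64).to_le_bytes());
        h = fnv1a_128(h, c.as_bytes());
    }
    // Reinterpret the bits as signed, mirroring the existing `as i64` cast.
    h as i128
}
```

The explicit length prefix and little-endian encoding keep the result independent of platform and of any standard-library hashing details, which matters if IDs are persisted and compared across indexer versions.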
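The JSON serialization is the trickiest part. JSON itself places no limit on integer size, but many consumers (JavaScript in particular) parse numbers as IEEE 754 doubles and lose precision for values exceeding 2^53, so a 128-bit ID emitted as a bare number is unsafe regardless of how serde_json writes it. String representation is the safest approach but requires client-side parsing.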
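The string approach can be sketched without any JSON library. The `ColumnValue` enum below is a hypothetical, reduced stand-in for the real type (only two variants are shown), and `to_json` and `parse_id` are illustrative names rather than existing functions.

```rust
/// Hypothetical, reduced result value; other variants are omitted.
#[derive(Debug, PartialEq)]
enum ColumnValue {
    Int64(i64),
    Int128(i128),
}

impl ColumnValue {
    /// Renders the value as a JSON fragment.
    fn to_json(&self) -> String {
        match self {
            // Bare JSON number; double-based parsers are exact only up to 2^53.
            ColumnValue::Int64(v) => v.to_string(),
            // Quoted decimal string, so no consumer rounds it to a double.
            ColumnValue::Int128(v) => format!("\"{}\"", v),
        }
    }
}

/// Client-side counterpart: parse a quoted decimal ID back into an i128.
fn parse_id(json_fragment: &str) -> Option<i128> {
    json_fragment.trim_matches('"').parse().ok()
}
```

A value such as `i128::MAX` round-trips exactly through `to_json` and `parse_id`, whereas converting `2^53 + 1` to `f64` and back already yields a different integer, which is the failure mode the string encoding avoids.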