v0.7.0 — token-burn: Paginate DB-read responses (issue_get_with_discussions, audit_log_list, file_registry_list)
## Problem (rewritten per Human review 2026-05-18)
The original plan added a hardcoded `limit: 10` default to DB-read tools. Human review pushed back: **pagination is a band-aid, not a fix.** The DB exists to (a) save tokens vs grepping the whole codebase, (b) provide anti-hallucination over MEMORY.md. If queries cost more than `grep` or are less reliable than memory, the DB has no reason to exist.
The real fix is **search-first architecture**: bro stops dumping rows; bro asks the DB for the top-K most relevant snippets. Two retrieval modes are needed:
1. **Keyword / exact / phrase** — when bro knows the term (e.g., "what audit events did `task_create_batch` emit yesterday?") → FTS5 + BM25 ranking.
2. **Semantic / fuzzy** — when bro doesn't know the exact phrasing (e.g., "what did we decide about authentication?" against a body that says "login flow security") → lightweight RAG via embedding-based vector search.
## Stack decisions (locked, per Human review)
| Layer | Choice | Why |
|---|---|---|
| Keyword search | **FTS5 (built into `node:sqlite` 3.51.3)** | Zero install; Node is CC's prerequisite; BM25 + phrase + prefix + snippets work today |
| Embedding runtime | **`@huggingface/transformers` with `onnxruntime-node`** | Prebuilt native binaries (darwin x64/arm64, linux x64/arm64, win x64); 5–10× faster than WASM-only; ~50ms warm query |
| Embedding model | **bge-small-en-v1.5** (33 MB, 384-dim) | +5–8% English quality over MiniLM-L6 at same speed bucket; lazy-downloaded on first run |
| Vector storage | **`embedding BLOB` (Float32 packed) in dedicated per-content tables** | No sqlite-vec native binary (avoids second cross-platform binary); brute-force cosine in JS over ~1k vectors is 5–10ms |
| Vector search | **Brute-force JS cosine** | Trivial code; at TMB scale (dozens-to-thousands of rows) fast enough; can add sqlite-vec later if scale grows past 10K rows |
| Hybrid ranking | **Reciprocal-Rank-Fusion (RRF)** over FTS5 + cosine + recency-decay weight | Standard hybrid; deterministic; tunable α |
## Schema strategy
Per-content embedding tables (NOT a column on the source table):
```sql
CREATE TABLE discussions_embeddings (
discussion_id INTEGER PRIMARY KEY REFERENCES discussions(id) ON DELETE CASCADE,
embedding BLOB NOT NULL, -- 384 × Float32 = 1536 bytes
model_id TEXT NOT NULL, -- 'bge-small-en-v1.5'
embedded_at TEXT NOT NULL
);
-- Same shape for audit_embeddings, file_registry_embeddings
```
Why separate tables (Option A vs adding a BLOB column to each content table):
- Clean separation of embedding lifecycle from content lifecycle
- Easy to drop+rebuild per-table when changing model
- No NULL columns when embeddings missing / not-yet-generated
- ON DELETE CASCADE keeps consistency automatically
- Migration v3 adds 3 new tables; doesn't touch existing schemas' shapes
## New MCP tools (replace/supplement existing dump-style tools)
| New tool | Replaces | Default response |
|---|---|---|
| `discussion_search(query, k=5, last_n=20, recency_alpha=0.3)` | `issue_get_with_discussions` (which currently joins ALL) | Top-K snippets with relevance + recency score; `~500 tokens` per response |
| `audit_search(query, event_types?, k=5, ...)` | `audit_log_list` (which has no limit) | Same shape |
| `file_registry_search(query, kind?, path_prefix?, k=10)` | `file_registry_list` (925 rows = 36 KB tokens today) | Same shape; path_prefix is exact-match shortcut |
Existing `*_list` / `*_get_with_discussions` tools STAY for explicit "I need the full set" cases but gain `limit` + `cursor` for pagination — search is the default path bro uses.
## Existential validation (per Human's constraint)
1. **Does search cost less than grep?** Yes: 1 search call = ~500 tokens for top-5 snippets. `grep -r` over a comparably-sized repo + reading hits = thousands of tokens for the read alone. Verified against TMB scale (143 KB file_registry, 25 KB discussions).
2. **Is DB more reliable than MEMORY.md?** Yes: SQL queries return canonical current state. MEMORY.md drifts. Embeddings get auto-recomputed on write via triggers.
## Execution shape — TWO sequential SWE tasks
### SWE 1 — FTS5 infrastructure + search MCP tools (no embeddings yet)
Branch: `feat/search-fts5-infrastructure`. Schema migration v3 adds FTS5 virtual tables + triggers for auto-sync. New MCP tools (`discussion_search`, `audit_search`, `file_registry_search`) ship with FTS5-only ranking (BM25 + recency-decay). Existing `*_list` tools gain `limit` + `cursor` for the "full data" path.
**Acceptance**: search returns ≤1 KB top-5 snippets for representative queries; FTS5 keyword ranking works; existing tools still work with new optional pagination; L0+L1+L2+L3+L4 all green.
### SWE 2 — ONNX embedding pipeline + hybrid ranking + skill updates
Branch: `feat/search-rag-onnx-embeddings`, off SWE 1's branch (stacked). Adds `@huggingface/transformers` + bge-small-en-v1.5; per-content `*_embeddings` tables; sync triggers on write (server generates embedding inline); brute-force cosine search; RRF hybrid ranking with FTS5. Updates `tmb_planning` / `tmb_review` / `tmb_recovery` skills to use `*_search` tools as the default. Backfills existing rows on migration.
**Acceptance**: semantic queries (e.g. "what did we decide about authentication" against bodies that say "login flow security") return relevant results in top-3; query latency p95 < 200ms on warm cache; backfill of current DB (25 discussions + 67 audit + 925 file_registry) completes in <30 seconds.
## Acceptance criteria (overall)
- All 3 new search tools (`discussion_search`, `audit_search`, `file_registry_search`) ship and pass L2+L3 tests
- Embedding pipeline runs on every write (no manual backfill required for new data)
- v2→v3 migration includes embedding backfill for existing rows
- Bro's reads-per-turn token cost drops by ≥80% on representative sessions (gated by #2921 measurement harness if available, else subjective benchmark)
- No regression in any existing test suite layer
- Lightweight: native deps stay at one (`onnxruntime-node`); no sqlite-vec; no vector DB
## Out of scope
- Switching off `node:sqlite` to `better-sqlite3` or another driver
- Adding sqlite-vec (defer unless brute-force cosine becomes a hot spot at >10K rows)
- Re-embedding on model change (handled by future model-bump issue)
- Removing the existing `*_list` tools entirely (kept with pagination for explicit-full-dump needs)
- Server-side query embedding cache (could be a follow-up if bro re-searches same query within a session)
## Coordination
- Pairs with #2916 (compact responses) — search IS the compact-response mechanism for read-heavy tools
- Unblocks downstream parts of #2904 (slim-skills) — skills that currently say "ask the DB for X" can switch to `*_search` and get the right shape automatically
- The measurement harness from #2921 (deferred) would quantify the ≥80% token-burn reduction; without it we use a subjective benchmark
- Resolves the Human's existential challenge ("if DB costs more than grep or is less reliable than memory.md, no point") — search makes both metrics win
## Note on source
Original plan was hardcoded `limit: 10` pagination. Pivoted per Human review (!2905 note 2026-05-18): pagination treats the symptom; search treats the cause. The DB exists to save tokens and prevent hallucination; querying must be cheap and authoritative. Stack chosen for "lightweight RAG with no user dep issues": FTS5 (zero deps) + ONNX runtime (one prebuilt native binary covering common platforms) + bge-small-en-v1.5 (33MB lazy-downloaded model) + brute-force cosine over Float32 BLOBs (no second native binary).
issue