v0.7.0 — token-burn: Honor-system sqlite3 fallback — replace with a single bro-side recovery wrapper

Problem

When the trajectory-server MCP becomes unresponsive, bro and pr-reviewer fall back to direct sqlite3 Bash calls per the tmb_recovery skill's "trajectory-server unreachable" section. This is the correct safety design, but the recovery flow involves multi-tier retries, and each tier's output lands in bro's context.

Evidence (verified against live source)

  • skills/tmb_recovery/SKILL.md (127 LOC) §C "trajectory-server unreachable" — exists, documents the degraded-readonly fallback path via scripts/bro-sqlite-readonly.sh (referenced).
  • tmb_recovery allowed-tools frontmatter: Bash(skills/tmb_recovery/scripts/bro-sqlite-readonly.sh:*), mcp__plugin_tmb_trajectory-server__audit_log, mcp__plugin_tmb_trajectory-server__discussion_append — narrowly scoped to invoking the helper script (good) plus minimal MCP writes.
  • scripts/hooks/mcp-health-check.sh registered in both SessionStart and UserPromptSubmit hook arrays — emits status that bro consumes.
  • Multi-tier retries: today bro often tries the MCP call, fails, reads feedback_mcp_recovery.md (user memory) to decide which tier to use, and retries; each step's output enters context.

Plan

The partial wrapper already exists (bro-sqlite-readonly.sh). The work is to make recovery transparent and reduce context spend on tier-decisions.

  1. Promote the helper to a tool-call shape — wrap the sqlite3 fallback as an MCP-like helper that bro invokes uniformly:
    • For read tools: mcp_safe_read(tool, args) tries MCP once, falls back to bro-sqlite-readonly.sh on failure, returns a unified JSON result
    • For write tools: explicit mcp_safe_write(tool, args) — tries MCP, HALTs on failure with a clear escalation prompt (no honor-system writes per !2900 Cause E lesson)
  2. Update tmb_recovery skill to point at the helper directly, replacing prose tier-instructions with a 5-line invocation pattern.
  3. Add an audit event mcp_recovery_path_taken with content_json: {tier, latency_ms, tool}, to quantify recovery cost over a release cycle.
  4. Reduce per-block context: mcp-health-check.sh should emit a single one-line status, not a multi-line diagnostic. The diagnostic goes to a separate file readable on demand.
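
A minimal sketch of how plan items 1 and 3 could fit together, written in Python purely for illustration (the real helper may well be a shell script). Several things here are assumptions rather than existing behavior: the calling convention of bro-sqlite-readonly.sh (tool name, then JSON args, JSON on stdout), the tier labels "mcp", "sqlite_readonly" and "halt", the injected mcp_call and audit callables, and the exact shape of the unified result.

```python
import json
import subprocess
import time
from typing import Any, Callable, Optional

# Script path as named in the skill's allowed-tools. The argument convention
# used below (tool name, JSON args, JSON on stdout) is an ASSUMPTION.
READONLY_HELPER = "skills/tmb_recovery/scripts/bro-sqlite-readonly.sh"

McpCall = Callable[[str, dict], Any]   # hypothetical: performs one MCP tool call
Audit = Callable[[str, dict], None]    # hypothetical: records one audit event


def run_readonly_helper(tool: str, args: dict) -> Any:
    """Hypothetical adapter around the existing read-only helper script."""
    proc = subprocess.run(
        [READONLY_HELPER, tool, json.dumps(args)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


def _record(audit: Optional[Audit], tier: str, tool: str, start: float) -> None:
    """Emit mcp_recovery_path_taken with the fields named in plan item 3."""
    if audit is not None:
        audit("mcp_recovery_path_taken", {
            "tier": tier,
            "latency_ms": int((time.monotonic() - start) * 1000),
            "tool": tool,
        })


def mcp_safe_read(tool: str, args: dict, mcp_call: McpCall,
                  fallback: McpCall = run_readonly_helper,
                  audit: Optional[Audit] = None) -> dict:
    """Try MCP once; on failure use the read-only fallback. One result shape."""
    start = time.monotonic()
    try:
        return {"ok": True, "tier": "mcp", "data": mcp_call(tool, args)}
    except Exception:
        # A real helper would likely catch only transport errors here.
        pass
    data = fallback(tool, args)
    _record(audit, "sqlite_readonly", tool, start)
    return {"ok": True, "tier": "sqlite_readonly", "data": data}


def mcp_safe_write(tool: str, args: dict, mcp_call: McpCall,
                   audit: Optional[Audit] = None) -> dict:
    """Try MCP once; on failure HALT. Never attempts a sqlite3 write."""
    start = time.monotonic()
    try:
        return {"ok": True, "tier": "mcp", "data": mcp_call(tool, args)}
    except Exception:
        _record(audit, "halt", tool, start)
        return {"ok": False, "tier": "halt", "error": "MCP unavailable; HALT"}
```

With this shape, bro sees one compact dictionary per call regardless of which path ran. One open point the sketch does not settle: audit_log is itself an MCP tool, so where the audit event is buffered while the server is down would need to be decided separately.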

Acceptance criteria

  • New L3 test: simulate MCP unavailability; verify a read-tool call returns successfully via fallback in ≤1 Bash + ≤500 tokens of context.
  • Write-tool fallback explicitly returns "MCP unavailable; HALT" without attempting sqlite3 (preserves !2900 safety).
  • feedback_mcp_recovery.md (user memory) updated to point at the new helper rather than describing tiers manually.
  • tmb_recovery skill body shrinks by ≥20 LOC (replaces prose with helper invocation).
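
The first two criteria could be checked along the following lines. This is a sketch under stated assumptions, not the project's L3 harness: approx_tokens is a rough 4-characters-per-token heuristic rather than a real tokenizer, the wrapper signatures are hypothetical, demo_read and demo_write are stand-ins so the example runs by itself, and patching subprocess.run only detects sqlite3 attempts made through that call.

```python
import json
from unittest import mock

TOKEN_BUDGET = 500  # from the acceptance criterion


def approx_tokens(text: str) -> int:
    """Rough heuristic of about 4 characters per token (ASSUMPTION)."""
    return (len(text) + 3) // 4


def mcp_down(tool, args):
    """Simulates an unreachable trajectory-server."""
    raise ConnectionError("simulated outage")


def check_read(safe_read) -> dict:
    """Run one read through the wrapper and measure the fallback cost."""
    bash_calls = []

    def fallback(tool, args):
        bash_calls.append(tool)  # counts fallback (Bash) invocations
        return [{"id": 1, "status": "open"}]

    result = safe_read("example_read", {"id": 1}, mcp_down, fallback)
    return {
        "ok": result.get("ok") is True,
        "bash_calls": len(bash_calls),
        "tokens": approx_tokens(json.dumps(result)),
    }


def check_write(safe_write) -> dict:
    """Run one write with MCP down and confirm no subprocess was spawned."""
    with mock.patch("subprocess.run") as run:
        result = safe_write("example_write", {"x": 1}, mcp_down)
    return {"error": result.get("error"), "subprocess_used": run.called}


# Stand-in wrappers (hypothetical); the real helper would be imported instead.
def demo_read(tool, args, mcp_call, fallback):
    try:
        return {"ok": True, "tier": "mcp", "data": mcp_call(tool, args)}
    except ConnectionError:
        return {"ok": True, "tier": "sqlite_readonly", "data": fallback(tool, args)}


def demo_write(tool, args, mcp_call):
    try:
        return {"ok": True, "tier": "mcp", "data": mcp_call(tool, args)}
    except ConnectionError:
        return {"ok": False, "error": "MCP unavailable; HALT"}


if __name__ == "__main__":
    read = check_read(demo_read)
    assert read["ok"] and read["bash_calls"] <= 1
    assert read["tokens"] <= TOKEN_BUDGET
    write = check_write(demo_write)
    assert write == {"error": "MCP unavailable; HALT", "subprocess_used": False}
```

Note that the token figure here covers only the returned result; measuring what actually enters bro's context would need the real session transcript.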

Out of scope

  • Replacing the MCP transport itself.
  • Auto-restart of MCP server.

Coordination

  • Pairs with #2917 (block-recovery cost) — both reduce context cost of failure paths.

Note on source

The previous description claimed feedback_mcp_recovery.md "describes 3 recovery tiers" without verifying it. Verified that the tmb_recovery skill exists with the helper already partially in place. Adjusted the plan from "build the wrapper" to "promote the existing helper to a transparent shape."

Edited by Zax Shen