🚨 P0: CC host process crashes during TMB plugin self-dev - plugin destabilizes its host (2 CC deaths 2026-04-28)
## Severity

**P0 root-cause investigation. Distinct from #22.**

- **#22 = MCP resilience layer.** Treats trajectory-server child-process disconnect as a fact and adds bro-side read fallback + recovery doctrine + health-check hook.
- **This issue (#25) = CC-host crash root cause.** The 2026-04-28 incidents were **the entire CC session crashing**, not just the MCP child dying. Different + worse failure mode. The Human reports they have **never** seen CC crash during work on any other plugin, which is strong cross-plugin negative evidence that this is TMB-plugin-specific.

Both issues are P0 and complementary. #22 reduces blast radius for MCP-only disconnects (which can still happen independently). #25 hunts the bug that's killing the host.

## Symptom (corrected from earlier draft)

The earlier description of this issue conflated MCP death with CC death. Corrected:

- **MCP-only disconnect:** `mcp__plugin_tmb_trajectory-server__*` tools return 'no matching deferred tools'; CC session stays alive; bro can still answer text and run Bash but can't update task state. (See `feedback_mcp_recovery.md` memory.)
- **CC-host crash (this issue):** full Claude Code session terminates. Conversation context lost. User must relaunch CC entirely. Different signal from MCP disconnect.

The 2026-04-28 self-dev session experienced **two CC-host crashes**, not two MCP-only disconnects. Hypothesis: something the TMB plugin loads or emits is destabilizing CC's host process.

## Reproduction context

- **Session:** 2026-04-28, fresh `claude --plugin-dir TMB/plugin` (no `--resume` per the no-resume rule), self-dev workflow
- **Plugin version:** TMB plugin v0.5.0, `mcp/trajectory-server` running from `dist/index.js`
- **Death pattern:** mid-session, no obvious trigger event in the user-visible transcript
- **Cross-plugin baseline:** Human's experience "I never have such experience, I never see CC dead" working on any other plugin. This is strong negative evidence against "CC platform is just flaky."
## Hypothesis space (to investigate, not decide)

1. **Hook script misbehavior**: a TMB-shipped hook script (`scripts/hooks/*.sh`) emits non-zero exit + malformed stdout/stderr that CC's hook runner mishandles, killing the host
2. **Plugin-loaded JS with unhandled rejection**: skill loader / hook helpers / MCP client wrapper with a promise rejection that propagates to CC's main loop
3. **MCP tool-response shape**: a specific tool's return shape (oversize `task_create_batch` result, malformed JSON) crashes the SDK runtime in CC's plugin host
4. **Memory pressure cascade**: trajectory-server child OOM pressure spilling to parent
5. **Specific tool-call sequence**: repeatable pattern (e.g. `issue_create + discussion_append + task_create_batch + ledger_log` in one bro turn) that crosses a runtime threshold
6. **Concurrent worktree/git ops**: the Agent worktree mechanism + our hooks (`branch-up-to-date-with-remote.sh`, `git-guards.sh`, `git-push-guard.sh`) interacting badly under SWE spawn

None confirmed; that's the point of this issue.

## Investigation plan

1. **Add structured logging** to every TMB hook script (`scripts/hooks/*.sh`): append-only log line on entry, on each branch, on exit, with full env snapshot + exit code. Audit the log after the next CC death.
2. **Add structured logging** to every MCP tool entry/exit in `mcp/trajectory-server/src/server.ts` (or wherever the request handler lives): timestamp, tool name, payload size, duration, ok/error. Write to `${CLAUDE_PLUGIN_DATA}/mcp-trajectory.log`.
3. **Install `process.on('uncaughtException')` and `process.on('unhandledRejection')`** handlers in trajectory-server with full stack capture before exit.
4. **Survey CC plugin hooks docs for known crash modes**: does CC kill the host on hook timeout, on hook stderr emit, on malformed additionalContext JSON? Identify any guardrails we're inadvertently violating.
5. **Audit our hook scripts for non-zero-exit paths** that could be tripping CC's host.
   `set -euo pipefail` turns any unguarded command failure, unset-variable reference, or failed pipeline stage into a hard exit; we need to confirm CC handles that gracefully.
6. **Reproduce in isolation**: write a stress harness that drives the same tool-call pattern bro emitted in the 2026-04-28 sessions, watch for CC death.
7. **Check Anthropic GitHub issues** for plugin-host crash reports filtered to plugins with hooks + MCP.

## Acceptance criteria

- Logs from hook scripts + MCP tool calls cover at least one new CC death event (or 4+ hours of uneventful operation as negative evidence)
- `uncaughtException` / `unhandledRejection` handlers capture the stack on the next death
- Either: (a) a specific TMB-side bug is identified + fixed, OR (b) we have positive evidence of an upstream CC bug with a minimal repro to file at `anthropics/claude-code`

## Why P0

- Blocks self-dev (lost ~4hr of conversational context per crash on 2026-04-28)
- Will block external users at multi-hour sessions, eroding trust
- #22's resilience layer doesn't help: it only protects against MCP disconnect; CC-host crash bypasses every bro-side safeguard
- The bug, if ours, is currently shipping in v0.5.0 dist/

## Out of scope

- The MCP resilience layer in #22 (still needed; complementary)
- Switching off `--experimental-sqlite` (separate trade-off)
- Auto-respawning the CC host (impossible: CC owns its own lifecycle)

## Cross-refs

- GitLab #22 (MCP resilience, complementary P0)
- Memory `feedback_mcp_recovery.md` (MCP-only disconnect recovery procedure)
- Memory `feedback_cc_session_crashes_in_self_dev.md` (this issue's behavioral rule)
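
## Instrumentation sketch (investigation plan steps 2 and 3)

Steps 2 and 3 of the investigation plan could be wired up roughly as below. This is a minimal sketch, not the trajectory-server's actual code: `logLine`, `withToolLogging`, `installCrashHandlers`, and the `LogSink` type are hypothetical names, the temp-directory fallback for an unset `CLAUDE_PLUGIN_DATA` is an assumption, and the wrapper is deliberately generic because this issue does not say how tool handlers are registered in `server.ts`.

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

type LogRecord = Record<string, unknown>;
type LogSink = (record: LogRecord) => void;

// CLAUDE_PLUGIN_DATA is named in the plan; falling back to the OS temp
// directory is an assumption so the sketch also runs outside CC.
const LOG_PATH = join(process.env.CLAUDE_PLUGIN_DATA ?? tmpdir(), "mcp-trajectory.log");

// Append one JSON line synchronously so the record is on disk before any exit.
function logLine(record: LogRecord): void {
  mkdirSync(dirname(LOG_PATH), { recursive: true });
  appendFileSync(LOG_PATH, JSON.stringify({ ts: new Date().toISOString(), ...record }) + "\n");
}

// Render any thrown value as text, preferring the stack when one exists.
function describe(err: unknown): string {
  return err instanceof Error ? (err.stack ?? err.message) : String(err);
}

// Hypothetical wrapper: records tool name, payload size, duration, and
// ok/error for one handler, then returns or rethrows unchanged.
function withToolLogging<A, R>(
  tool: string,
  handler: (args: A) => Promise<R>,
  sink: LogSink = logLine,
): (args: A) => Promise<R> {
  return async (args: A): Promise<R> => {
    const started = Date.now();
    const serialized: string | undefined = JSON.stringify(args);
    const payloadBytes = Buffer.byteLength(serialized ?? "");
    try {
      const result = await handler(args);
      sink({ event: "tool", tool, payloadBytes, durationMs: Date.now() - started, ok: true });
      return result;
    } catch (err) {
      sink({
        event: "tool", tool, payloadBytes,
        durationMs: Date.now() - started, ok: false, error: describe(err),
      });
      throw err;
    }
  };
}

// Capture the stack for both failure classes, then exit non-zero.
function installCrashHandlers(sink: LogSink = logLine): void {
  process.on("uncaughtException", (err) => {
    sink({ event: "uncaughtException", stack: describe(err) });
    process.exit(1);
  });
  process.on("unhandledRejection", (reason) => {
    sink({ event: "unhandledRejection", stack: describe(reason) });
    process.exit(1);
  });
}
```

Under these assumptions, `installCrashHandlers()` would be called once at the top of the server entry point and each tool handler would be passed through `withToolLogging` before registration. Synchronous appends are chosen so the last record is flushed before `process.exit`; the injectable `sink` exists only so the wrapper can be exercised without touching the real log file.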
issue