P0: CC host process crashes during TMB plugin self-dev; plugin destabilizes its host (2 CC deaths on 2026-04-28)
## Severity
**P0 root-cause investigation. Distinct from #22.**
- **#22 = MCP resilience layer.** Treats trajectory-server child-process disconnect as a fact and adds bro-side read fallback + recovery doctrine + health-check hook.
- **This issue (#25) = CC-host crash root cause.** The 2026-04-28 incidents were **the entire CC session crashing**, not just the MCP child dying. This is a different and worse failure mode. The Human reports they have **never** seen CC crash during work on any other plugin, which is strong cross-plugin negative evidence that the problem is TMB-plugin-specific.
Both issues are P0 and complementary. #22 reduces blast radius for MCP-only disconnects (which can still happen independently). #25 hunts the bug that's killing the host.
## Symptom (corrected from earlier draft)
The earlier description of this issue conflated MCP death with CC death. Corrected:
- **MCP-only disconnect:** `mcp__plugin_tmb_trajectory-server__*` tools return 'no matching deferred tools'; CC session stays alive; bro can still answer text and run Bash but can't update task state. (See `feedback_mcp_recovery.md` memory.)
- **CC-host crash (this issue):** full Claude Code session terminates. Conversation context lost. User must relaunch CC entirely. Different signal from MCP disconnect.
The 2026-04-28 self-dev session experienced **two CC-host crashes**, not two MCP-only disconnects. Hypothesis: something the TMB plugin loads or emits is destabilizing CC's host process.
## Reproduction context
- **Session:** 2026-04-28, fresh `claude --plugin-dir TMB/plugin` (no `--resume` per the no-resume rule), self-dev workflow
- **Plugin version:** TMB plugin v0.5.0, `mcp/trajectory-server` running from `dist/index.js`
- **Death pattern:** mid-session, no obvious trigger event in the user-visible transcript
- **Cross-plugin baseline:** the Human's experience working on any other plugin is "I never have such experience, I never see CC dead", which is strong negative evidence against "CC platform is just flaky."
## Hypothesis space (to investigate, not decide)
1. **Hook script misbehavior:** a TMB-shipped hook script (`scripts/hooks/*.sh`) emits a non-zero exit plus malformed stdout/stderr that CC's hook runner mishandles, killing the host
2. **Plugin-loaded JS with unhandled rejection:** skill loader / hook helpers / MCP client wrapper with a promise rejection that propagates to CC's main loop
3. **MCP tool-response shape:** a specific tool's return shape (oversize `task_create_batch` result, malformed JSON) crashes the SDK runtime in CC's plugin host
4. **Memory pressure cascade:** trajectory-server child OOM pressure spilling over to the parent
5. **Specific tool-call sequence:** a repeatable pattern (e.g. `issue_create + discussion_append + task_create_batch + ledger_log` in one bro turn) that crosses a runtime threshold
6. **Concurrent worktree/git ops:** the Agent worktree mechanism and our hooks (`branch-up-to-date-with-remote.sh`, `git-guards.sh`, `git-push-guard.sh`) interacting badly under SWE spawn
None confirmed; that's the point of this issue.
## Investigation plan
1. **Add structured logging** to every TMB hook script (`scripts/hooks/*.sh`): an append-only log line on entry, on each branch, and on exit, with a full env snapshot and the exit code. Audit the log on the next CC death.
2. **Add structured logging** to every MCP tool entry/exit in `mcp/trajectory-server/src/server.ts` (or wherever the request handler lives): timestamp, tool name, payload size, duration, ok/error. Write to `${CLAUDE_PLUGIN_DATA}/mcp-trajectory.log`.
3. **Install `process.on('uncaughtException')` and `process.on('unhandledRejection')`** handlers in trajectory-server with full stack capture before exit.
4. **Survey CC plugin hooks docs for known crash modes:** does CC kill the host on hook timeout, on hook stderr output, or on malformed `additionalContext` JSON? Identify any guardrails we're inadvertently violating.
5. **Audit our hook scripts for non-zero-exit paths** that could be tripping CC's host. `set -euo pipefail` turns any unchecked command failure, unset-variable reference, or failed pipeline stage into an immediate non-zero exit; we need to confirm CC handles that gracefully.
6. **Reproduce in isolation:** write a stress harness that drives the same tool-call pattern bro emitted in the 2026-04-28 sessions, and watch for CC death.
7. **Check Anthropic GitHub issues** for plugin-host crash reports filtered to plugins with hooks + MCP.
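
Step 2 could be implemented as a thin wrapper around each tool handler. The sketch below is a minimal illustration, not the trajectory-server's actual code: `withToolLogging`, `fileSink`, `ToolLogRecord`, and the record fields are hypothetical names, and it assumes each handler is an async function taking a single argument object. It writes an `enter` record before the handler runs, so a call that was in flight when the host died shows up as an `enter` with no matching `exit`.

```typescript
import { appendFileSync } from "node:fs";

// Hypothetical record shape; not an existing TMB log schema.
interface ToolLogRecord {
  ts: string;               // ISO-8601 timestamp
  phase: "enter" | "exit";
  tool: string;
  requestBytes: number;     // JSON size of the arguments
  responseBytes?: number;   // JSON size of the result (successful exit only)
  durationMs?: number;      // exit only
  ok?: boolean;             // exit only
  error?: string;           // stack or message on failure
}

type ToolHandler<A, R> = (args: A) => Promise<R>;
type LogSink = (line: string) => void;

// Byte length of a value's JSON encoding; -1 if it cannot be serialized.
function jsonBytes(value: unknown): number {
  try {
    const text = JSON.stringify(value);
    return text === undefined ? 0 : Buffer.byteLength(text, "utf8");
  } catch {
    return -1;
  }
}

// Synchronous append: the write is handed to the OS before the handler continues.
function fileSink(path: string): LogSink {
  return (line) => appendFileSync(path, line + "\n");
}

// Logging must never change tool behavior, so sink failures are swallowed.
function emit(sink: LogSink, record: ToolLogRecord): void {
  try {
    sink(JSON.stringify(record));
  } catch {
    /* ignore logging failures */
  }
}

function withToolLogging<A, R>(
  tool: string,
  handler: ToolHandler<A, R>,
  sink: LogSink,
): ToolHandler<A, R> {
  return async (args: A): Promise<R> => {
    const started = Date.now();
    const requestBytes = jsonBytes(args);
    emit(sink, { ts: new Date().toISOString(), phase: "enter", tool, requestBytes });
    try {
      const result = await handler(args);
      emit(sink, {
        ts: new Date().toISOString(), phase: "exit", tool, requestBytes,
        responseBytes: jsonBytes(result), durationMs: Date.now() - started, ok: true,
      });
      return result;
    } catch (err) {
      const error = err instanceof Error ? (err.stack ?? err.message) : String(err);
      emit(sink, {
        ts: new Date().toISOString(), phase: "exit", tool, requestBytes,
        durationMs: Date.now() - started, ok: false, error,
      });
      throw err; // rethrow so callers still see the original failure
    }
  };
}
```

Wiring might look like `withToolLogging("task_create_batch", handler, fileSink(process.env.CLAUDE_PLUGIN_DATA + "/mcp-trajectory.log"))`, assuming `CLAUDE_PLUGIN_DATA` is exposed to the server as an environment variable. Logging sizes rather than full payloads keeps the log small while still surfacing the oversize-response hypothesis.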
## Acceptance criteria
- Logs from hook scripts + MCP tool calls cover at least one new CC death event (or 4+ hours of uneventful operation as negative evidence)
- `uncaughtException` / `unhandledRejection` capture stack on next death
- Either: (a) a specific TMB-side bug is identified + fixed, OR (b) we have positive evidence of an upstream CC bug with a minimal repro to file at `anthropics/claude-code`
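
The stack-capture criterion depends on handlers being installed before anything else runs in trajectory-server. The following is a minimal sketch under assumptions: `installCrashHandlers` and `formatCrashRecord` are hypothetical names, the log path is supplied by the caller, and records go to a file rather than stdout on the assumption that stdout may be reserved for MCP protocol traffic.

```typescript
import { appendFileSync } from "node:fs";

// Build one JSON line describing the failure; prefers the stack when available.
function formatCrashRecord(kind: string, reason: unknown): string {
  const detail =
    reason instanceof Error
      ? (reason.stack ?? `${reason.name}: ${reason.message}`)
      : String(reason);
  return JSON.stringify({
    ts: new Date().toISOString(),
    kind,
    pid: process.pid,
    detail,
  });
}

// Hypothetical helper: call once at startup, before the server begins handling requests.
function installCrashHandlers(logPath: string): void {
  const record = (kind: string, reason: unknown): void => {
    try {
      // Synchronous write so the record is handed to the OS before we exit.
      appendFileSync(logPath, formatCrashRecord(kind, reason) + "\n");
    } catch {
      /* the process is already dying; nothing useful left to do */
    }
  };

  process.on("uncaughtException", (err) => {
    record("uncaughtException", err);
    process.exit(1); // installing a handler suppresses Node's default exit
  });

  process.on("unhandledRejection", (reason) => {
    record("unhandledRejection", reason);
    process.exit(1);
  });
}
```

These handlers only observe failures inside the trajectory-server process. If the next CC death leaves no record here, that is itself evidence pointing away from hypothesis 2 for the MCP child and toward the hook or host-side hypotheses.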
## Why P0
- Blocks self-dev (lost ~4hr of conversational context per crash on 2026-04-28)
- Will block external users at multi-hour sessions, eroding trust
- #22's resilience layer doesn't help: it only protects against MCP disconnects; a CC-host crash bypasses every bro-side safeguard
- The bug, if ours, is currently shipping in v0.5.0 dist/
## Out of scope
- The MCP resilience layer in #22 (still needed; complementary)
- Switching off `--experimental-sqlite` (separate trade-off)
- Auto-respawning the CC host (impossible: CC owns its own lifecycle)
## Cross-refs
- GitLab #22 (MCP resilience, complementary P0)
- Memory `feedback_mcp_recovery.md` (MCP-only disconnect recovery procedure)
- Memory `feedback_cc_session_crashes_in_self_dev.md` (this issue's behavioral rule)