Token counter undercounts in several test cases, causing "prompt is too long" errors
## Problem to solve
As a Duo Workflow user, I want the conversation trimmer to accurately count tokens, so I can avoid "prompt is too long" errors during my workflow sessions.
The ApproximateTokenCounter undercounts tokens in several scenarios, causing the trimmer to believe there's more context window space than actually exists. This results in prompts that exceed the model's limit.
## Root Cause
The token counting formula `len(content) // 4 * 1.5` assumes ~4 characters per token, with the 1.5x multiplier acting as a 50% buffer. This assumption breaks down for (see the sketch after this list):
- Unicode/non-ASCII content: Chinese, Japanese, emojis tokenize into many more tokens than their character count suggests
- Short messages: Per-message overhead (role markers, structure) isn't accounted for
- JSON/punctuation-heavy content: Special characters tokenize less efficiently than plain text
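A quick way to see the gap is to compare the heuristic against tiktoken directly. This is a minimal sketch: `approximate_tokens` paraphrases the formula above rather than the actual `ApproximateTokenCounter` code, `cl100k_base` is an assumed encoding, and the sample strings are illustrative, so exact numbers will differ from the table below.

```python
import tiktoken

def approximate_tokens(content: str) -> int:
    # current heuristic: ~4 characters per token, times a 1.5x buffer
    return int(len(content) // 4 * 1.5)

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "severe unicode": "中文" * 1000,
    "short message": "Sounds good, thanks!",
    "json heavy": '{"id": 1, "tags": ["a", "b"], "ok": true}' * 25,
}

for label, text in samples.items():
    approx = approximate_tokens(text)
    actual = len(enc.encode(text))
    print(f"{label}: approx={approx} tiktoken={actual} "
          f"error={(approx - actual) / actual:+.1%}")
```

Note that this plain-string comparison does not even include per-message overhead (role markers, structure), which makes the short-message case worse in practice.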
## Additional Bug
The `_pretrim_large_messages` function in `trimmer.py` calls `count_tokens([message])` without passing `include_tool_tokens=False`, so it adds ~5,650 tool-definition tokens to every per-message check. This makes nearly every message appear to exceed the single-message limit, triggering unnecessary placeholder replacements.
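A minimal sketch of the issue, assuming a loop-and-check shape for `_pretrim_large_messages`; the real function in `trimmer.py` may differ, and `make_placeholder` is a hypothetical helper:

```python
def _pretrim_large_messages(messages, counter, single_message_limit):
    """Illustrative shape only; not the actual trimmer.py implementation."""
    for i, message in enumerate(messages):
        # Bug: this call folds the ~5,650-token tool definitions into
        # every per-message measurement, so almost every message appears
        # to exceed single_message_limit:
        size = counter.count_tokens([message])
        # Fix: exclude tool tokens when sizing an individual message:
        # size = counter.count_tokens([message], include_tool_tokens=False)
        if size > single_message_limit:
            messages[i] = make_placeholder(message)  # hypothetical helper
    return messages
```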
## Proposal
Replace the current `ApproximateTokenCounter` with tiktoken for accurate token counting. It's OpenAI's official tokenizer and handles the edge cases (Unicode, JSON/punctuation, message overhead) that our approximation misses.
Also fix the `_pretrim_large_messages` bug by passing `include_tool_tokens=False` when checking individual message sizes.
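A sketch of what the replacement could look like. The class name, method signature, encoding choice, and the overhead constant are assumptions for illustration (the ~5,650 figure comes from this issue), not the actual Duo Workflow interface:

```python
import tiktoken

TOOL_TOKEN_OVERHEAD = 5_650  # approximate tool-definition cost noted above

class TiktokenCounter:
    """Illustrative drop-in for ApproximateTokenCounter."""

    def __init__(self, encoding_name: str = "cl100k_base") -> None:
        # encoding choice is an assumption; pick the one matching the model
        self._encoding = tiktoken.get_encoding(encoding_name)

    def count_tokens(self, messages, include_tool_tokens: bool = True) -> int:
        # assumes each message exposes a .content string
        total = sum(len(self._encoding.encode(m.content)) for m in messages)
        if include_tool_tokens:
            total += TOOL_TOKEN_OVERHEAD
        return total
```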
## Further details
### Failure Cases
Tested against tiktoken (OpenAI's actual tokenizer):
| Content Type | ApproximateTokenCounter | tiktoken | Error |
|---|---|---|---|
| Severe Unicode (`"中文" * 1000`) | 757 | 1,000 | -24.3% |
| Short messages | 10 | 27 | -63.0% |
| Unicode/emoji heavy | 271 | 564 | -52.0% |
| JSON heavy | 423 | 453 | -6.6% |
## Links / references
- Originating Slack thread: https://gitlab.slack.com/archives/C08TUAH45NG/p1765357125259899?thread_ts=1765220789.916229&cid=C08TUAH45NG
- Related MR (paused): !3474