Token counter undercounts in several test cases, causing "prompt is too long" errors

Problem to solve

As a Duo Workflow user, I want the conversation trimmer to accurately count tokens, so I can avoid "prompt is too long" errors during my workflow sessions.

The ApproximateTokenCounter undercounts tokens in several scenarios, causing the trimmer to believe there's more context window space than actually exists. This results in prompts that exceed the model's limit.

Root Cause

The token counting formula len(content) // 4 * 1.5 assumes ~4 characters per token, then adds a 50% buffer. This breaks down for the cases below (see the sketch after this list):

  • Unicode/non-ASCII content: Chinese, Japanese, emojis tokenize into many more tokens than their character count suggests
  • Short messages: Per-message overhead (role markers, structure) isn't accounted for
  • JSON/punctuation-heavy content: Special characters tokenize less efficiently than plain text
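A minimal sketch of the mismatch, assuming tiktoken's cl100k_base encoding as the reference tokenizer. The sample strings are illustrative only, not the exact inputs used for the table in "Further details":

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def approximate_count(content: str) -> int:
    # Current heuristic: ~4 characters per token, plus a 50% buffer.
    return int(len(content) // 4 * 1.5)

samples = {
    "ascii prose": "The quick brown fox jumps over the lazy dog. " * 50,
    "cjk text": "中文" * 1000,
    "emoji heavy": "🚀🔥✨ great job! " * 100,
}

for label, text in samples.items():
    approx = approximate_count(text)
    actual = len(enc.encode(text))
    print(f"{label}: approx={approx} tiktoken={actual} "
          f"error={(approx - actual) / actual:+.1%}")
```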

Additional Bug

The _pretrim_large_messages function in trimmer.py calls count_tokens([message]) without passing include_tool_tokens=False, causing it to add ~5,650 tool tokens to every single-message check. This makes nearly all messages appear to exceed the single-message limit, triggering unnecessary placeholder replacements.

Proposal

Replace the current ApproximateTokenCounter with tiktoken for more accurate token counting. It's OpenAI's official tokenizer and accurately handles the content-level edge cases (Unicode, punctuation-heavy text) that our approximation misses; per-message overhead can be added explicitly on top of its counts.
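A rough sketch of what the replacement could look like. The class name, message shape, overhead constant, and _tool_tokens hook are assumptions for illustration; the real counter would need to match the existing count_tokens interface, including the tool-token handling discussed below:

```python
import tiktoken


class TikTokenCounter:
    """Token counter backed by tiktoken instead of the chars // 4 heuristic."""

    # Rough allowance for role markers and message structure per message;
    # an assumed value, to be tuned against the target model.
    MESSAGE_OVERHEAD = 4

    def __init__(self, encoding_name: str = "cl100k_base"):
        self._encoding = tiktoken.get_encoding(encoding_name)

    def count_tokens(self, messages: list[dict], include_tool_tokens: bool = True) -> int:
        total = 0
        for message in messages:
            total += self.MESSAGE_OVERHEAD
            total += len(self._encoding.encode(message.get("content") or ""))
        if include_tool_tokens:
            # Placeholder: in the real trimmer this would count the serialized
            # tool definitions (~5,650 tokens today).
            total += self._tool_tokens()
        return total

    def _tool_tokens(self) -> int:
        return 0
```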

Also fix the _pretrim_large_messages bug by passing include_tool_tokens=False when checking individual message sizes.
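A before/after sketch of the call site; the surrounding names (token_counter, message) are hypothetical and the actual _pretrim_large_messages code in trimmer.py may differ:

```python
# Before: the ~5,650 tool-definition tokens are counted against every message,
# so nearly every message appears to exceed the single-message limit.
message_tokens = token_counter.count_tokens([message])

# After: size each message on its own, without the shared tool definitions.
message_tokens = token_counter.count_tokens([message], include_tool_tokens=False)
```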

Further details

Failure Cases

Tested against tiktoken (OpenAI's actual tokenizer):

| Content Type | ApproximateTokenCounter | tiktoken | Error |
|---|---|---|---|
| Severe Unicode ("中文" * 1000) | 757 | 1,000 | -24.3% |
| Short messages | 10 | 27 | -63.0% |
| Unicode/emoji heavy | 271 | 564 | -52.0% |
| JSON heavy | 423 | 453 | -6.6% |