fix: approximate tokens count using tiktoken

What does this merge request do and why?

This merge request (MR) replaces the current approach to counting tokens with tiktoken, OpenAI's token-counting library.

This MR is needed because the current token counter, ApproximateTokenCounter, can severely undercount in some test cases, creating "prompt is too long" errors.

Token Counting Strategy

The old ApproximateTokenCounter used a simple formula (len // 4 * 1.5) that worked decently for English text but yielded inaccurate counts for Unicode and emojis, with up to 60% undercounting in some cases (see issue).
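To illustrate the failure mode, here is a minimal sketch of the old heuristic (the function name is illustrative, not the actual class API). Because it counts characters rather than tokens, CJK text and emojis — where a single character often maps to one or more tokens — get badly undercounted:

```python
def approximate_token_count(text: str) -> int:
    # Old heuristic: roughly one token per 4 characters,
    # scaled by 1.5 as a safety margin.
    return int(len(text) // 4 * 1.5)

# 50 chars of CJK text; the OpenAI tokenizer reports ~30 tokens,
# but the character-based estimate yields only 18.
unicode_text = "你好世界!" * 10
print(approximate_token_count(unicode_text))  # 18 -> undercounts by ~40%
```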

This MR replaces it with tiktoken, but since tiktoken is slow on large strings (~17ms for 10KB), we use a hybrid approach inspired by stratified sampling.

The logic is as follows:

  • Strings up to 1500 chars: exact tiktoken counting (accuracy matters most here).
  • Strings over 1500 chars: sample 500 chars from the start, middle, and end, count those, and extrapolate.

The sampling works well for large content because token density tends to be consistent throughout a file. We sample from three positions to handle cases where the start looks different from the rest (like imports in code files).
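The strategy above can be sketched as follows. This is an illustrative sketch, not the MR's actual implementation: the thresholds match the description, but the function names are hypothetical, and the exact token counter is passed in as a parameter (in the real code it would be tiktoken's encoder) so the sketch stays library-agnostic:

```python
from typing import Callable

EXACT_LIMIT = 1500   # strings up to this length are counted exactly
SAMPLE_SIZE = 500    # chars sampled from start, middle, and end

def hybrid_count(text: str, exact_count: Callable[[str], int]) -> int:
    """Count tokens exactly for short strings; sample and extrapolate otherwise."""
    if len(text) <= EXACT_LIMIT:
        return exact_count(text)
    # Stratified sampling: count three 500-char windows, then
    # extrapolate by the sampled token density (tokens per character).
    mid = len(text) // 2
    samples = [
        text[:SAMPLE_SIZE],
        text[mid - SAMPLE_SIZE // 2 : mid + SAMPLE_SIZE // 2],
        text[-SAMPLE_SIZE:],
    ]
    sampled_chars = sum(len(s) for s in samples)
    sampled_tokens = sum(exact_count(s) for s in samples)
    return round(sampled_tokens / sampled_chars * len(text))
```

With uniform token density the extrapolation is exact, which is why the approach works well on homogeneous content like large files.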

Latency Benchmark

Below are the latency results for the old ApproximateTokenCounter, pure tiktoken, and the proposed approach, tiktoken with sampling:

| Test Case | ApproximateTokenCounter (ms) | tiktoken (ms) | tiktoken with sampling (this MR, ms) | Sampling speedup |
|---|---|---|---|---|
| Short text (50 chars) | 0.0001 | 0.0043 | 0.0047 | 0.9x |
| Medium text (1KB) | 0.0001 | 0.1936 | 0.1932 | 1.0x |
| Large text (10KB) | 0.0001 | 17.0386 | 0.1631 | 104.5x |
| Unicode heavy | 0.0001 | 0.1134 | 0.1129 | 1.0x |
| Emoji heavy | 0.0001 | 1.8347 | 1.8407 | 1.0x |
| JSON data | 0.0001 | 0.2024 | 0.1491 | 1.4x |

The proposed approach in this MR strikes a good middle ground between the speed of ApproximateTokenCounter and the accuracy of pure tiktoken.
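For anyone wanting to reproduce these latency figures locally, a rough timing harness looks like the sketch below (absolute numbers will vary by machine; `count` stands in for whichever counter is being measured):

```python
import timeit

def bench(count, text, runs=100):
    """Return the mean per-call latency of count(text) in milliseconds."""
    total_s = timeit.timeit(lambda: count(text), number=runs)
    return total_s / runs * 1000  # seconds -> ms per call

# Example: time a trivial character-based counter on ~10KB of text.
ms = bench(lambda t: len(t) // 4, "a" * 10_000, runs=10)
print(f"{ms:.4f} ms per call")
```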

Accuracy Benchmark

Below are the accuracy results for the old ApproximateTokenCounter, pure tiktoken, and the proposed approach, tiktoken with sampling:

| Test Case | tiktoken | ApproximateTokenCounter | tiktoken with sampling (this MR) |
|---|---|---|---|
| Short text (50 chars) | 13 | 16 (+3) | 13 (+0) |
| Medium text (1KB) | 128 | 384 (+256) | 128 (+0) |
| Large text (10KB) | 1280 | 3840 (+2560) | 1290 (+10) |
| Unicode heavy | 600 | 375 (-225) | 600 (+0) |
| Emoji heavy | 1600 | 375 (-1225) | 1600 (+0) |
| JSON data | 800 | 730 (-70) | 800 (+0) |

The proposed approach in this MR achieves significantly better accuracy than the old ApproximateTokenCounter, with only minimal deviation from the ideal case of pure tiktoken.

How to set up and validate locally

  1. Check out this merge request's branch.

  2. Ensure you comply with the Prerequisites.

  3. Install dependencies:

    poetry install
  4. Run the token counter tests:

    poetry run pytest tests/duo_workflow_service/token_counter/test_tiktoken_counter.py -v
  5. Verify token counting accuracy manually by running the following script:

    from duo_workflow_service.token_counter.tiktoken_counter import TikTokenCounter
    
    counter = TikTokenCounter("executor")
    
    # Test with Unicode; the OpenAI tokenizer reports 30 tokens for this string
    text = "你好世界!" * 10
    print(f"Unicode: {counter.count_string_content(text)} tokens")
    
    # Test with emojis; the OpenAI tokenizer reports 50 tokens for this string
    text = "🚀✨🎉" * 10
    print(f"Emojis: {counter.count_string_content(text)} tokens")
    
    # Test with plain English; the OpenAI tokenizer reports 8 tokens for this string
    text = "Hello, how are you doing today?"
    print(f"English: {counter.count_string_content(text)} tokens")

    Compare the results with what you get at https://platform.openai.com/tokenizer, selecting gpt-4o.

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
  • If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor or confirmed that changes are backward-compatible and don't break existing executor functionality.

Closes #1740

Edited by Fabrizio J. Piva
