fix: improve accuracy of tiktoken counter and make it more conservative

What does this merge request do and why?

This MR improves the accuracy of the existing tiktoken token counting strategy by taking n_samples equally spaced samples, each centered within its region of the string, so that the sampling is homogeneous across the whole string.
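For illustration, an equally spaced, centered layout places the sample centers at the midpoint of each region. A toy example (hypothetical numbers, not the MR's code):

```python
# Toy illustration: centers of n_samples equally spaced samples,
# each at the midpoint of its region of the string.
n_samples, length = 4, 1000
centers = [int((i + 0.5) * length / n_samples) for i in range(n_samples)]
print(centers)  # [125, 375, 625, 875]
```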

Token Counting Strategy

In this MR, we propose to use the trimmed mean strategy. This strategy removes extreme outliers before averaging. The logic works as follows:

  1. Sample the file: Take 30 equally-spaced samples of 300 characters each, centered within their respective regions
  2. Calculate density: For each sample, compute tokens per character using tiktoken
  3. Sort the densities: Order from lowest to highest
  4. Trim the extremes: Discard the bottom 10% and top 10% of samples (3 lowest + 3 highest with 30 samples)
  5. Average the middle 80%: Use the remaining 24 samples to calculate the average density
  6. Apply safety buffer: Multiply the result by 1.10, adding a 10% safety margin
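The six steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MR's actual implementation: `encode` stands in for any tokenizer callable (e.g. tiktoken's `Encoding.encode`), and the exact-counting cutoff is assumed to be `n_samples * sample_len` characters (9,000, i.e. roughly the 9KB threshold):

```python
import math

def estimate_tokens(text, encode, n_samples=30, sample_len=300, buffer=1.10):
    """Estimate the token count of `text` by trimmed-mean sampling.

    `encode` is any callable mapping a string to a token list.
    Names and defaults are assumptions mirroring the MR description.
    """
    # Small inputs: exact counting is cheap, no estimation needed.
    if len(text) <= n_samples * sample_len:
        return len(encode(text))

    # 1. Take n_samples equally spaced samples, each centered
    #    within its region of the string.
    region = len(text) / n_samples
    densities = []
    for i in range(n_samples):
        center = int((i + 0.5) * region)
        start = max(0, center - sample_len // 2)
        chunk = text[start:start + sample_len]
        # 2. Tokens per character for this sample.
        densities.append(len(encode(chunk)) / len(chunk))

    # 3-5. Trimmed mean: sort, drop the bottom and top 10%,
    #      average the remaining middle 80%.
    densities.sort()
    trim = int(n_samples * 0.10)              # 3 samples when n_samples=30
    kept = densities[trim:len(densities) - trim]
    mean_density = sum(kept) / len(kept)

    # 6. Scale to the full length and apply the safety buffer.
    return math.ceil(mean_density * len(text) * buffer)
```

With a dummy tokenizer of one token per four characters, a 20,000-character string yields an estimate of roughly 5,500 tokens against an exact count of 5,000, i.e. the configured ~10% overestimate.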

Latency & Accuracy Benchmarks

Below you can find benchmarks for different test cases:

  • Approximate refers to the old ApproximateTokenCounter method,
  • Sampling refers to the method proposed in this MR, and
  • Pure Tiktoken refers to the ideal case of running tiktoken on the full string.

Takeaways:

  • Latency: The proposed approach is up to ~24x faster than pure tiktoken on 10KB files (~0.7ms vs ~17ms) and up to ~700x faster on 50KB files, because it encodes only 30 small samples instead of the entire content.
  • Accuracy: The old counter undercounted tokens on Unicode- and emoji-heavy text, causing "prompt is too long" errors. With the 10% buffer, the new sampling approach never undercounts in these benchmarks, overestimating by at most 10-15% on large files. Files under 9KB fall back to exact counting.
  • Safety: Adding a 10% buffer to the sampled count keeps the counter slightly overcounting (at least ~5% in the worst benchmarked cases), preventing the undercounting that causes "prompt is too long" errors. Compare the accuracy of the sampling strategy with a 10% safety buffer (Table 2), a 5% buffer (Table 3), and no buffer (Table 4).
==============================================================================================================
LATENCY BENCHMARK (Table 1)
==============================================================================================================
Test Case                           Approximate (ms)   Sampling (ms)      Pure Tiktoken (ms)   Sampling Speedup
--------------------------------------------------------------------------------------------------------------
Short text (50 chars)               0.0002             0.0072             0.0045               0.6x
Medium text (1KB)                   0.0001             0.1927             0.1912               1.0x
Large text (10KB)                   0.0001             0.7042             16.6883              23.7x
Unicode heavy                       0.0001             0.1118             0.1090               1.0x
Emoji heavy                         0.0001             1.7986             1.8072               1.0x
JSON data                           0.0002             0.1853             0.1883               1.0x
Code w/ dense imports (10KB)        0.0001             0.7680             0.8944               1.2x
Mixed density alternating (10KB)    0.0001             1.2977             30.0587              23.2x
JSON with Unicode values (10KB)     0.0001             0.6056             0.6062               1.0x
Minified JavaScript (10KB)          0.0001             0.8755             0.9575               1.1x
Log file with timestamps (10KB)     0.0001             0.8054             0.8047               1.0x
Sparse→Dense (10KB)                 0.0002             2.5226             52.9567              21.0x
Dense→Sparse (10KB)                 0.0002             1.0247             1.6565               1.6x
Dense middle only (10KB)            0.0002             0.9849             1.7882               1.8x
Random density spikes (10KB)        0.0001             0.9190             0.9754               1.1x
Very large mixed density (50KB)     0.0001             1.2670             901.3528             711.4x
Very large random spikes (50KB)     0.0002             0.9630             5.7190               5.9x
Very large sparse→dense (50KB)      0.0002             2.6002             1401.9423            539.2x
==============================================================================================================

==============================================================================================================
ACCURACY BENCHMARK (Table 2, difference from Pure Tiktoken & adding 10% safety buffer)
==============================================================================================================
Test Case                           Pure Tiktoken   Approximate          Sampling             Sampling %Err
--------------------------------------------------------------------------------------------------------------
Short text (50 chars)               13              16 (+3)              13 (+0)              +0.0%
Medium text (1KB)                   128             384 (+256)           128 (+0)             +0.0%
Large text (10KB)                   1280            3840 (+2560)         1426 (+146)          +11.4%
Unicode heavy                       600             375 (-225)           600 (+0)             +0.0%
Emoji heavy                         1600            375 (-1225)          1600 (+0)            +0.0%
JSON data                           800             730 (-70)            800 (+0)             +0.0%
Code w/ dense imports (10KB)        2827            3994 (+1167)         3125 (+298)          +10.5%
Mixed density alternating (10KB)    4430            3675 (-755)          4957 (+527)          +11.9%
JSON with Unicode values (10KB)     3254            2286 (-968)          3254 (+0)            +0.0%
Minified JavaScript (10KB)          4144            3840 (-304)          4593 (+449)          +10.8%
Log file with timestamps (10KB)     3448            3152 (-296)          3448 (+0)            +0.0%
Sparse→Dense (10KB)                 13312           5760 (-7552)         13908 (+596)         +4.5%
Dense→Sparse (10KB)                 8193            5760 (-2433)         8985 (+792)          +9.7%
Dense middle only (10KB)            10235           6398 (-3837)         11775 (+1540)        +15.0%
Random density spikes (10KB)        5935            3840 (-2095)         6553 (+618)          +10.4%
Very large mixed density (50KB)     23036           19110 (-3926)        25388 (+2352)        +10.2%
Very large random spikes (50KB)     29638           19200 (-10438)       32595 (+2957)        +10.0%
Very large sparse→dense (50KB)      66560           28800 (-37760)       69543 (+2983)        +4.5%
==============================================================================================================

==============================================================================================================
ACCURACY BENCHMARK (Table 3, difference from Pure Tiktoken & adding 5% safety buffer)
==============================================================================================================
Test Case                           Pure Tiktoken   Approximate          Sampling             Sampling %Err
--------------------------------------------------------------------------------------------------------------
Short text (50 chars)               13              16 (+3)              13 (+0)              +0.0%
Medium text (1KB)                   128             384 (+256)           128 (+0)             +0.0%
Large text (10KB)                   1280            3840 (+2560)         1361 (+81)           +6.3%
Unicode heavy                       600             375 (-225)           600 (+0)             +0.0%
Emoji heavy                         1600            375 (-1225)          1600 (+0)            +0.0%
JSON data                           800             730 (-70)            800 (+0)             +0.0%
Code w/ dense imports (10KB)        2827            3994 (+1167)         2983 (+156)          +5.5%
Mixed density alternating (10KB)    4430            3675 (-755)          4731 (+301)          +6.8%
JSON with Unicode values (10KB)     3254            2286 (-968)          3254 (+0)            +0.0%
Minified JavaScript (10KB)          4144            3840 (-304)          4384 (+240)          +5.8%
Log file with timestamps (10KB)     3448            3152 (-296)          3448 (+0)            +0.0%
Sparse→Dense (10KB)                 13312           5760 (-7552)         13276 (-36)          -0.3%
Dense→Sparse (10KB)                 8193            5760 (-2433)         8576 (+383)          +4.7%
Dense middle only (10KB)            10235           6398 (-3837)         11240 (+1005)        +9.8%
Random density spikes (10KB)        5935            3840 (-2095)         6255 (+320)          +5.4%
Very large mixed density (50KB)     23036           19110 (-3926)        24234 (+1198)        +5.2%
Very large random spikes (50KB)     29638           19200 (-10438)       31113 (+1475)        +5.0%
Very large sparse→dense (50KB)      66560           28800 (-37760)       66382 (-178)         -0.3%
==============================================================================================================

==============================================================================================================
ACCURACY BENCHMARK (Table 4, difference from Pure Tiktoken & no safety buffer)
==============================================================================================================
Test Case                           Pure Tiktoken   Approximate          Sampling             Sampling %Err
--------------------------------------------------------------------------------------------------------------
Short text (50 chars)               13              16 (+3)              13 (+0)              +0.0%
Medium text (1KB)                   128             384 (+256)           128 (+0)             +0.0%
Large text (10KB)                   1280            3840 (+2560)         1297 (+17)           +1.3%
Unicode heavy                       600             375 (-225)           600 (+0)             +0.0%
Emoji heavy                         1600            375 (-1225)          1600 (+0)            +0.0%
JSON data                           800             730 (-70)            800 (+0)             +0.0%
Code w/ dense imports (10KB)        2827            3994 (+1167)         2841 (+14)           +0.5%
Mixed density alternating (10KB)    4430            3675 (-755)          4506 (+76)           +1.7%
JSON with Unicode values (10KB)     3254            2286 (-968)          3254 (+0)            +0.0%
Minified JavaScript (10KB)          4144            3840 (-304)          4175 (+31)           +0.7%
Log file with timestamps (10KB)     3448            3152 (-296)          3448 (+0)            +0.0%
Sparse→Dense (10KB)                 13312           5760 (-7552)         12644 (-668)         -5.0%
Dense→Sparse (10KB)                 8193            5760 (-2433)         8168 (-25)           -0.3%
Dense middle only (10KB)            10235           6398 (-3837)         10705 (+470)         +4.6%
Random density spikes (10KB)        5935            3840 (-2095)         5957 (+22)           +0.4%
Very large mixed density (50KB)     23036           19110 (-3926)        23080 (+44)          +0.2%
Very large random spikes (50KB)     29638           19200 (-10438)       29632 (-6)           -0.0%
Very large sparse→dense (50KB)      66560           28800 (-37760)       63221 (-3339)        -5.0%
==============================================================================================================

How to set up and validate locally

  1. Check out this merge request's branch.

  2. Ensure you comply with the Prerequisites.

  3. Install dependencies:

    poetry install
  4. Run the token counter tests:

    poetry run pytest tests/duo_workflow_service/token_counter/test_tiktoken_counter.py -v
  5. Verify token counting accuracy manually by running the following script:

    from duo_workflow_service.token_counter.tiktoken_counter import TikTokenCounter
    
    counter = TikTokenCounter("executor")
    
    # Test with Unicode; the OpenAI tokenizer website reports 30 tokens for this string
    text = "你好世界!" * 10
    print(f"Unicode: {counter.count_string_content(text)} tokens")
    
    # Test with emojis; the OpenAI tokenizer website reports 50 tokens for this string
    text = "🚀✨🎉" * 10
    print(f"Emojis: {counter.count_string_content(text)} tokens")
    
    # Test with plain English; the OpenAI tokenizer website reports 8 tokens for this string
    text = "Hello, how are you doing today?"
    print(f"English: {counter.count_string_content(text)} tokens")

    Compare the results with what you get at https://platform.openai.com/tokenizer, selecting gpt-4o.

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
  • If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor or confirmed that changes are backward-compatible and don't break existing executor functionality.
Edited by Fabrizio J. Piva
