fix: improve accuracy of tiktoken counter and make it more conservative
## What does this merge request do and why?
This MR improves the accuracy of the existing tiktoken token counting strategy by taking `n_samples` equally spaced samples across the string, each centered within its region, so that sampling is homogeneous over the whole content.
### Token Counting Strategy
In this MR, we propose a trimmed mean strategy, which removes extreme outliers before averaging. The logic works as follows (a minimal sketch follows the list):

- Sample the file: take 30 equally spaced samples of 300 characters each, centered within their respective regions.
- Calculate density: for each sample, compute tokens per character using `tiktoken`.
- Sort the densities: order them from lowest to highest.
- Trim the extremes: discard the bottom 10% and top 10% of samples (the 3 lowest and 3 highest with 30 samples).
- Average the middle 80%: use the remaining 24 samples to calculate the average density.
- Apply a safety buffer: multiply the result by 1.10 to add a 10% safety margin.
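Below is a minimal sketch of this strategy, assuming the `tiktoken` library; the `o200k_base` encoding, the `estimate_tokens` name, and the constants are illustrative assumptions, not the MR's actual API:

```python
import tiktoken

# Illustrative constants mirroring the description above (not the MR's actual names).
N_SAMPLES = 30        # number of equally spaced samples
SAMPLE_SIZE = 300     # characters per sample
TRIM_RATIO = 0.10     # fraction trimmed from each end
SAFETY_BUFFER = 1.10  # 10% safety margin


def estimate_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Estimate the token count of `text` via trimmed-mean density sampling."""
    enc = tiktoken.get_encoding(encoding_name)

    # Small inputs (under 30 * 300 = 9,000 chars): encode exactly, no sampling.
    if len(text) <= N_SAMPLES * SAMPLE_SIZE:
        return len(enc.encode(text))

    # Take N_SAMPLES equally spaced samples, each centered within its region.
    region = len(text) / N_SAMPLES
    densities = []
    for i in range(N_SAMPLES):
        center = int(i * region + region / 2)
        start = max(0, center - SAMPLE_SIZE // 2)
        sample = text[start:start + SAMPLE_SIZE]
        densities.append(len(enc.encode(sample)) / len(sample))

    # Trimmed mean: sort, then drop the lowest and highest 10% of densities.
    densities.sort()
    k = int(N_SAMPLES * TRIM_RATIO)  # 3 samples on each end with 30 samples
    trimmed = densities[k:N_SAMPLES - k]
    avg_density = sum(trimmed) / len(trimmed)

    # Extrapolate the average density to the full string and apply the buffer.
    return int(len(text) * avg_density * SAFETY_BUFFER)
```

Because at most 30 × 300 = 9,000 characters are ever encoded, the sampling cost stays roughly constant as files grow, which is where the large speedups on the 50KB inputs in Table 1 come from.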
### Latency & Accuracy Benchmarks

Below you can find benchmarks for different test cases:

- `Approximate` refers to the old `ApproximateTokenCounter` method,
- `Sampling` refers to the method proposed in this MR, and
- `Pure Tiktoken` refers to the ideal case of running `tiktoken` over the full string.
Takeaways:

- Latency: the proposed approach is up to ~24x faster than pure `tiktoken` on 10KB files (~0.7ms vs ~17ms) and up to ~711x faster on 50KB files, because it encodes only 30 small samples instead of the entire content (a minimal timing sketch follows this list).
- Accuracy: the old counter undercounted tokens on Unicode- and emoji-heavy text, causing `"prompt is too long"` errors. With the 10% buffer, the new sampling approach never undercounts in these benchmarks, with a safe 10-15% overestimate on large files. Files under 9KB use exact counting.
- Safety: adding a 10% buffer after the token count is a healthy way to keep the counter slightly overcounting (around +5% even in the worst cases), preventing the undercounting that can cause `prompt is too long` errors. For example, with no buffer the Sparse→Dense (10KB) case underestimates by 5.0% (12,644 vs 13,312 tokens), while the 10% buffer turns this into a +4.5% overestimate. Feel free to compare the accuracy of the sampling strategy with a 10% safety buffer (Table 2) vs a 5% buffer (Table 3) vs no buffer (Table 4).
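As a rough way to reproduce the latency comparison, here is a hypothetical micro-benchmark (not the MR's actual benchmark harness); it reuses the `estimate_tokens` sketch above, and absolute timings will vary by machine:

```python
import time

import tiktoken

enc = tiktoken.get_encoding("o200k_base")
text = "Hello, how are you doing today? " * 320  # roughly 10KB of plain text

# Time exact counting over the full string.
start = time.perf_counter()
exact = len(enc.encode(text))
pure_ms = (time.perf_counter() - start) * 1000

# Time the sampling estimate (estimate_tokens from the sketch above).
start = time.perf_counter()
estimate = estimate_tokens(text)
sampling_ms = (time.perf_counter() - start) * 1000

print(f"Pure tiktoken: {exact} tokens in {pure_ms:.4f} ms")
print(f"Sampling:      {estimate} tokens in {sampling_ms:.4f} ms")
```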
**Latency Benchmark (Table 1)**

| Test Case | Approximate (ms) | Sampling (ms) | Pure Tiktoken (ms) | Sampling Speedup |
|---|---:|---:|---:|---:|
| Short text (50 chars) | 0.0002 | 0.0072 | 0.0045 | 0.6x |
| Medium text (1KB) | 0.0001 | 0.1927 | 0.1912 | 1.0x |
| Large text (10KB) | 0.0001 | 0.7042 | 16.6883 | 23.7x |
| Unicode heavy | 0.0001 | 0.1118 | 0.1090 | 1.0x |
| Emoji heavy | 0.0001 | 1.7986 | 1.8072 | 1.0x |
| JSON data | 0.0002 | 0.1853 | 0.1883 | 1.0x |
| Code w/ dense imports (10KB) | 0.0001 | 0.7680 | 0.8944 | 1.2x |
| Mixed density alternating (10KB) | 0.0001 | 1.2977 | 30.0587 | 23.2x |
| JSON with Unicode values (10KB) | 0.0001 | 0.6056 | 0.6062 | 1.0x |
| Minified JavaScript (10KB) | 0.0001 | 0.8755 | 0.9575 | 1.1x |
| Log file with timestamps (10KB) | 0.0001 | 0.8054 | 0.8047 | 1.0x |
| Sparse→Dense (10KB) | 0.0002 | 2.5226 | 52.9567 | 21.0x |
| Dense→Sparse (10KB) | 0.0002 | 1.0247 | 1.6565 | 1.6x |
| Dense middle only (10KB) | 0.0002 | 0.9849 | 1.7882 | 1.8x |
| Random density spikes (10KB) | 0.0001 | 0.9190 | 0.9754 | 1.1x |
| Very large mixed density (50KB) | 0.0001 | 1.2670 | 901.3528 | 711.4x |
| Very large random spikes (50KB) | 0.0002 | 0.9630 | 5.7190 | 5.9x |
| Very large sparse→dense (50KB) | 0.0002 | 2.6002 | 1401.9423 | 539.2x |
**Accuracy Benchmark (Table 2: difference from Pure Tiktoken, with a 10% safety buffer)**

| Test Case | Pure Tiktoken | Approximate | Sampling | Sampling %Err |
|---|---:|---:|---:|---:|
| Short text (50 chars) | 13 | 16 (+3) | 13 (+0) | +0.0% |
| Medium text (1KB) | 128 | 384 (+256) | 128 (+0) | +0.0% |
| Large text (10KB) | 1280 | 3840 (+2560) | 1426 (+146) | +11.4% |
| Unicode heavy | 600 | 375 (-225) | 600 (+0) | +0.0% |
| Emoji heavy | 1600 | 375 (-1225) | 1600 (+0) | +0.0% |
| JSON data | 800 | 730 (-70) | 800 (+0) | +0.0% |
| Code w/ dense imports (10KB) | 2827 | 3994 (+1167) | 3125 (+298) | +10.5% |
| Mixed density alternating (10KB) | 4430 | 3675 (-755) | 4957 (+527) | +11.9% |
| JSON with Unicode values (10KB) | 3254 | 2286 (-968) | 3254 (+0) | +0.0% |
| Minified JavaScript (10KB) | 4144 | 3840 (-304) | 4593 (+449) | +10.8% |
| Log file with timestamps (10KB) | 3448 | 3152 (-296) | 3448 (+0) | +0.0% |
| Sparse→Dense (10KB) | 13312 | 5760 (-7552) | 13908 (+596) | +4.5% |
| Dense→Sparse (10KB) | 8193 | 5760 (-2433) | 8985 (+792) | +9.7% |
| Dense middle only (10KB) | 10235 | 6398 (-3837) | 11775 (+1540) | +15.0% |
| Random density spikes (10KB) | 5935 | 3840 (-2095) | 6553 (+618) | +10.4% |
| Very large mixed density (50KB) | 23036 | 19110 (-3926) | 25388 (+2352) | +10.2% |
| Very large random spikes (50KB) | 29638 | 19200 (-10438) | 32595 (+2957) | +10.0% |
| Very large sparse→dense (50KB) | 66560 | 28800 (-37760) | 69543 (+2983) | +4.5% |
**Accuracy Benchmark (Table 3: difference from Pure Tiktoken, with a 5% safety buffer)**

| Test Case | Pure Tiktoken | Approximate | Sampling | Sampling %Err |
|---|---:|---:|---:|---:|
| Short text (50 chars) | 13 | 16 (+3) | 13 (+0) | +0.0% |
| Medium text (1KB) | 128 | 384 (+256) | 128 (+0) | +0.0% |
| Large text (10KB) | 1280 | 3840 (+2560) | 1361 (+81) | +6.3% |
| Unicode heavy | 600 | 375 (-225) | 600 (+0) | +0.0% |
| Emoji heavy | 1600 | 375 (-1225) | 1600 (+0) | +0.0% |
| JSON data | 800 | 730 (-70) | 800 (+0) | +0.0% |
| Code w/ dense imports (10KB) | 2827 | 3994 (+1167) | 2983 (+156) | +5.5% |
| Mixed density alternating (10KB) | 4430 | 3675 (-755) | 4731 (+301) | +6.8% |
| JSON with Unicode values (10KB) | 3254 | 2286 (-968) | 3254 (+0) | +0.0% |
| Minified JavaScript (10KB) | 4144 | 3840 (-304) | 4384 (+240) | +5.8% |
| Log file with timestamps (10KB) | 3448 | 3152 (-296) | 3448 (+0) | +0.0% |
| Sparse→Dense (10KB) | 13312 | 5760 (-7552) | 13276 (-36) | -0.3% |
| Dense→Sparse (10KB) | 8193 | 5760 (-2433) | 8576 (+383) | +4.7% |
| Dense middle only (10KB) | 10235 | 6398 (-3837) | 11240 (+1005) | +9.8% |
| Random density spikes (10KB) | 5935 | 3840 (-2095) | 6255 (+320) | +5.4% |
| Very large mixed density (50KB) | 23036 | 19110 (-3926) | 24234 (+1198) | +5.2% |
| Very large random spikes (50KB) | 29638 | 19200 (-10438) | 31113 (+1475) | +5.0% |
| Very large sparse→dense (50KB) | 66560 | 28800 (-37760) | 66382 (-178) | -0.3% |
**Accuracy Benchmark (Table 4: difference from Pure Tiktoken, no safety buffer)**

| Test Case | Pure Tiktoken | Approximate | Sampling | Sampling %Err |
|---|---:|---:|---:|---:|
| Short text (50 chars) | 13 | 16 (+3) | 13 (+0) | +0.0% |
| Medium text (1KB) | 128 | 384 (+256) | 128 (+0) | +0.0% |
| Large text (10KB) | 1280 | 3840 (+2560) | 1297 (+17) | +1.3% |
| Unicode heavy | 600 | 375 (-225) | 600 (+0) | +0.0% |
| Emoji heavy | 1600 | 375 (-1225) | 1600 (+0) | +0.0% |
| JSON data | 800 | 730 (-70) | 800 (+0) | +0.0% |
| Code w/ dense imports (10KB) | 2827 | 3994 (+1167) | 2841 (+14) | +0.5% |
| Mixed density alternating (10KB) | 4430 | 3675 (-755) | 4506 (+76) | +1.7% |
| JSON with Unicode values (10KB) | 3254 | 2286 (-968) | 3254 (+0) | +0.0% |
| Minified JavaScript (10KB) | 4144 | 3840 (-304) | 4175 (+31) | +0.7% |
| Log file with timestamps (10KB) | 3448 | 3152 (-296) | 3448 (+0) | +0.0% |
| Sparse→Dense (10KB) | 13312 | 5760 (-7552) | 12644 (-668) | -5.0% |
| Dense→Sparse (10KB) | 8193 | 5760 (-2433) | 8168 (-25) | -0.3% |
| Dense middle only (10KB) | 10235 | 6398 (-3837) | 10705 (+470) | +4.6% |
| Random density spikes (10KB) | 5935 | 3840 (-2095) | 5957 (+22) | +0.4% |
| Very large mixed density (50KB) | 23036 | 19110 (-3926) | 23080 (+44) | +0.2% |
| Very large random spikes (50KB) | 29638 | 19200 (-10438) | 29632 (-6) | -0.0% |
| Very large sparse→dense (50KB) | 66560 | 28800 (-37760) | 63221 (-3339) | -5.0% |
## References
- Parent issue: #1740
## How to set up and validate locally

- Check out this merge request's branch.
- Ensure you comply with the Prerequisites.
- Install dependencies: `poetry install`
- Run the token counter tests: `poetry run pytest tests/duo_workflow_service/token_counter/test_tiktoken_counter.py -v`
- Verify token counting accuracy manually by running the following script, then compare the results with what you get at https://platform.openai.com/tokenizer, selecting `gpt-4o`:

```python
from duo_workflow_service.token_counter.tiktoken_counter import TikTokenCounter

counter = TikTokenCounter("executor")

# Test with Unicode; should return 30 tokens on the website
text = "你好世界!" * 10
print(f"Unicode: {counter.count_string_content(text)} tokens")

# Test with emojis; should return 50 tokens on the website
text = "🚀✨🎉" * 10
print(f"Emojis: {counter.count_string_content(text)} tokens")

# Test with plain English; should return 8 tokens on the website
text = "Hello, how are you doing today?"
print(f"English: {counter.count_string_content(text)} tokens")
```
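These strings are well under the 9KB exact-counting threshold, so assuming the counter tokenizes with the same encoding as `gpt-4o`, the printed counts should match the website:

```
Unicode: 30 tokens
Emojis: 50 tokens
English: 8 tokens
```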
## Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
- If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor or confirmed that changes are backward-compatible and don't break existing executor functionality.