fix: approximate tokens count using tiktoken
## What does this merge request do and why?

This merge request (MR) replaces the current approach to counting tokens with tiktoken, OpenAI's token-counting library.

This MR is needed because the current token counter, ApproximateTokenCounter, can severely undercount in some test cases, producing "prompt is too long" errors.
## Token Counting Strategy

The old ApproximateTokenCounter used a simple formula (len // 4 * 1.5) that worked decently for English text but yielded inaccurate counts for Unicode and emojis, undercounting by up to 60% in some cases (see the parent issue).
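For reference, the old heuristic can be sketched as below. The function name here is illustrative, not the actual ApproximateTokenCounter API:

```python
def approximate_token_count(text: str) -> int:
    # Old heuristic: assume roughly 4 characters per token, then pad
    # with a 1.5x safety factor. This holds up for plain English but
    # badly undercounts Unicode and emoji, where a single character
    # can encode to several tokens.
    return int(len(text) // 4 * 1.5)
```

For a 1,000-character CJK string this returns 375, while tiktoken reports around 600 (see the accuracy table below), which is exactly the undercount this MR fixes.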
This MR replaces it with tiktoken, but since tiktoken is slow on large strings (~17ms for 10KB), we use a hybrid approach inspired by stratified sampling.
The logic is as follows:
- Strings up to 1500 chars: exact tiktoken counting (accuracy matters most here).
- Strings over 1500 chars: sample 500 chars from the start, middle, and end, count those, and extrapolate.
The sampling works well for large content because token density tends to be consistent throughout a file. We sample from three positions to handle cases where the start looks different from the rest (like imports in code files).
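The strategy above can be sketched as follows. `encode` stands in for tiktoken's `Encoding.encode` (e.g. `tiktoken.get_encoding("o200k_base").encode`), and the constant names are illustrative rather than the actual implementation:

```python
from typing import Callable, List

EXACT_LIMIT = 1500   # chars; below this, count exactly
SAMPLE_SIZE = 500    # chars sampled at start, middle, and end


def count_tokens(text: str, encode: Callable[[str], List[int]]) -> int:
    """Hybrid count: exact for short strings, stratified sampling for long ones."""
    if len(text) <= EXACT_LIMIT:
        # Exact tiktoken count where accuracy matters most.
        return len(encode(text))
    # Take three strata: the start, the middle, and the end of the string.
    mid = len(text) // 2
    half = SAMPLE_SIZE // 2
    samples = [
        text[:SAMPLE_SIZE],
        text[mid - half : mid + half],
        text[-SAMPLE_SIZE:],
    ]
    sampled_chars = sum(len(s) for s in samples)
    sampled_tokens = sum(len(encode(s)) for s in samples)
    # Extrapolate the sampled token density to the full string.
    return round(sampled_tokens / sampled_chars * len(text))
```

With `encode = tiktoken.get_encoding("o200k_base").encode`, a 10 KB string costs three 500-char encodes instead of one 10,000-char encode, which is where the ~100x speedup in the latency table comes from.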
## Latency Benchmark

Below are the latency results for the old ApproximateTokenCounter, pure tiktoken, and the proposed approach (tiktoken with sampling).
| Test Case | ApproximateTokenCounter (ms) | tiktoken (ms) | tiktoken with sampling (this MR, ms) | sampling speedup |
|---|---|---|---|---|
| Short text (50 chars) | 0.0001 | 0.0043 | 0.0047 | 0.9x |
| Medium text (1KB) | 0.0001 | 0.1936 | 0.1932 | 1.0x |
| Large text (10KB) | 0.0001 | 17.0386 | 0.1631 | 104.5x |
| Unicode heavy | 0.0001 | 0.1134 | 0.1129 | 1.0x |
| Emoji heavy | 0.0001 | 1.8347 | 1.8407 | 1.0x |
| JSON data | 0.0001 | 0.2024 | 0.1491 | 1.4x |
The proposed approach in this MR strikes a good middle ground between the speed of ApproximateTokenCounter and the cost of pure tiktoken.
## Accuracy Benchmark

Below are the accuracy results for the old ApproximateTokenCounter, pure tiktoken, and the proposed approach (tiktoken with sampling).
| Test Case | tiktoken | ApproximateTokenCounter | tiktoken with sampling (this MR) |
|---|---|---|---|
| Short text (50 chars) | 13 | 16 (+3) | 13 (+0) |
| Medium text (1KB) | 128 | 384 (+256) | 128 (+0) |
| Large text (10KB) | 1280 | 3840 (+2560) | 1290 (+10) |
| Unicode heavy | 600 | 375 (-225) | 600 (+0) |
| Emoji heavy | 1600 | 375 (-1225) | 1600 (+0) |
| JSON data | 800 | 730 (-70) | 800 (+0) |
The proposed approach in this MR achieves significantly better accuracy than the old ApproximateTokenCounter, with minimal deviation from the ideal case of pure tiktoken.
## References
- Parent issue: #1740
- Stratified sampling: https://en.wikipedia.org/wiki/Stratified_sampling
## How to set up and validate locally
- Check out this merge request's branch.
- Ensure you comply with the Prerequisites.
- Install dependencies:

  ```shell
  poetry install
  ```

- Run the token counter tests:

  ```shell
  poetry run pytest tests/duo_workflow_service/token_counter/test_tiktoken_counter.py -v
  ```

- Verify token counting accuracy manually by running the following script:

  ```python
  from duo_workflow_service.token_counter.tiktoken_counter import TikTokenCounter

  counter = TikTokenCounter("executor")

  # Test with Unicode, should return 30 tokens in the website
  text = "你好世界!" * 10
  print(f"Unicode: {counter.count_string_content(text)} tokens")

  # Test with emojis, should return 50 tokens in the website
  text = "🚀✨🎉" * 10
  print(f"Emojis: {counter.count_string_content(text)} tokens")

  # Test with plain English, should return 8 tokens in the website
  text = "Hello, how are you doing today?"
  print(f"English: {counter.count_string_content(text)} tokens")
  ```

  Compare the results with what you get in https://platform.openai.com/tokenizer, selecting `gpt-4o`.
## Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
- If this change requires executor implementation: verified that issues/MRs exist for both Go executor and Node executor or confirmed that changes are backward-compatible and don't break existing executor functionality.
Closes #1740