Spike: Investigate character limits for Duo Chat

A customer asked what happens when a merge request (MR) contains too many files to fit in the context window.

Lesley already found that, for some reason, the character limit for merge request-related questions in Duo Chat is 96,000. That seems low, because the token limit for Claude 3.5 Sonnet is 200,000. (There are usually about 4 characters per token.)
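A quick back-of-the-envelope check of why 96,000 characters seems low. It uses only the 4-characters-per-token rule of thumb above, which is a rough heuristic and not an exact tokenizer count:

```python
# Rough estimate using the ~4 characters-per-token rule of thumb.
CHARS_PER_TOKEN = 4
MR_CHAR_LIMIT = 96_000        # MR-related character limit found in Duo Chat
MODEL_TOKEN_LIMIT = 200_000   # Claude 3.5 Sonnet token limit

mr_tokens = MR_CHAR_LIMIT // CHARS_PER_TOKEN        # about 24,000 tokens
share_of_window = mr_tokens / MODEL_TOKEN_LIMIT     # fraction of the model's limit

print(f"{mr_tokens:,} tokens, {share_of_window:.0%} of the token limit")
# prints "24,000 tokens, 12% of the token limit"
```

In other words, under this estimate the MR limit uses only about an eighth of what the model can accept.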

This also raised broader questions about how truncation works in Duo Chat in general.

Acceptance criteria

The following should be recorded:

What is the character limit for MR, issue, epic, GitLab Documentation, and general questions on Duo Chat?
Why do we have those limits?
What happens if the limit is exceeded? For example, during a conversation, does Duo Chat truncate the newest or the oldest messages? In other words, is there a point in the conversation where Duo Chat keeps answering old questions because the newer ones are cut off?
Do we track how often we truncate a conversation or a question? Apparently not, according to @nateweinshenker's comment in https://gitlab.slack.com/archives/C051K31F30R/p1721318066274069?thread_ts=1721072092.059389&cid=C051K31F30R. (Related issues: #472681 (closed) and gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#554)
Report the findings to @tlinz so we can decide if any adjustments are needed.

Dev Note

To test what happens when a limit is exceeded, hardcode a small limit so the effects are easier to observe.

Results

Character Limits

Per-message Limit (outdated: removed by this MR)

There was a 400,000-character limit for any single message in Duo Chat (code).

Total Input Limit

There is a limit of 200,000 tokens (about 680k characters) on the total input to the model. Unlike the per-message limit above, this limit:

Includes the conversation history
Comes directly from Anthropic
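Judging by the error message in the results table further down, exceeding the total input limit appears to reject the request rather than truncate old or new messages. The following is a minimal sketch of that behavior, not GitLab's implementation: `estimate_tokens`, `check_total_input`, and the 4-characters-per-token estimate are all hypothetical stand-ins (the real count comes from Anthropic's tokenizer).

```python
from typing import List, Optional

# Hypothetical names for illustration; this is not GitLab's actual code.
MODEL_TOKEN_LIMIT = 200_000  # total input limit from Anthropic
CHARS_PER_TOKEN = 4          # crude heuristic; a real check would use the tokenizer

TOO_MANY_PROMPTS = (
    "I'm sorry, you've entered too many prompts. "
    "Please run `/clear` or `/reset` before asking the next question."
)

def estimate_tokens(text: str) -> int:
    # Ceiling division: any non-empty text costs at least one token.
    return -(-len(text) // CHARS_PER_TOKEN)

def check_total_input(history: List[str], new_message: str,
                      limit: int = MODEL_TOKEN_LIMIT) -> Optional[str]:
    """Return the error message if history plus the new message exceed the limit, else None."""
    # The whole conversation history counts toward the limit, not just the new message.
    total = sum(estimate_tokens(m) for m in history) + estimate_tokens(new_message)
    if total > limit:
        # Nothing is truncated here: the request is rejected outright.
        return TOO_MANY_PROMPTS
    return None

# Dev-note style experiment: a tiny hardcoded limit makes the failure easy to trigger.
print(check_total_input(["a" * 36], "b" * 4, limit=10))  # None (9 + 1 = 10 tokens)
print(check_total_input(["a" * 40], "b" * 8, limit=10))  # error message (10 + 2 = 12 tokens)
```

Because the whole history is counted, a long conversation can hit this limit even when the newest question is short, which matches the advice to run `/clear` or `/reset`.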

Serialized Content Limits

Serialized content (issues, epics, MRs) has a limit of 192,000 characters (code). For merge requests, this is split into 96,000 characters for the diff and 96,000 characters for comments (code).

This means that if an issue, epic, MR, or other resource has content above this limit, the content is cut off from the end. The user gets no indication that this has happened.
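A minimal sketch of silent end-truncation, useful for the small-limit experiment in the dev note. The function names are hypothetical, and treating the MR diff and comments as two independently capped parts is an assumption based on the 96,000/96,000 split described above:

```python
from typing import Tuple

# Hypothetical sketch; names are illustrative, not GitLab's actual code.
SERIALIZED_CHAR_LIMIT = 192_000   # issues, epics, and other serialized resources
MR_DIFF_CHAR_LIMIT = 96_000       # merge request diff share
MR_COMMENTS_CHAR_LIMIT = 96_000   # merge request comments share

def truncate_from_end(text: str, limit: int) -> str:
    # Keep the first `limit` characters and silently drop the rest;
    # the caller gets no signal that anything was removed.
    return text[:limit]

def serialize_merge_request(diff: str, comments: str) -> Tuple[str, str]:
    # Assumption: diff and comments are capped independently,
    # so a very large diff cannot use up the comments' share.
    return (truncate_from_end(diff, MR_DIFF_CHAR_LIMIT),
            truncate_from_end(comments, MR_COMMENTS_CHAR_LIMIT))

diff, comments = serialize_merge_request("d" * 100_000, "c" * 10)
print(len(diff), len(comments))  # 96000 10
```

With a tiny hardcoded limit, `truncate_from_end("abcdef", 3)` returns `"abc"`: the tail is gone and nothing in the result says so.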

This is based on the assumption that Anthropic's limit is 100k tokens (code). That assumption is outdated, and the base can safely be raised to 200k. However, some would still argue that it's best not to use up almost the entire limit with a single serialized resource.

The exact math is:

Total token limit: 100,000 tokens
Input token limit: 80,000 tokens (100,000 × 0.8)
- Uses 80% of the total limit, leaving a safety buffer
Maximum characters: 320,000 characters (80,000 tokens × 4 chars/token)
- Based on estimate of 4 characters per token
Resource serialization: 192,000 characters (320,000 × 0.6)
- Limits serialized context to 60% of the maximum characters to leave room for the prompt
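The derivation above can be reproduced with integer arithmetic. This sketch only re-runs the stated ratios (80%, 4 characters per token, 60%); the function name is hypothetical and the 200k result simply shows what the same formula would give if the base were raised, not a decided value:

```python
def serialized_char_limit(total_token_limit: int,
                          input_share_pct: int = 80,
                          chars_per_token: int = 4,
                          resource_share_pct: int = 60) -> int:
    # 80% of the total token limit, keeping a safety buffer
    input_tokens = total_token_limit * input_share_pct // 100
    # Convert tokens to characters with the ~4 chars/token estimate
    max_chars = input_tokens * chars_per_token
    # 60% of the characters go to the serialized resource, the rest to the prompt
    return max_chars * resource_share_pct // 100

print(serialized_char_limit(100_000))  # 192000 (current 100k assumption)
print(serialized_char_limit(200_000))  # 384000 (same ratios on a 200k base)
```

Doubling the base doubles every step, so the 96,000-character MR shares would also double if the split stayed even.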

What Happens When Limits Are Exceeded

Limit Type	Result
Per-message	User gets an 'A9999' error
Total input	User receives: "I'm sorry, you've entered too many prompts. Please run `/clear` or `/reset` before asking the next question."
Serialized content	Content is truncated from the end, with no indication to the user.

Only the 'total input' and 'serialized content' limits still exist.

Tracking

Currently, none of these limits is specifically tracked. However, total input limit exceptions can be monitored through Kibana logs.

Other work and discussions worth noting:

Edited Nov 14, 2024 by Lesley Razzaghian