Claude 3.0 requires use of the Messages API. When I send a request to any of the Claude 3 models using our existing AI Gateway logic, I get this error:
{'type': 'invalid_request_error', 'message': '\"claude-3-sonnet-20240229\" is not supported on this API. Please use the Messages API instead.'}
We have an existing issue for upgrading the AI Gateway to use the Anthropic Messages API, but today we still use the Text Completions API. So it seems the move to Claude 3 will require more work than the Claude 2.0 -> 2.1 upgrade.
I will dig in a bit more and report back on how much work it appears to be. The migration guide makes it sound pretty straightforward, but I am not sure what mechanisms we have in the AI Gateway for making changes.
Confirmed via Anthropic docs that we need to use messages API:
Our latest and most powerful Claude 3 models (Haiku, Sonnet, and Opus) can only be called via the Messages API. By upgrading, you'll be able to take advantage of their enhanced performance and capabilities.
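To make the migration concrete, here is a minimal sketch (the helper name and payload defaults are assumptions, not our actual implementation) of converting a legacy Text Completions prompt string into the payload shape the Messages API expects. The `\n\nHuman:` / `\n\nAssistant:` markers are the turn delimiters the old endpoint required:

```python
import re

def completion_prompt_to_messages(prompt: str) -> dict:
    """Convert a legacy Text Completions prompt ("\n\nHuman: ...\n\nAssistant:")
    into a Messages API payload. Hypothetical helper, for illustration only."""
    messages = []
    # Split on the Human:/Assistant: turn markers used by the old prompt format;
    # the tempered pattern captures each turn's text up to the next marker.
    pattern = r"\n\n(Human|Assistant): ?((?:(?!\n\n(?:Human|Assistant):).)*)"
    for role_tag, text in re.findall(pattern, prompt, flags=re.S):
        text = text.strip()
        if text:  # the trailing empty "\n\nAssistant:" turn is dropped
            role = "user" if role_tag == "Human" else "assistant"
            messages.append({"role": role, "content": text})
    # model and max_tokens values here are placeholders
    return {
        "model": "claude-3-sonnet-20240229",
        "max_tokens": 1024,
        "messages": messages,
    }
```

The Messages payload also moves any system prompt out of the turn list into a top-level `system` field, which is worth keeping in mind given our prompts are currently hard-coded in the completions format.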
Once this MR is merged, the next step would be to update the monolith to, behind a feature flag, send the Claude 3 model name as a header (similar to what we did for Claude 2.1) so that we could opt specific users into Claude 3 for local testing. We should also work with Model Validation to see how we could run the full prompt library test suite for Claude 3 to compare stats.
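As a rough illustration of the header-based opt-in described here (the header name `X-Gitlab-Model`, the fallback logic, and the model ID strings are assumptions for this sketch, not the actual implementation):

```python
# Hypothetical sketch: the monolith sends the requested model as a header
# (behind a feature flag), and the AI Gateway validates it against an
# allow-list, falling back to the current default otherwise.
SUPPORTED_CLAUDE_3_MODELS = {
    "claude-3-haiku-20240307",
    "claude-3-sonnet-20240229",
    "claude-3-opus-20240229",
}
DEFAULT_MODEL = "claude-2.1"

def resolve_model(headers: dict) -> str:
    """Pick the model from a request header; fall back to the default."""
    requested = headers.get("X-Gitlab-Model", DEFAULT_MODEL)
    return requested if requested in SUPPORTED_CLAUDE_3_MODELS else DEFAULT_MODEL
```

This mirrors the Claude 2.1 rollout pattern: users not in the feature flag never send the header and keep getting the default model.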
That's awesome @jessieay! Thank you so much for your great work here!
Once this MR is merged, the next step would be to update the monolith to, behind a feature flag, send the Claude 3 model name as a header (similar to what we did for Claude 2.1) so that we could opt specific users into Claude 3 for local testing.
This is outstanding!
We should also work with Model Validation to see how we could run the full prompt library test suite for Claude 3 to compare stats.
Do you think we could scope the full prompt library test suite to a single user that we toggle a feature flag on for these types of scenarios?
Would you be keen to take the lead on running the full prompt library test suite for Claude 3 to compare stats? We are eager to get Claude 3 evaluated ASAP so we can migrate sooner rather than later if the gains are significant.
The goal is to be able to locally switch between the Claude 3 family of models and Claude 2.1 for Duo Chat, and to measure any impact on the subset of data where the control is the current Duo Chat setup. Super excited for this. More details in the issue.
Further, if any support is needed on the experiment, @tle_gitlab would be able to help from the Model Validation team.
@oregand yes I already have a branch with most of this work done from when I was working on this before Summit, will update it when I am back at work on Wednesday (tomorrow). Excited to give this a try!
The branch has the ability to use any of the three Claude 3 models behind a feature flag. I confirmed with @tle_gitlab that the production model eval daily test runs for Duo Chat use a GitLab PAT belonging to a test user.
So, because the full model eval suite only runs on prod, to compare results for each model we can either:
1. Do a separate/additional full prod test run with a different test user, and flag each test user into the feature flag for each model.
2. Use the existing daily prod test run and time the feature flag updates so that the existing test user is using a different Claude 3 model each day for three days.
The first approach is more systematic and might provide a blueprint for how to test model changes going forward. The second approach is more manual but requires less upfront investment. I will make sure to sync with Model Validation on which way to go.
@oregand + @jessieay + @bcardoso: How about focusing purely on Sonnet as the main drop-in replacement model as a starting point? This is also something Anthropic suggests, and it would reduce the amount of evaluation across sub-models, allowing faster iteration. After that we can see whether Haiku or Opus might be better for some tools/use cases by experimenting in isolation per tool. Wdyt?
I did some tests last week, and some code creation tests earlier with the pre-release. Haiku performs badly for the zero-shot agent and complex code generation tasks, for example, but might be good enough for summarizing etc.; that improvement can still come in a later step. The win of just having Claude 3 Sonnet as the new baseline is already a great step forward.
The biggest impact I saw, apart from moving to Claude 3, came mainly from separating out the instruction/system prompt to support better conversational questions. Also, after that change, some of the currently poorly performing issue/epic questions went from 0% to 100%. Do you plan to do this in the first go, or in additional iterations?
We can't do a "drop-in replacement" for any of the Claude 3 models right now because our prompts, which are hard-coded in the monolith, use a format that works with the Anthropic Completions endpoint. The Messages endpoint was supported for Claude 2 models, but we never adopted it. For Claude 3, we must use the Messages endpoint.
So, the work we need to do for Claude 3 in the monolith right now is really to build support for the Messages API. Once we have that support, we can start testing Claude 3 models for Chat and it sounds like Sonnet is the best one to start with.
Thanks @jessieay, yes, I am aware that extra work is needed :-/ and thanks for taking care of it.
I wasn't clear enough about what I meant by "drop-in": I meant it only with respect to the model's prompts/results/etc. As I saw in the other doc about evaluating all the different Claude 3 types as a base model, I think we can skip that, use Sonnet simply as the first step, and then most probably make adaptations per tool.
As I saw in the other doc about evaluating all the different Claude 3 types as a base model, I think we can skip that, use Sonnet simply as the first step, and then most probably make adaptations per tool.
This sounds like a great approach to me, I am all for it
One thing I realized as I've dug into this: we don't currently use Claude 2 for everything. Here is the model breakdown for Chat:
Claude 2.1 is the default. We use it for: zero-shot prompt, Documentation tool, CI Editor assistant, slash commands.
We also use claude-instant-1.2 for: Epic tool, Issue tool.
For this Claude 3 effort, do we want to convert everything to use Claude 3, or just the elements that currently use Claude 2.1? I assume we use Claude Instant for issues/epics because it is the best model for those.
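The breakdown above can be sketched as a simple mapping; the override function is a hypothetical illustration of the "upgrade only the tools currently on Claude 2.1" option, and the tool keys and Claude 3 model ID are assumptions for the sketch:

```python
# Current Chat model assignments as described above (Claude 2.1 default).
DEFAULT_MODEL = "claude-2.1"
TOOL_MODELS = {
    "zero_shot": "claude-2.1",
    "documentation": "claude-2.1",
    "ci_editor_assistant": "claude-2.1",
    "slash_commands": "claude-2.1",
    "epic": "claude-instant-1.2",
    "issue": "claude-instant-1.2",
}

def model_for(tool: str, claude_3_enabled: bool = False) -> str:
    """Hypothetical: upgrade only the tools currently on Claude 2.1,
    leaving the Claude Instant tools (epic/issue) untouched."""
    current = TOOL_MODELS.get(tool, DEFAULT_MODEL)
    if claude_3_enabled and current == "claude-2.1":
        return "claude-3-sonnet-20240229"
    return current
```

Keeping the mapping explicit per tool would also leave room for the per-tool sub-model experiments (Haiku vs. Sonnet vs. Opus) discussed earlier.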
@oregand Regarding the split, I expect this again, as we will have different sub-models for the tools (as long as we use the same Haiku version for all :-)) and/or maybe use Gemini 1.5 in the future for a specific tool (as it has a 1M context window). For example, for the Epic + Issue identifier/tool, Haiku might be best as it's faster and totally capable of doing it. But we might soon look at using Opus for code explanation or generation, for example, and Sonnet for other tools.
With the Anthropic model we currently use by default (claude-1.3, it seems), the prompt limit is ~9K tokens, which is ~36K characters.
So basically, we added this limit when we were on a very old version of Claude and never updated it.
Further, we limit chat storage to 50 messages, and the length limit for each message is 20,000 characters, which is about 4,100 tokens (for code). 50 * 4,100 = 205,000, so the maximum amount of history we would save would be about 200K tokens.
For Claude 3, the context window limit is 200k tokens. Should we allow all chat history up to the full context window length?
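A minimal sketch of the arithmetic above, assuming the rough ratio of ~4.9 characters per token implied by the 20,000-character / ~4,100-token figures; the truncation helper is hypothetical, not existing code:

```python
CHARS_PER_TOKEN = 4.9  # rough ratio implied above: 20,000 chars ~= 4,100 tokens
CONTEXT_WINDOW_TOKENS = 200_000  # Claude 3 context window

def estimate_tokens(text: str) -> int:
    """Crude character-based token estimate (rounded up by one)."""
    return int(len(text) / CHARS_PER_TOKEN) + 1

def trim_history(messages: list[str], budget: int = CONTEXT_WINDOW_TOKENS) -> list[str]:
    """Keep the most recent messages whose estimated token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

With these assumptions, 50 maximum-length messages come in just over the 200K budget, so even "send everything" would need a truncation rule like this at the margin.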
This feels like a change that could dramatically impact all the features, so I think it should be explored in a separate experiment. My immediate thought is that messages within the last, say, 1 hour might be relevant to chat; basically, within the same session. We also have to remember that context is shared between the web UI and the extension UI, which may further add to the possible confusion of more message history.
It's a really interesting idea for sure, which I think we should explore, but maybe let's separate that into a future change.
within the last say 1 hour might be relevant to chat
From my daily experience using Duo Chat, I agree with 1 hour. For beginners on their Chat adoption journey, 1 day to 1 week would be helpful. Or we could implement a chat export function, so users can document the learned prompts in wikis, issues, and Markdown handbooks. Similar to GitLab communication and "Slack is not a knowledge base", we could apply that pattern to Duo Chat as a best practice :)
That is to say: most chat tools today offer the ability to create multiple conversations, where users can curate their own context per conversation. That might be a much more impactful way to leverage history, but of course it is much more complex. We should experiment with this before investing.
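A time-boxed session filter like the one discussed above could be sketched as follows (the message shape, window length, and function name are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

def session_messages(messages, now=None, window=timedelta(hours=1)):
    """Keep only messages whose timestamp falls within the session window.

    `messages` is assumed to be a list of (timestamp, text) tuples with
    timezone-aware timestamps; the 1-hour window matches the proposal above."""
    now = now or datetime.now(timezone.utc)
    return [(ts, text) for ts, text in messages if now - ts <= window]
```

The window would be easy to tune (1 hour vs. 1 day vs. 1 week) while the multi-conversation approach is explored separately.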
I've enrolled the Duo Chat daily test run user on gitlab.com into Claude 3.
This means that the daily test run for 2024-04-09 will show results for the Duo chat zero shot prompt with Claude 3. All test runs before that date will use Claude 2.1.
If we are not happy with the test results, we can iterate on the Claude 3 prompt construction. If we think they are an improvement over 2.1, we will begin rolling out the feature flag to all gitlab.com users and likely remove the feature flag altogether for %16.11.
I would suggest marking this issue as complete once we've confirmed that the Claude 3 results for zero shot are positive. Then we can create separate issues for migrating the remaining tools from the older Claude models to Claude 3 / the Messages API.
@m_gill I believe that Looker link is the correct aggregate report for the daily test runs. The graphs show trends over time based on the daily runs on gitlab.com, but I do not know how to access the raw data for each daily run.
Torsten Linz changed title from GitLab Duo Chat now uses Anthropic Claude 3 Sonnet to GitLab Duo Chat now uses Anthropic Claude 3 Sonnet (rather than 2.1)
Torsten Linz changed title from GitLab Duo Chat now uses Anthropic Claude 3 Sonnet (rather than 2.1) to Update Duo Chat to use Anthropic Claude 3 Sonnet (rather than 2.1)
Torsten Linz changed the description