Claude 3.0 requires use of the Messages API. When I send a request to any of the Claude 3 models using our existing AI Gateway logic, I get this error:
{'type': 'invalid_request_error', 'message': '\"claude-3-sonnet-20240229\" is not supported on this API. Please use the Messages API instead.'}
We have an existing issue for upgrading the AI Gateway to use the Anthropic Messages API, but today we still use the Text Completions API. So it seems the move to Claude 3 will require more work than the Claude 2.0 -> 2.1 upgrade.
I will dig in a bit more and report back on how much work it appears to be. The migration guide makes it sound pretty straightforward, but I am not sure what mechanisms we have in the AI Gateway for making changes.
Confirmed via Anthropic docs that we need to use messages API:
Our latest and most powerful Claude 3 models (Haiku, Sonnet, and Opus) can only be called via the Messages API. By upgrading, you'll be able to take advantage of their enhanced performance and capabilities.
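To make the migration concrete, here is a minimal sketch (the helper name and payload defaults are assumptions, not our actual implementation) of converting a legacy Text Completions prompt string into the payload shape the Messages API expects. The `\n\nHuman:` / `\n\nAssistant:` markers are the turn delimiters the old endpoint required:

```python
import re

def completion_prompt_to_messages(prompt: str) -> dict:
    """Convert a legacy Text Completions prompt ("\n\nHuman: ...\n\nAssistant:")
    into a Messages API payload. Hypothetical helper, for illustration only."""
    messages = []
    # Split on the Human:/Assistant: turn markers used by the old prompt format;
    # the tempered pattern captures each turn's text up to the next marker.
    pattern = r"\n\n(Human|Assistant): ?((?:(?!\n\n(?:Human|Assistant):).)*)"
    for role_tag, text in re.findall(pattern, prompt, flags=re.S):
        text = text.strip()
        if text:  # the trailing empty "\n\nAssistant:" turn is dropped
            role = "user" if role_tag == "Human" else "assistant"
            messages.append({"role": role, "content": text})
    # model and max_tokens values here are placeholders
    return {
        "model": "claude-3-sonnet-20240229",
        "max_tokens": 1024,
        "messages": messages,
    }
```

The Messages payload also moves any system prompt out of the turn list into a top-level `system` field, which is worth keeping in mind given our prompts are currently hard-coded in the completions format.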
Once this MR is merged, the next step would be to update the monolith to, behind a feature flag, send the Claude 3 model name as a header (similar to what we did for Claude 2.1) so that we could opt specific users into Claude 3 for local testing. We should also work with Model Validation to see how we could run the full prompt library test suite for Claude 3 to compare stats.
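As a rough illustration of the header-based opt-in described here (the header name `X-Gitlab-Model`, the fallback logic, and the model ID strings are assumptions for this sketch, not the actual implementation):

```python
# Hypothetical sketch: the monolith sends the requested model as a header
# (behind a feature flag), and the AI Gateway validates it against an
# allow-list, falling back to the current default otherwise.
SUPPORTED_CLAUDE_3_MODELS = {
    "claude-3-haiku-20240307",
    "claude-3-sonnet-20240229",
    "claude-3-opus-20240229",
}
DEFAULT_MODEL = "claude-2.1"

def resolve_model(headers: dict) -> str:
    """Pick the model from a request header; fall back to the default."""
    requested = headers.get("X-Gitlab-Model", DEFAULT_MODEL)
    return requested if requested in SUPPORTED_CLAUDE_3_MODELS else DEFAULT_MODEL
```

This mirrors the Claude 2.1 rollout pattern: users not in the feature flag never send the header and keep getting the default model.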
That's awesome @jessieay! Thank you so much for your great work here!
Once this MR is merged, the next step would be to update the monolith to, behind a feature flag, send the Claude 3 model name as a header (similar to what we did for Claude 2.1) so that we could opt specific users into Claude 3 for local testing.
This is outstanding!
We should also work with Model Validation to see how we could run the full prompt library test suite for Claude 3 to compare stats.
Do you think we could scope the full prompt library test suite to a single user that we toggle a feature flag on for these types of scenarios?
Would you be keen to take the lead on running the full prompt library test suite for Claude 3 to compare stats? We are eager to get Claude 3 evaluated ASAP so we can migrate sooner rather than later if the gains are significant.
The goal is to be able to locally switch between the Claude 3 family of models and Claude 2.1 for Duo Chat, and to measure any impact on the subset of data where the control is the current Duo Chat setup. Super excited for this. More details in the issue.
Further, if any support is needed on the experiment, @tle_gitlab would be able to help from the Model Validation team.
@oregand yes I already have a branch with most of this work done from when I was working on this before Summit, will update it when I am back at work on Wednesday (tomorrow). Excited to give this a try!
The branch has the ability to use any of the three Claude 3 models behind a feature flag. I confirmed with @tle_gitlab that the production model eval daily test runs for Duo Chat use a GitLab PAT belonging to a test user.
So, because the full model eval suite only runs on prod, to compare results for each model we can either:
1. Do a separate/additional full prod test run with a different test user, and flag each test user into the feature flag for each model.
2. Use the existing daily prod test run and time the feature flag updates so that the existing test user is using a different Claude 3 model each day for three days.
The first approach is more systematic and might provide a blueprint for how to test model changes going forward. The second approach is more manual but requires less upfront investment. I will make sure to sync with Model Validation on which way to go.
@oregand + @jessieay + @bcardoso: How about focusing purely on Sonnet as the main drop-in replacement model as a starting point? This is also something Anthropic suggests, and it would reduce the amount of evaluation across sub-models, allowing faster iteration. After that we can see whether Haiku or Opus might be better for some tools/use cases by experimenting in isolation per tool. Wdyt?
I did some tests last week, and some code creation tests earlier with the pre-release. Haiku performs badly for the zero-shot agent and complex code generation tasks, for example, but might be good enough for summarizing etc.; that improvement can still come in a later step. The win of just having Claude 3 Sonnet as the new baseline is already a great step forward.
The biggest impact I saw, apart from moving to Claude 3, came mainly from separating out the instruction/system prompt to support better conversational questions. Also, after that change, some of the currently poorly performing issue/epic questions went from 0% to 100%. Do you plan to do this in the first go, or in additional iterations?
We can't do a "drop-in replacement" for any of the Claude 3 models right now because our prompts, which are hard-coded in the monolith, use a format that works with the Anthropic Completions endpoint. The Messages endpoint was supported for Claude 2 models, but we never adopted it. For Claude 3, we must use the Messages endpoint.
So, the work we need to do for Claude 3 in the monolith right now is really to build support for the Messages API. Once we have that support, we can start testing Claude 3 models for Chat and it sounds like Sonnet is the best one to start with.
Thanks @jessieay, yes, I am aware that extra work is needed :-/ and thanks for taking care of it.
I wasn't clear enough about what I meant by "drop-in": I meant it only with respect to the model's prompts/results/etc. As I saw in the other doc about evaluating all the different Claude 3 types as a base model, I think we can skip that, use Sonnet simply as the first step, and then most probably make adaptations per tool.
As I saw in the other doc about evaluating all the different Claude 3 types as a base model, I think we can skip that, use Sonnet simply as the first step, and then most probably make adaptations per tool.
This sounds like a great approach to me, I am all for it
One thing I realized as I've dug into this: we don't currently use Claude 2 for everything. Here is the model breakdown for Chat:
Claude 2.1 is the default. We use it for: zero-shot prompt, Documentation tool, CI Editor assistant, slash commands.
We also use claude-instant-1.2 for: Epic tool, Issue tool.
For this Claude 3 effort, do we want to convert everything to use Claude 3, or just the elements that currently use Claude 2.1? I assume we use Claude Instant for issues/epics because it is the best model for those.
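The breakdown above can be sketched as a simple mapping; the override function is a hypothetical illustration of the "upgrade only the tools currently on Claude 2.1" option, and the tool keys and Claude 3 model ID are assumptions for the sketch:

```python
# Current Chat model assignments as described above (Claude 2.1 default).
DEFAULT_MODEL = "claude-2.1"
TOOL_MODELS = {
    "zero_shot": "claude-2.1",
    "documentation": "claude-2.1",
    "ci_editor_assistant": "claude-2.1",
    "slash_commands": "claude-2.1",
    "epic": "claude-instant-1.2",
    "issue": "claude-instant-1.2",
}

def model_for(tool: str, claude_3_enabled: bool = False) -> str:
    """Hypothetical: upgrade only the tools currently on Claude 2.1,
    leaving the Claude Instant tools (epic/issue) untouched."""
    current = TOOL_MODELS.get(tool, DEFAULT_MODEL)
    if claude_3_enabled and current == "claude-2.1":
        return "claude-3-sonnet-20240229"
    return current
```

Keeping the mapping explicit per tool would also leave room for the per-tool sub-model experiments (Haiku vs. Sonnet vs. Opus) discussed earlier.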
@oregand Regarding the split, I expect this again, as we will have different sub-models for the tools (as long as we use the same Haiku version for all :-)) and/or maybe use Gemini 1.5 in the future for a specific tool (as it has a 1M context window). For example, for the Epic + Issue identifier/tool, Haiku might be best as it's faster and totally capable of doing it. But we might soon look at using Opus for code explanation or generation, for example, and Sonnet for other tools.
With the Anthropic model we currently use by default (claude-1.3, it seems), the prompt limit is ~9K tokens, which is ~36K characters.
So basically, we added this limit when we were on a very old version of Claude and never updated it.
Further, we limit chat storage to 50 messages, and the length limit for each message is 20,000 characters, which is about 4,100 tokens (for code). 50 * 4,100 = 205,000, so the maximum amount of history we would save would be about 200K tokens.
For Claude 3, the context window limit is 200k tokens. Should we allow all chat history up to the full context window length?
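A minimal sketch of the arithmetic above, assuming the rough ratio of ~4.9 characters per token implied by the 20,000-character / ~4,100-token figures; the truncation helper is hypothetical, not existing code:

```python
CHARS_PER_TOKEN = 4.9  # rough ratio implied above: 20,000 chars ~= 4,100 tokens
CONTEXT_WINDOW_TOKENS = 200_000  # Claude 3 context window

def estimate_tokens(text: str) -> int:
    """Crude character-based token estimate (rounded up by one)."""
    return int(len(text) / CHARS_PER_TOKEN) + 1

def trim_history(messages: list[str], budget: int = CONTEXT_WINDOW_TOKENS) -> list[str]:
    """Keep the most recent messages whose estimated token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

With these assumptions, 50 maximum-length messages come in just over the 200K budget, so even "send everything" would need a truncation rule like this at the margin.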
This feels like a change that could dramatically impact all the features, so I think it should be explored in a separate experiment. My immediate thought is that messages within the last, say, 1 hour might be relevant to chat; basically, within the same session. We also have to remember that context is shared between the web UI and the extension UI, which may further add to the possible confusion of more message history.
It's a really interesting idea for sure, which I think we should explore, but maybe let's separate that into a future change.
within the last say 1 hour might be relevant to chat
From my daily experience using Duo Chat, I agree with 1 hour. For beginners on their Chat adoption journey, 1 day to 1 week would be helpful. Or we could implement a chat export function, so users can document the learned prompts in wikis, issues, and Markdown handbooks. Similar to GitLab communication and "Slack is not a knowledge base", we could apply that pattern to Duo Chat as a best practice :)
That is to say: most chat tools today offer the ability to create multiple conversations, where users can curate their own context per conversation. That might be a much more impactful way to leverage history, but of course it is much more complex. We should experiment with this before investing.
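A time-boxed session filter like the one discussed above could be sketched as follows (the message shape, window length, and function name are assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

def session_messages(messages, now=None, window=timedelta(hours=1)):
    """Keep only messages whose timestamp falls within the session window.

    `messages` is assumed to be a list of (timestamp, text) tuples with
    timezone-aware timestamps; the 1-hour window matches the proposal above."""
    now = now or datetime.now(timezone.utc)
    return [(ts, text) for ts, text in messages if now - ts <= window]
```

The window would be easy to tune (1 hour vs. 1 day vs. 1 week) while the multi-conversation approach is explored separately.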
I've enrolled the Duo Chat daily test run user on gitlab.com into Claude 3.
This means that the daily test run for 2024-04-09 will show results for the Duo chat zero shot prompt with Claude 3. All test runs before that date will use Claude 2.1.
If we are not happy with the test results, we can iterate on the Claude 3 prompt construction. If we think they are an improvement over 2.1, we will begin rolling out the feature flag to all gitlab.com users and likely remove the feature flag altogether for %16.11.
I would suggest marking this issue as complete once we've confirmed that the Claude 3 results for zero shot are positive. Then we can create separate issues for migrating the remaining tools from the older Claude models to Claude 3 / the Messages API.
@m_gill I believe that Looker link is the correct aggregate report for the daily test runs. The graphs show trends over time based on the daily runs on gitlab.com, but I do not know how to access the raw data for each daily run.
Torsten Linz changed title from GitLab Duo Chat now uses Anthropic Claude 3 Sonnet to GitLab Duo Chat now uses Anthropic Claude 3 Sonnet (rather than 2.1)
Torsten Linz changed title from GitLab Duo Chat now uses Anthropic Claude 3 Sonnet (rather than 2.1) to Update Duo Chat to use Anthropic Claude 3 Sonnet (rather than 2.1)
Torsten Linz changed the description