Codestral for Chat

This issue is for evaluating all Chat use cases with Codestral models.

Model

Platforms

  • vLLM
  • AWS Bedrock
  • Azure AI

Details

  • The problem we are currently facing is described in the following support ticket: https://support.gitlab.com/hc/en-us/requests/638594
  • The problem is on the AI-Gateway side: it receives an error because it sends a message sequence that the model does not accept. I have attached the content of the exchange and the error logs for reference, and I have reproduced the same behavior with models hosted by Mistral.
  • I spoke with the Mistral team, and if I understood them correctly, they believe the issue is that the gateway finishes the conversation with the assistant role when it should use the user role for the last message. While I don't have deep expertise in this area, my own research supports this conclusion: ending a conversation with the assistant role is not supported by Mistral models.
  • Feedback from Mistral: "We tried using Codestral as a chat model with Randstad, using the "custom" model template. I debugged the LLM call, and the fix is actually quite simple to implement if you want to support Codestral: just add a `prefix: true` in the body of the last assistant message. Our API requires this if you'd like to prefill the assistant's answer with specific tokens." A request sketch illustrating this fix follows this list.
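
A minimal sketch of the fix Mistral describes, assuming an OpenAI-style chat completions endpoint as exposed by Mistral's hosted API; the endpoint URL, model name, and message contents below are illustrative and would need adapting for each platform (vLLM, AWS Bedrock, Azure AI):

```python
import requests

# Illustrative values; substitute the real endpoint and credentials.
API_URL = "https://api.mistral.ai/v1/chat/completions"
API_KEY = "..."  # placeholder

payload = {
    "model": "codestral-latest",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Explain what this function does."},
        # Today the gateway ends the conversation with a plain assistant
        # message, which Mistral rejects. Per Mistral's feedback, marking
        # the last assistant message with "prefix": true tells the API it
        # is a prefill of the assistant's answer, not a completed turn.
        {"role": "assistant", "content": "This function", "prefix": True},
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```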

Definition of Done

  • Codestral can be used to support all GA Chat features on all supported platforms
  • Examine individual inputs and outputs that scored poorly (1-2 scores); look for and document any patterns of either poor feature performance or poor LLM-judge calibration. Iterate on the model prompt to eliminate patterns of poor performance.
  • Achieve less than 20% poor answers (defined as 1s and 2s from an LLM judge, or less than 0.8 cosine similarity; see the similarity sketch after this list) using each supported model, for the areas where we have supporting validation datasets.
  • The traffic light system for self-hosted models has been updated to include quality scores, and the documentation has been updated to reflect any changes
    • The workbook for ER showing those scores has been linked either here or in a comment
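
For reference, a minimal sketch of the 0.8 cosine-similarity check, assuming model answers and reference answers have already been embedded as vectors; the helper names are illustrative, not part of the existing evaluation tooling:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_poor_answer(answer_vec: list[float], reference_vec: list[float]) -> bool:
    """An answer counts as poor when its similarity to the reference
    answer in the validation dataset falls below 0.8."""
    return cosine_similarity(answer_vec, reference_vec) < 0.8
```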