
Adjust the mistral prompt for mixtral8x22b (MoE)

Mohamed Hamda requested to merge mixtral-22b-prompt into master

What does this MR do and why?

In this merge request, we adjust the Mistral prompt to be Mixtral8x22B (MoE) friendly and eliminate hallucinations.

While testing Mixtral8x22B, we encountered hallucinations caused by the few-shot examples in the prompt: the MoE model was copying content from these examples into its responses.

For instance, we observed significant hallucinations where the output mentioned "Arkansas": Screenshot_2024-06-03_at_14.26.06
This term came from one of our input examples: Screenshot_2024-06-03_at_14.27.30

After removing these examples and modifying the prompt to act as a code generation agent rather than a code completion one, we achieved much clearer and more accurate results: Screenshot_2024-06-03_at_14.40.27
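To illustrate the change described above, here is a minimal sketch of a completion-style prompt with embedded examples versus an example-free generation-agent prompt. These templates are illustrative assumptions only, not the actual prompt shipped in this MR:

```python
# Illustrative sketch only: NOT the actual prompt from this MR. It contrasts
# a completion-style prompt carrying few-shot examples (which the MoE tended
# to copy into its output) with an example-free generation-agent prompt.

# Completion-style prompt with embedded examples (hallucination-prone):
COMPLETION_PROMPT = """Complete the code below.

Example:
# Input: states = ["Arkansas", ...]
# Output: ...

{prefix}"""

# Generation-agent prompt without examples:
GENERATION_PROMPT = """You are a code generation agent.
Generate the code that belongs between the prefix and the suffix.
Return only code, with no explanations.

<prefix>{prefix}</prefix>
<suffix>{suffix}</suffix>"""


def build_prompt(prefix: str, suffix: str = "") -> str:
    """Render the generation-style prompt for a single completion request."""
    return GENERATION_PROMPT.format(prefix=prefix, suffix=suffix)
```

With no examples embedded, there is no example text (such as "Arkansas") for the model to leak into its responses.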

Mixtral prompt evaluation: Control similarity score ~0.91, Test similarity score ~0.89 (previously ~0.87)

  • Control : 0.91
  • Current: 0.88769758238512009, Variance: -0.2
  • Sample Size: 425
  • Success: Yes; for a Mistral-family model compared to Anthropic, this is a good result.

On GCP

SELECT avg(similarity_score) FROM `dev-ai-research-0e2f8974.code_suggestion_experiments.mhamda_mixtral_22b_20240603_150423__similarity_score` LIMIT 1000
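The BigQuery statement above simply averages the per-sample similarity scores. The same aggregation can be sketched locally; the score values below are made-up placeholders, not the real 425-sample experiment data:

```python
from statistics import mean

# Hypothetical per-sample similarity scores (placeholders, not the actual
# experiment data stored in the BigQuery results table).
similarity_scores = [0.91, 0.84, 0.89, 0.90]

# Equivalent of: SELECT avg(similarity_score) FROM <results table>
avg_similarity = mean(similarity_scores)
print(round(avg_similarity, 4))
```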

Screenshot_2024-06-03_at_15.39.36

Mistral prompt evaluation: Control similarity score ~0.91, Test similarity score ~0.86 (previously ~0.87)

  • Control : 0.91
  • Before: 0.86543818249421944
  • Current: 0.85687403566696974, Variance: < -0.1
  • Sample Size: 425
  • Success: Yes; for a Mistral-family model compared to Anthropic, this is a good result.

On GCP

SELECT avg(similarity_score) FROM `dev-ai-research-0e2f8974.code_suggestion_experiments.mhamda_mistral-2nd-run_20240603_154840__similarity_score` LIMIT 1000

Screenshot_2024-06-03_at_16.09.20

We can definitely iterate on the prompt further, and there is an open issue for that, but the current prompt works for both Mistral and Mixtral.

Edited by Mohamed Hamda
