AI Model Validation: Solution Validation Test Harness to Test Chat at Scale
Problem to solve
Specific to the chat use case: understand how to quantitatively evaluate GitLab chat at scale when comparing it with Claude, text-bison, or other models.
Proposal
We consider Claude the ground truth for answers and want to verify, with a quantitative measure, how similar the answers from GitLab chat are to Claude's.
- Iteration 1: Validate whether our similarity algorithm works for this use case by running it on a set of 5 prompts and comparing the outputs from Claude, chat, and text-bison. @tlinz ---> @bcardoso-
- Iteration 2: If that works, run it for 50 prompts, manually adding the chat output since there is no API.
- Iteration 3: TBD
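The iterations above could be sketched as follows. This is a hedged illustration only: the actual similarity algorithm is not specified here, so a simple bag-of-words cosine similarity stands in for it, and the prompt/answer data is hypothetical.

```python
# Sketch of Iteration 1: score each model's answer against the Claude
# "ground truth" answer. The real similarity algorithm may differ; cosine
# similarity over token counts is used here purely as a placeholder.
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using bag-of-words token counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in set(ca) & set(cb))
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Hypothetical answers for one prompt; a real run would collect these
# from Claude (ground truth), GitLab chat, and text-bison for 5 prompts
# (50 in Iteration 2, with chat output pasted in manually).
ground_truth = "Use a merge request to propose changes to the main branch."
candidates = {
    "gitlab-chat": "Propose changes to the main branch with a merge request.",
    "text-bison": "Open an issue to discuss the proposal first.",
}
scores = {name: cosine_similarity(ground_truth, answer)
          for name, answer in candidates.items()}
```

With scores like these per prompt, averaging across the prompt set gives a single quantitative figure per model for comparison against the Claude baseline.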
Further details
TBD
Links / references
- Sync meeting for collaboration : https://www.youtube.com/watch?v=9p1neCwYVWU
- Google Doc: https://docs.google.com/document/d/1l6tBLICSTL3dWb_7OvRXCKmaF4LaegUQAlRlNWkV_Zs/edit
- AI Model Validation Team meeting discussion with Torsten on this: https://docs.google.com/document/d/1NTHrFHxNLzG_kn69tiYuO0dSmUq2y_nOdBXYtOBTVA8/edit
- Reference to Issue : Create test framework that includes testing the... (gitlab-org/gitlab#422245 - closed)
- Awesome Torsten Spreadsheet