AI Model Validation: Solution Validation Test Harness to Test Chat at Scale
Problem to solve
Specific to the chat use case: understand how to quantitatively evaluate GitLab chat at scale when comparing it with Claude, text-bison, or other models.
Proposal
We consider Claude the ground truth for answers and want to verify, with a quantitative measure, how similar the answers from GitLab chat are to Claude's.
- Iteration 1: Validate whether our similarity algorithm works for this use case by running it on a set of 5 prompts and comparing the outputs from Claude, chat, and text-bison. @tlinz ---> @bcardoso-
- Iteration 2: If that works, run it for 50 prompts, manually adding the chat output since there is no API.
- Iteration 3: TBD
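The iterations above could be sketched as follows. This is a hedged illustration only: the actual similarity algorithm is not specified here, so a simple bag-of-words cosine similarity stands in for it, and the prompt/answer data is hypothetical.

```python
# Sketch of Iteration 1: score each model's answer against the Claude
# "ground truth" answer. The real similarity algorithm may differ; cosine
# similarity over token counts is used here purely as a placeholder.
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using bag-of-words token counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in set(ca) & set(cb))
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Hypothetical answers for one prompt; a real run would collect these
# from Claude (ground truth), GitLab chat, and text-bison for 5 prompts
# (50 in Iteration 2, with chat output pasted in manually).
ground_truth = "Use a merge request to propose changes to the main branch."
candidates = {
    "gitlab-chat": "Propose changes to the main branch with a merge request.",
    "text-bison": "Open an issue to discuss the proposal first.",
}
scores = {name: cosine_similarity(ground_truth, answer)
          for name, answer in candidates.items()}
```

With scores like these per prompt, averaging across the prompt set gives a single quantitative figure per model for comparison against the Claude baseline.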
Further details
TBD
Links / references
- Sync meeting for collaboration : https://www.youtube.com/watch?v=9p1neCwYVWU
- Google Doc: https://docs.google.com/document/d/1l6tBLICSTL3dWb_7OvRXCKmaF4LaegUQAlRlNWkV_Zs/edit
- AI Model Validation Team meeting discussion with Torsten on this: https://docs.google.com/document/d/1NTHrFHxNLzG_kn69tiYuO0dSmUq2y_nOdBXYtOBTVA8/edit
- Reference to Issue : Create test framework that includes testing the... (gitlab-org/gitlab#422245 - closed)
- Awesome Torsten Spreadsheet