AI Model Validation: Solution Validating Test Harness to test chat at scale

Problem to solve

Specific to the chat use case: understand how to quantitatively evaluate GitLab chat at scale when comparing it with Claude, text-bison, or other models.

Proposal

We consider Claude the ground truth for the answers and would like to quantitatively measure how similar the answer from GitLab chat is to Claude's.

  1. Iteration 1: Validate whether our similarity algorithm works for this use case by running it on a set of 5 prompts and comparing the outputs of Claude, chat, and text-bison. @tlinz ---> @bcardoso-
  2. Iteration 2: If that works, run it for 50 prompts, adding the chat output manually since there is no API.
  3. Iteration 3: TBD

Further details

TBD

Links / references

  1. Sync meeting for collaboration: https://www.youtube.com/watch?v=9p1neCwYVWU
  2. Google Doc: https://docs.google.com/document/d/1l6tBLICSTL3dWb_7OvRXCKmaF4LaegUQAlRlNWkV_Zs/edit
  3. AI Model Validation Team meeting discussion with Torsten on this: https://docs.google.com/document/d/1NTHrFHxNLzG_kn69tiYuO0dSmUq2y_nOdBXYtOBTVA8/edit
  4. Reference issue: Create test framework that includes testing the... (gitlab-org/gitlab#422245 - closed)
  5. Awesome Torsten Spreadsheet
Edited by Mon Ray