Test validation set across models for /tests
This issue was created based on the /refactor counterpart: Test Validation Dataset Across Models for /refa... (#512598 - closed)
Context
In &16634, our objective is to establish an evaluation process to help us assess and monitor the accuracy of creating tests with /tests, particularly as we evaluate new models or new versions of models.
Following the completion of #515914 (closed) and #515921 (closed), we now want to run the /tests evaluation across different models and assess the results. This will allow us to confirm the evaluation's effectiveness and identify areas for improvement.
Proposal
- Select a diverse set of models (Mistral, Claude, GPT).
- Test the models using the /tests validation dataset from #515914 (closed) and the evaluator implemented in #515921 (closed); see the sketch after this list.
- Analyze the results and identify gaps.
- Provide recommendations for dataset enhancement.
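As a rough illustration of the cross-model run described above, here is a minimal sketch. All names in it (load_validation_dataset, generate_tests, score_generated_tests) are hypothetical placeholders and do not reflect the actual dataset format from #515914 or the evaluator API from #515921; the point is only the shape of the loop: every model is scored against the same validation cases so results are comparable.

```python
from statistics import mean

# Candidate models named in the proposal.
MODELS = ["mistral", "claude", "gpt"]

def load_validation_dataset():
    # Placeholder: stands in for the /tests validation dataset from #515914.
    return [
        {"input": "def add(a, b): return a + b", "expected_behaviours": ["add"]},
        {"input": "def sub(a, b): return a - b", "expected_behaviours": ["sub"]},
    ]

def generate_tests(model_name, source):
    # Placeholder: stands in for invoking /tests with the given model.
    return f"# tests for `{source}` generated by {model_name}"

def score_generated_tests(generated, case):
    # Placeholder: stands in for the evaluator implemented in #515921.
    return float(all(name in generated for name in case["expected_behaviours"]))

def evaluate_model(model_name, dataset):
    """Score one model across every case in the validation dataset."""
    return mean(
        score_generated_tests(generate_tests(model_name, case["input"]), case)
        for case in dataset
    )

if __name__ == "__main__":
    dataset = load_validation_dataset()
    for model in MODELS:
        print(f"{model}: {evaluate_model(model, dataset):.3f}")
```

Keeping the dataset and scoring fixed while only the model varies is what lets the per-model scores be compared directly and gaps in the dataset surface as cases where all models score poorly.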
Definition of Done
- Testing completed on 3+ models.
- Results documented with insights.
- Feedback shared for dataset refinement.