Test validation set across models for /tests
This issue was created based on the /refactor counterpart: Test Validation Dataset Across Models for /refa... (#512598 - closed)
Context
In &16634, our objective is to establish an evaluation process to help us assess and monitor the accuracy of creating tests with /tests, particularly as we evaluate new models or new versions of models.
Following the completion of #515914 (closed) and #515921 (closed), we now want to run the /tests evaluation across different models and assess the results. This will allow us to confirm the evaluation's effectiveness and identify areas for improvement.
Proposal
- Select a diverse set of models (Mistral, Claude, GPT).
- Test the models using the /tests validation dataset from #515914 (closed) and the evaluator implemented in #515921 (closed); see the sketch after this list.
- Analyze the results and identify gaps.
- Provide recommendations for dataset enhancement.
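As a rough illustration of the cross-model run described above, here is a minimal sketch. All names in it (load_validation_dataset, generate_tests, score_generated_tests) are hypothetical placeholders and do not reflect the actual dataset format from #515914 or the evaluator API from #515921; the point is only the shape of the loop: every model is scored against the same validation cases so results are comparable.

```python
from statistics import mean

# Candidate models named in the proposal.
MODELS = ["mistral", "claude", "gpt"]

def load_validation_dataset():
    # Placeholder: stands in for the /tests validation dataset from #515914.
    return [
        {"input": "def add(a, b): return a + b", "expected_behaviours": ["add"]},
        {"input": "def sub(a, b): return a - b", "expected_behaviours": ["sub"]},
    ]

def generate_tests(model_name, source):
    # Placeholder: stands in for invoking /tests with the given model.
    return f"# tests for `{source}` generated by {model_name}"

def score_generated_tests(generated, case):
    # Placeholder: stands in for the evaluator implemented in #515921.
    return float(all(name in generated for name in case["expected_behaviours"]))

def evaluate_model(model_name, dataset):
    """Score one model across every case in the validation dataset."""
    return mean(
        score_generated_tests(generate_tests(model_name, case["input"]), case)
        for case in dataset
    )

if __name__ == "__main__":
    dataset = load_validation_dataset()
    for model in MODELS:
        print(f"{model}: {evaluate_model(model, dataset):.3f}")
```

Keeping the dataset and scoring fixed while only the model varies is what lets the per-model scores be compared directly and gaps in the dataset surface as cases where all models score poorly.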
Definition of Done
- Testing completed on 3+ models.
- Results documented with insights.
- Feedback shared for dataset refinement.