Create /tests evaluator and register evaluation command in CEF
Context
`/tests` is a Duo Chat IDE slash command that generates unit tests for the user's selected code. In &16634, our objective is to establish an evaluation process that helps us assess and monitor the accuracy of the tests generated by `/tests`, particularly as we evaluate new models or new versions of existing models.
For this issue, we can use the dataset created in #515914 (closed) to help us craft the evaluator for `/tests` (this can be worked on in parallel).
Proposal
- Determine evaluation criteria for this command. Examples:
  - Test coverage (does the output cover all major workflows and edge cases?)
  - Others?
- Design and implement the evaluator(s) in CEF (a minimal sketch is included after this list).
  - If more than one evaluator is required, it may be worth splitting the work into a separate issue.
- Register the `/tests` command in CEF (see the wiring sketch below).
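
To make the evaluator item more concrete, below is a minimal LLM-as-judge sketch for the test-coverage criterion. All names here (`TestCoverageEvaluator`, `llm_judge`, the prompt text, and the JSON response format) are illustrative assumptions, not existing CEF interfaces; the real implementation would plug into CEF's own evaluator abstractions.

```python
import json
from dataclasses import dataclass
from typing import Callable

# NOTE: every name below is an illustrative assumption, not an actual CEF interface.

JUDGE_PROMPT = """You are reviewing unit tests that were generated for a piece of code.

Code under test:
{code}

Generated tests:
{tests}

Rate the test coverage from 1 (poor) to 5 (excellent), considering whether all
major workflows and edge cases are exercised. Respond with JSON only:
{{"score": <1-5>, "reasoning": "<short explanation>"}}
"""


@dataclass
class CoverageResult:
    score: int       # 1-5 rating returned by the judge model
    reasoning: str   # judge's short justification, useful for error analysis


class TestCoverageEvaluator:
    """LLM-as-judge evaluator for the /tests command (illustrative sketch only)."""

    def __init__(self, llm_judge: Callable[[str], str]):
        # `llm_judge` is any callable that takes a prompt string and returns the
        # judge model's raw text completion; keeping it injectable avoids tying
        # the evaluator to a specific model client.
        self.llm_judge = llm_judge

    def evaluate(self, code: str, generated_tests: str) -> CoverageResult:
        prompt = JUDGE_PROMPT.format(code=code, tests=generated_tests)
        raw = self.llm_judge(prompt)
        parsed = json.loads(raw)
        return CoverageResult(score=int(parsed["score"]), reasoning=str(parsed["reasoning"]))
```

Keeping the judge model injectable also makes it straightforward to compare scores across judge versions as we evaluate new models.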
It may be helpful to reference the code in gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!985 (diffs).
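
For the registration step, a hedged sketch of how the `/tests` command could be wired to its evaluator and run over the #515914 dataset is shown below. `EVALUATORS` and `run_eval` are hypothetical stand-ins; the actual registration should mirror the pattern in prompt-library!985.

```python
# Hypothetical wiring only; follow the pattern in prompt-library!985 for the
# real registration. `EVALUATORS` and `run_eval` are illustrative stand-ins.

EVALUATORS = {
    # maps a slash command name to the evaluator class from the sketch above
    "tests": TestCoverageEvaluator,
}


def run_eval(command: str, dataset: list[dict], llm_judge) -> float:
    """Run the registered evaluator over a dataset and return the mean score."""
    evaluator = EVALUATORS[command](llm_judge)
    scores = [
        evaluator.evaluate(row["code"], row["generated_tests"]).score
        for row in dataset
    ]
    return sum(scores) / len(scores)
```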