Eval coverage (and gaps) for Duo Workflow / Agentic Duo Chat

We should write down (docs? readme?) exactly what the SWE-bench evals cover and what they don't. We can start by capturing that in this issue and go from there.

Datasets and Evaluators we have

  1. All datasets listed here: https://datasets-gitlab-org-modelops-ai-model-validation-b35d3d2afe403e.gitlab.io/#coverage
  2. AI Gateway CI: the sanity-tests job ("Runs SWE Bench tests on a small set of problems that are consistently resolved by Duo Workflow"); this job runs on every MR to the AI Gateway. See the sketch below this list.
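
For context on what a pinned sanity subset might look like, here is a minimal sketch that selects a handful of problems from the public SWE-bench Lite dataset. The dataset name (`princeton-nlp/SWE-bench_Lite`), the field names, and the `django/django` filter are illustrative assumptions; the actual sanity-tests job presumably uses its own curated list of consistently resolved problems.

```python
# Sketch only: select a small, deterministic subset of SWE-bench problems
# that could serve as a sanity set. Assumes the public SWE-bench Lite
# release on Hugging Face; not the actual CI job configuration.
from datasets import load_dataset

# Load the public SWE-bench Lite test split.
swe_bench_lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

# Illustrative filter: take the first few problems from one well-covered repo.
# A real sanity set would instead pin explicit instance_ids that are known
# to be consistently resolved by Duo Workflow.
sanity_subset = [
    row["instance_id"]
    for row in swe_bench_lite
    if row["repo"] == "django/django"
][:5]

print(sanity_subset)
```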

What we can evaluate today

What we cannot evaluate today (gaps)

  1. Disambiguation - the step where the human user clarifies the task
  2. Any language other than Python

What evaluations engineers should run

Before merging a change to the AI Gateway that affects Duo Workflow or Agentic Duo Chat, engineers should run the following evaluations:
