Eval coverage (and gaps) for Duo Workflow / Agentic Duo Chat

We should write down (docs? readme?) exactly what the SWE-bench evals cover and what they don't. We can start by capturing that in this issue and go from there.

Datasets and Evaluators we have

  1. All datasets listed here: https://datasets-gitlab-org-modelops-ai-model-validation-b35d3d2afe403e.gitlab.io/#coverage
  2. AI Gateway CI: the sanity-tests job ("Runs SWE Bench tests on a small set of problems that are consistently resolved by Duo Workflow"); this job runs on every MR to the AI Gateway. See the sketch below this list.
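
For context on what a pinned sanity subset might look like, here is a minimal sketch that selects a handful of problems from the public SWE-bench Lite dataset. The dataset name (`princeton-nlp/SWE-bench_Lite`), the field names, and the `django/django` filter are illustrative assumptions; the actual sanity-tests job presumably uses its own curated list of consistently resolved problems.

```python
# Sketch only: select a small, deterministic subset of SWE-bench problems
# that could serve as a sanity set. Assumes the public SWE-bench Lite
# release on Hugging Face; not the actual CI job configuration.
from datasets import load_dataset

# Load the public SWE-bench Lite test split.
swe_bench_lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

# Illustrative filter: take the first few problems from one well-covered repo.
# A real sanity set would instead pin explicit instance_ids that are known
# to be consistently resolved by Duo Workflow.
sanity_subset = [
    row["instance_id"]
    for row in swe_bench_lite
    if row["repo"] == "django/django"
][:5]

print(sanity_subset)
```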

What we can evaluate today

What we cannot evaluate today (gaps)

  1. Disambiguation - the step where the human user clarifies the task
  2. Any language other than Python

What evaluations engineers should run

Before merging a change to the AI Gateway that affects Duo Workflow or Agentic Duo Chat, engineers should run the following evaluations:
