Add tests to CI that evaluate Code Suggestion responses
As part of &11871, we want to expand Code Suggestions automated test coverage that can run in MRs.
We should add new RSpec tests that evaluate the quality of responses Code Suggestions provides to users (similar to the work in &11782 (closed) for GitLab Duo Chat; see !134610 (merged) for the test implementation for Duo).
At this point we should aim for "good enough" evaluation that could detect significant problems, but wouldn't necessarily pick up all low-quality responses. We could use the snippets in code-suggestion-scenarios as a first iteration. Subsequent work can improve the evaluation.
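As an illustration of what "good enough" could mean, a first-pass check might only assert that a response is plausible at all. This is a minimal sketch; the helper name, heuristics, and refusal patterns are assumptions, not part of the actual test script:

```ruby
# Hypothetical first-pass quality gate for a Code Suggestions response.
# It only catches significant problems: empty output, the provider echoing
# the prompt back, or obvious refusal boilerplate.
def plausible_suggestion?(prompt, suggestion)
  return false if suggestion.nil? || suggestion.strip.empty?
  return false if suggestion.strip == prompt.strip            # prompt echoed back
  return false if suggestion.match?(/\b(I'm sorry|as an AI)\b/i) # refusal text

  true
end

puts plausible_suggestion?("def add(a, b)", "  a + b\nend") # a real completion passes
puts plausible_suggestion?("def add(a, b)", "")             # an empty response fails
```

A check this coarse would miss many low-quality responses, which is the trade-off the first iteration accepts.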
Unlike Duo Chat, Code Suggestions depends on the AI-gateway for real responses from third-party AI providers. This complicates the test setup.
Proposal
Deploy a test environment as a GDK instance and the AI-gateway in Docker containers and run the code-suggestion-scenarios test script against it.
- Add jobs to code-suggestion-scenarios to:
  - Launch a test environment using GDK and the AI-gateway in Docker containers (built from the changes in the MR and the latest AI-gateway commit/version)
  - Run the test script against that test environment
  - gitlab-com/create-stage/code-creation/code-suggestion-scenarios!19 (merged)
- Add a job to gitlab-org/gitlab that triggers a multi-project pipeline on code-suggestion-scenarios with those jobs ☝ (!137611 (merged))
- Display a report in the MR that started the pipeline, like in !134610 (comment 1619643333)
- Add a label (e.g. ~"pipeline:cs-response-tests") that can be used to opt in to the tests when required, so they don't run automatically on every pipeline.
- Detect when relevant code changes are made and post a discussion asking the MR author to consider applying the label so the tests run at least once before the MR is merged.
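The label-gated trigger job could look roughly like this in gitlab-org/gitlab's CI config. This is a sketch only: the job name, stage, and rule are assumptions, and the real implementation is in !137611:

```yaml
# Hypothetical trigger job, gated on the opt-in MR label.
code-suggestions-response-tests:
  stage: test
  rules:
    # Only run when the opt-in label is applied to the MR.
    - if: '$CI_MERGE_REQUEST_LABELS =~ /pipeline:cs-response-tests/'
  trigger:
    project: gitlab-com/create-stage/code-creation/code-suggestion-scenarios
    strategy: depend
```

`strategy: depend` makes the upstream MR pipeline wait on, and reflect the status of, the downstream code-suggestion-scenarios pipeline.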
Pros:
- Most of the setup already exists, except the pipeline changes
- Integrates gitlab-org/gitlab, code-suggestion-scenarios, and the AI-gateway projects into a testing pipeline without duplicating code or responsibilities.
Cons:
- Building GDK takes about 8 minutes, which is not infeasibly slow on its own, but running the tests will add significant time as well.
Alternatives considered
Option 2
Implement the test script in code-suggestion-scenarios as RSpec tests that run in CI.
- Add the AI-gateway to the RSpec test environment (similar to Gitaly)
- Port the code-suggestion-scenarios test script to this project as RSpec tests and helpers
- Use snippets from code-suggestion-scenarios as input; clone that repo when running the tests
- Add CI jobs to gitlab-org/gitlab to run the tests
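A helper ported from the test script might split each cloned snippet into the prefix/suffix context around a cursor position. Everything here is hypothetical: the marker string, the helper name, and the hash shape are illustrative, not the actual code-suggestion-scenarios format:

```ruby
# Hypothetical cursor marker inside a snippet file from the cloned repo.
CURSOR = "<CURSOR>"

# Splits a snippet into the prefix/suffix that would be sent as the
# completion context for a Code Suggestions request.
def split_snippet(snippet)
  prefix, suffix = snippet.split(CURSOR, 2)
  { prefix: prefix, suffix: suffix || "" }
end

parts = split_snippet("def add(a, b)\n  <CURSOR>\nend\n")
puts parts[:prefix] # everything before the marker
puts parts[:suffix] # everything after it
```

RSpec examples could then feed each `prefix`/`suffix` pair to the AI-gateway running in the test environment and apply the quality checks to the response.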
Pros:
- Should be faster than option 1
Cons:
- Adds test setup for the AI-gateway to gitlab-org/gitlab, but the AI-gateway isn't released to users (though it is public)
- Duplicates code-suggestion-scenarios