Add tests to CI that evaluate Code Suggestion responses
As part of &11871, we want to expand Code Suggestions automated test coverage that can run in MRs.
We should add new RSpec tests that evaluate the quality of responses Code Suggestions provides to users (similar to the work in &11782 (closed) for GitLab Duo Chat; see !134610 (merged) for the test implementation for Duo).
At this point we should aim for "good enough" evaluation that could detect significant problems, but wouldn't necessarily pick up all low-quality responses. We could use the snippets in code-suggestion-scenarios as a first iteration. Subsequent work can improve the evaluation.
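As an illustration of what "good enough" could mean, a first-pass check might only assert that a response is plausible at all. This is a minimal sketch; the helper name, heuristics, and refusal patterns are assumptions, not part of the actual test script:

```ruby
# Hypothetical first-pass quality gate for a Code Suggestions response.
# It only catches significant problems: empty output, the provider echoing
# the prompt back, or obvious refusal boilerplate.
def plausible_suggestion?(prompt, suggestion)
  return false if suggestion.nil? || suggestion.strip.empty?
  return false if suggestion.strip == prompt.strip            # prompt echoed back
  return false if suggestion.match?(/\b(I'm sorry|as an AI)\b/i) # refusal text

  true
end

puts plausible_suggestion?("def add(a, b)", "  a + b\nend") # a real completion passes
puts plausible_suggestion?("def add(a, b)", "")             # an empty response fails
```

A check this coarse would miss many low-quality responses, which is the trade-off the first iteration accepts.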
Unlike Duo Chat, Code Suggestions depends on the AI-gateway for real responses from third-party AI providers. This complicates the test setup.
Proposal
Deploy a test environment as a GDK instance and the AI-gateway in Docker containers and run the code-suggestion-scenarios test script against it.
- Add jobs to code-suggestion-scenarios to:
  - Launch a test environment using GDK and the AI-gateway in Docker containers (built from the changes in the MR and the latest AI-gateway commit/version)
  - Run the test script against that test environment
  - gitlab-com/create-stage/code-creation/code-suggestion-scenarios!19 (merged)
- Add a job to gitlab-org/gitlab that triggers a multi-project pipeline on code-suggestion-scenarios with those jobs ☝ (!137611 (merged))
- Display a report in the MR that started the pipeline, like in !134610 (comment 1619643333)
- Add a label (e.g. ~"pipeline:cs-response-tests") that can be used to opt in to the tests when required, so they don't run automatically on every pipeline.
- Detect when relevant code changes are made and post a discussion asking the MR author to consider applying the label so the tests run at least once before the MR is merged.
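The label-gated trigger job could look roughly like this in gitlab-org/gitlab's CI config. This is a sketch only: the job name, stage, and rule are assumptions, and the real implementation is in !137611:

```yaml
# Hypothetical trigger job, gated on the opt-in MR label.
code-suggestions-response-tests:
  stage: test
  rules:
    # Only run when the opt-in label is applied to the MR.
    - if: '$CI_MERGE_REQUEST_LABELS =~ /pipeline:cs-response-tests/'
  trigger:
    project: gitlab-com/create-stage/code-creation/code-suggestion-scenarios
    strategy: depend
```

`strategy: depend` makes the upstream MR pipeline wait on, and reflect the status of, the downstream code-suggestion-scenarios pipeline.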
Pros:
- Most of the setup already exists, except the pipeline changes
- Integrates gitlab-org/gitlab, code-suggestion-scenarios, and the AI-gateway projects into a testing pipeline without duplicating code or responsibilities.
Cons:
- Building GDK takes about 8 minutes, which is not infeasibly slow on its own, but running the tests will add significant time as well.
Alternatives considered
Option 2
Implement the test script in code-suggestion-scenarios as RSpec tests that run in CI.
- Add the AI-gateway to the RSpec test environment (similar to Gitaly)
- Port the code-suggestion-scenarios test script to this project as RSpec tests and helpers
- Use snippets from code-suggestion-scenarios as input; clone that repo when running the tests
- Add CI jobs to gitlab-org/gitlab to run the tests
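A helper ported from the test script might split each cloned snippet into the prefix/suffix context around a cursor position. Everything here is hypothetical: the marker string, the helper name, and the hash shape are illustrative, not the actual code-suggestion-scenarios format:

```ruby
# Hypothetical cursor marker inside a snippet file from the cloned repo.
CURSOR = "<CURSOR>"

# Splits a snippet into the prefix/suffix that would be sent as the
# completion context for a Code Suggestions request.
def split_snippet(snippet)
  prefix, suffix = snippet.split(CURSOR, 2)
  { prefix: prefix, suffix: suffix || "" }
end

parts = split_snippet("def add(a, b)\n  <CURSOR>\nend\n")
puts parts[:prefix] # everything before the marker
puts parts[:suffix] # everything after it
```

RSpec examples could then feed each `prefix`/`suffix` pair to the AI-gateway running in the test environment and apply the quality checks to the response.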
Pros:
- Should be faster than option 1
Cons:
- Adds test setup for the AI-gateway to gitlab-org/gitlab, but the AI-gateway isn't released to users (though it is public)
- Duplicates code-suggestion-scenarios