Define a process for evaluating new Code Suggestions models
New large language models are being released every few weeks. For both code completion and code generation, we want to be sure we are using the best available models for the jobs to be done. To do this, we should have a defined, reusable process that explains how we test and evaluate any new model we are interested in.
Considerations
Here are some criteria we should consider in the evaluation:
- Availability
- Whether the model is for code completion or code generation
- Results from the AI Model Validation group, including quality of results and latency (see the latency sketch after this list)
- Language support
- Context window size
- Industry benchmarks
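To make the latency consideration concrete, below is a minimal sketch of how we could time completions against a candidate model before handing it off for full validation. The endpoint URL, model name, prompts, and response shape are placeholders, not an actual deployment.

```python
import statistics
import time

import requests

# Placeholder endpoint, model name, and prompt set -- replace with the real
# candidate deployment and a representative sample of completion requests.
CANDIDATE_ENDPOINT = "https://example.internal/v1/completions"
PROMPTS = [
    "def fibonacci(n):",
    "# Read a CSV file and return a list of dicts\n",
    "class LRUCache:",
]


def time_completion(prompt: str) -> float:
    """Send one completion request and return the wall-clock latency in seconds."""
    start = time.monotonic()
    response = requests.post(
        CANDIDATE_ENDPOINT,
        json={"model": "candidate-model", "prompt": prompt, "max_tokens": 64},
        timeout=30,
    )
    response.raise_for_status()
    return time.monotonic() - start


def run_latency_check(samples_per_prompt: int = 5) -> None:
    latencies = [
        time_completion(prompt)
        for prompt in PROMPTS
        for _ in range(samples_per_prompt)
    ]
    print(f"requests:       {len(latencies)}")
    print(f"median latency: {statistics.median(latencies):.3f}s")
    print(f"p95 latency:    {statistics.quantiles(latencies, n=20)[-1]:.3f}s")


if __name__ == "__main__":
    run_latency_check()
```

Quality comparisons should still come from the AI Model Validation group's evaluation pipeline rather than ad-hoc scripts like this; the sketch only covers quick latency spot checks.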
Process Steps
Here are some things that could be part of the process:
- Create an issue to track and record findings
- Work with the AI Model Validation group to add any new models to their testing. Example: gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#287
- Deploy a version of the model (if needed)
- Run additional tests (if any)
- Document the results
- Make a recommendation on whether we should use the model
Process Definition
The outcome of this issue should be a documented process, either in the handbook or as an issue template. The goal is to be able to reuse that process for any new model so we can make data-based decisions and keep a record of our investigations. We could consider creating a new GitLab project to keep these records in one place.
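As one possible starting point, an issue template could mirror the considerations and steps above. The headings below are only a sketch, not an agreed format:

```markdown
## Model evaluation: <model name>

### Background
- Provider / availability:
- Intended use case (code completion, code generation, or both):
- Context window size:
- Language support:
- Relevant industry benchmarks:

### Evaluation
- [ ] AI Model Validation group results linked (quality, latency)
- [ ] Model deployed for testing (if needed)
- [ ] Additional tests run and documented

### Recommendation
- Summary of findings:
- Recommendation (adopt / do not adopt / needs more data):
```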
We should test the process against the recent evaluation we did in https://gitlab.com/gitlab-org/gitlab/-/issues/455319+.