Define a process for evaluating new Code Suggestions models
New large language models are being released every few weeks. For both code completion and code generation, we want to be sure we are using the best available models for the jobs to be done. To do this, we should have a defined, reusable process that explains how we test and evaluate any new model we are interested in.
Considerations
Here are some criteria we should consider in the evaluation:
- Availability
- Whether the model is for code completion or code generation
- Results from the AI Model Validation group, including quality of results and latency (see the latency sketch after this list)
- Language support
- Context window size
- Industry benchmarks
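To make the latency consideration concrete, below is a minimal sketch of how we could time completions against a candidate model before handing it off for full validation. The endpoint URL, model name, prompts, and response shape are placeholders, not an actual deployment.

```python
import statistics
import time

import requests

# Placeholder endpoint, model name, and prompt set -- replace with the real
# candidate deployment and a representative sample of completion requests.
CANDIDATE_ENDPOINT = "https://example.internal/v1/completions"
PROMPTS = [
    "def fibonacci(n):",
    "# Read a CSV file and return a list of dicts\n",
    "class LRUCache:",
]


def time_completion(prompt: str) -> float:
    """Send one completion request and return the wall-clock latency in seconds."""
    start = time.monotonic()
    response = requests.post(
        CANDIDATE_ENDPOINT,
        json={"model": "candidate-model", "prompt": prompt, "max_tokens": 64},
        timeout=30,
    )
    response.raise_for_status()
    return time.monotonic() - start


def run_latency_check(samples_per_prompt: int = 5) -> None:
    latencies = [
        time_completion(prompt)
        for prompt in PROMPTS
        for _ in range(samples_per_prompt)
    ]
    print(f"requests:       {len(latencies)}")
    print(f"median latency: {statistics.median(latencies):.3f}s")
    print(f"p95 latency:    {statistics.quantiles(latencies, n=20)[-1]:.3f}s")


if __name__ == "__main__":
    run_latency_check()
```

Quality comparisons should still come from the AI Model Validation group's evaluation pipeline rather than ad-hoc scripts like this; the sketch only covers quick latency spot checks.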
Process Steps
Here are some things that could be part of the process:
- Create an issue to track and record findings
- Work with the AI Model Validation group to add any new models to their testing. Example: gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#287
- Deploy a version of the model (if needed)
- Run additional tests (if any)
- Document the results
- Make a recommendation on whether we should use the model
Process Definition
The outcome of this issue should be a documented process, either in the handbook or as an issue template. The goal is to be able to reuse that process for any new model so we can make data-based decisions and keep a record of our investigations. We could consider creating a new GitLab project to keep these records in one place.
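As one possible starting point, an issue template could mirror the considerations and steps above. The headings below are only a sketch, not an agreed format:

```markdown
## Model evaluation: <model name>

### Background
- Provider / availability:
- Intended use case (code completion, code generation, or both):
- Context window size:
- Language support:
- Relevant industry benchmarks:

### Evaluation
- [ ] AI Model Validation group results linked (quality, latency)
- [ ] Model deployed for testing (if needed)
- [ ] Additional tests run and documented

### Recommendation
- Summary of findings:
- Recommendation (adopt / do not adopt / needs more data):
```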
We should test the process against the recent evaluation we did in https://gitlab.com/gitlab-org/gitlab/-/issues/455319+.