Code Embeddings Model Evaluations
## Context

We need to evaluate different embeddings generation models for performance, as well as the effectiveness of the search results as additional context. This epic is specifically for evaluating the performance of different embeddings models. To evaluate the effectiveness of the search results for context enhancement, please refer to https://gitlab.com/groups/gitlab-org/-/epics/17750+

## References

Prior art for evaluations:

- https://gitlab.com/groups/gitlab-org/-/epics/16173+
- https://gitlab.com/groups/gitlab-org/-/epics/16672+

## Proposal

We will make use of the [Prompt Library](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library) and the [ELI5](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#gitlab-eval-like-im-5-project) framework to run evaluations.

### Evaluation script

To run evaluations for the different embeddings models, we first need to add support for said evaluations. The steps we need to take are:

1. [Add script for evaluating the embeddings generation endpoint](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/issues/715) - given a dataset, this script will loop through each record and send a request to the embeddings generation endpoint in AIGW.
2. [Create dataset for code embeddings generation evaluation](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/issues/716) - this dataset must reflect the data we are generating embeddings for, so it should consist of code chunks/snippets.
3.
[Add the embeddings generation eval script to the evaluation-runner pipeline](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/issues/717) - this allows us to run the evaluation on a GCP instance. While we can run evaluations locally, running them on a GCP instance gives us more consistent latency/performance results.

### Running evaluations

4. [Run latency evaluations for the textembeddings model](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/issues/718) - once steps 1-3 are done, we can run the actual evaluations. The first evaluation we'll do is for Vertex AI's `textembeddings` model.
   - We should be able to get latency numbers from this evaluation.

### Evaluating other models

Once we introduce other embeddings models, we might need to update the script introduced in step 1 to accommodate a new endpoint (if necessary), then run evaluations specific to that model.
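The evaluation loop from step 1 can be sketched roughly as below. This is a minimal illustration, not the Prompt Library implementation: the `embed_fn` callable, the stub embedding, the 768-dimension vector size, and the reported percentiles are all assumptions; the real script would issue HTTP requests to the AIGW embeddings endpoint.

```python
# Hypothetical sketch of a latency evaluation over an embeddings dataset.
# embed_fn stands in for the actual request to the AIGW embeddings endpoint.
import statistics
import time


def evaluate_latency(dataset, embed_fn):
    """Send each code snippet through embed_fn and collect latency stats (ms)."""
    latencies = []
    for snippet in dataset:
        start = time.perf_counter()
        embed_fn(snippet)  # in the real script: POST the snippet to AIGW
        latencies.append((time.perf_counter() - start) * 1000)
    return {
        "count": len(latencies),
        "mean_ms": statistics.mean(latencies),
        "p50_ms": statistics.median(latencies),
        "p95_ms": statistics.quantiles(latencies, n=20)[18],
    }


def stub_embed(snippet):
    # Placeholder for a real embeddings response; dimension is illustrative.
    return [0.0] * 768


report = evaluate_latency(["def foo(): pass", "class Bar: ..."], stub_embed)
```

Running this on a GCP instance rather than locally (step 3) matters precisely because the latency numbers above are only comparable across models when the network and hardware conditions are held constant.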