Draft: Add Vertex Gecko Confidence Score

Dylan Bernardi requested to merge db/add-confidence-score-gecko into main

What does this merge request do and why?

This MR partially addresses the linked issue. The documentation describing the value being added can be found here.

This MR introduces confidence_score as a value returned by model calls to VertexCodeModelHandle. It does so by changing how model call responses are handled. Before this MR, every model call returned a single string, "completion", containing the text the model generated. I am proposing that this no longer works once a model call needs to return additional values, as is the case for all Vertex Code Completion models.

I first approached this by adding another string value to the return of VertexCodeModelHandle. However, that meant making changes unique to a single class of models, adding another abstract class to handle that uniqueness, and introducing more if/else logic in the main pipeline. Overall, I found this approach hard to work with (and it would likely be hard to maintain in the future).

The current approach introduces a general ModelResponse class, returned in a list by the base model class. ModelResponse contains a completion: str field and, by making additional fields optional, allows a variety of values to be returned from different classes of models. This abstraction applies to all model classes and integrates cleanly into the pipeline, as this MR already demonstrates.
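As a rough sketch of the abstraction described above (the ModelResponse name and completion field are from this MR; the exact field types and the use of a dataclass are my assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelResponse:
    """Generic container for values returned by a single model call."""
    # Every model class returns the generated text.
    completion: str
    # Optional fields let model classes that do not produce a value
    # simply leave it unset; only VertexCodeCompletion models set this.
    confidence_score: Optional[str] = None

# A base model class would then return a list of these, e.g.:
responses = [ModelResponse(completion="def foo(): ...", confidence_score="-0.42")]
```

Making the extra fields optional is what lets the same class be returned by every model handle without per-class branching in the pipeline.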

State of this MR - for Handoff

There are two areas where I think this MR is missing work:

  1. I struggled with how to output this value into the BigQuery table. I tried storing it on one of the Chunks or TestCases so it could be accessed at the end of the pipeline run and included in the BigQuery data table, but that was not fruitful. The JSON block describing this field is below, with a note.

Not done: output the confidence_score value to the BigQuery dataset. The value is currently collected as a STRING in the ModelResponse class, and the plan was to convert it before sending it to BigQuery. (Although, now that I am writing this, it might be better to store it as a float from the start.)

            {
                "name": "confidence_score",
                "type": "FLOAT64",
                "mode": "REQUIRED",
                "description": "A float value that's less than zero. The higher the value for score, the greater confidence the model has in its response. Only for VertexCodeCompletion models.",
            },
  2. Tests are failing. Because of the change to the response values being returned, the tests in test_eval_codebase.py fail. I am holding off on fixing them until there is consensus on whether this MR will be accepted; if the approach is rejected, there is no need to fix the tests in the first place.
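For the conversion mentioned in the note above, a minimal sketch of coercing the STRING score to a FLOAT64-compatible value before the BigQuery write (the helper name confidence_to_float is hypothetical, not part of this MR):

```python
from typing import Optional

def confidence_to_float(raw: Optional[str]) -> Optional[float]:
    """Convert the confidence_score collected as a STRING in ModelResponse
    into a float suitable for the FLOAT64 BigQuery column.

    None passes through unchanged for model classes that do not emit a
    score (the schema marks the field REQUIRED only for
    VertexCodeCompletion rows).
    """
    if raw is None:
        return None
    return float(raw)

# Example: building a row dict for the BigQuery write.
row = {"confidence_score": confidence_to_float("-0.42")}
```

If the default type in ModelResponse were changed to float, as the note suggests, this conversion step would disappear entirely, which is an argument for making that change.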

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

Merge request checklist

  • I've run the affected pipeline(s) to validate that nothing is broken.
  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.