Local Model Baselining and Prompt Development for Self-Hosted Models
Background
The Custom Models team is leveraging datasets maintained in the Centralized Evaluation Framework (CEF) to test and iterate on open-source (OS) models as the foundation for self-hosted Duo Chat features. Additional information on the AI Model Validation team's support for Custom Models can be found here, with specific information on validation processes found here.
Non-self-hosted Duo features are backed by 3rd party models and are captured in production code, which enables the AI Model Validation team to baseline each feature's performance and track changes in that performance via a daily run. This daily run exercises the totality of the applicable CEF prompt datasets for each feature to build an understanding of that feature's performance at scale.
Because Custom Models features are not in production code, it is not currently possible for the AI Model Validation team to create a daily run covering the performance of self-hosted Duo features across all OS models in development.
Solution
When considering a new OS model for support in self-hosted models, the Custom Models team will:
Local Model Baselining
- leverage any existing AI Model Validation baselines for the OS model of interest
- if that baseline does not exist, the Custom Models team will host the OS model on GDK and run the totality of applicable datasets from the CEF to create a baseline understanding of the model's performance for each feature and its use cases (a minimal run sketch follows this list)
- baseline runs and any subsequent full feature runs through prompt iteration should be pushed to BigQuery > dev-ai-research-0e2f897 > custom_models to enable a dashboard view of model/feature performance
- use that baseline to consider whether or not to support that OS model for each self-hosted feature and sub-feature
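As a rough illustration of such a baseline run, the sketch below assumes the OS model is served from GDK behind an OpenAI-style completions endpoint and that a CEF dataset is available as JSONL; the endpoint URL, model name, dataset fields, and embedding model are illustrative assumptions, not the actual CEF tooling.

```python
# Hypothetical sketch: run a CEF-style dataset against a locally hosted OS model
# and score each response against its reference answer with cosine similarity.
# The endpoint URL, model name, and dataset fields are assumptions for illustration.
import json

import requests
from sentence_transformers import SentenceTransformer, util

LOCAL_MODEL_URL = "http://localhost:4000/v1/completions"  # assumed GDK-hosted endpoint
embedder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model

def query_local_model(prompt: str) -> str:
    """Send a single prompt to the locally hosted OS model."""
    resp = requests.post(
        LOCAL_MODEL_URL,
        json={"model": "mistral", "prompt": prompt, "max_tokens": 512},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

def baseline_run(dataset_path: str) -> list[dict]:
    """Run every prompt in a JSONL dataset and attach a cosine-similarity score."""
    results = []
    with open(dataset_path) as f:
        for line in f:
            row = json.loads(line)  # assumed fields: prompt, expected_response, use_case
            answer = query_local_model(row["prompt"])
            score = util.cos_sim(
                embedder.encode(row["expected_response"]),
                embedder.encode(answer),
            ).item()
            results.append({**row, "response": answer, "cosine_similarity": score})
    return results
```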
Local Model Prompt Development
- leverage metrics (cosine similarity, LLM judge rankings (1-4)) to focus on poorly performing prompts/responses and identify patterns of weakness in candidate OS models' responses (see the triage sketch after this list)
- examine the individual prompt inputs/outputs to identify patterns of poor performance
- create subsets of data (ideally 20+ prompts) that are demonstrative of those identified patterns; these will most likely be extracted from a single use-case/dataset
- experiment with prompts focused on those subsets of data to achieve performance gains
- once a new prompt has been identified that delivers performance improvements, run it against a broader set of the data to ensure that it doesn't degrade performance in other use-cases
- this need not be the entire array of applicable validation datasets for the feature; it may instead be 20-50 prompts from each use-case, though it can also be the complete dataset from the CEF
- periodically re-validate the performance of the feature on the full versions of all applicable datasets locally, to ensure that performance gains in one area have not adversely affected performance in another area
- remember to push full dataset runs to BigQuery > dev-ai-research-0e2f897 > custom_models to enable a dashboard view of model/feature performance (a minimal load sketch also follows this list)
- before shipping a prompt for use in a self-hosted models feature, the total validation dataset suite will be run a final time and the performance on each dataset will be documented
- unless there are further changes to the prompt, feature, or configurations, there will be no reason to further run the entire dataset
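To illustrate the metric-driven triage described in the first bullets above, here is a minimal sketch that filters a run down to its low scorers and groups them by use case to seed the 20+ prompt subsets; the column names and thresholds are assumptions rather than CEF conventions.

```python
# Hypothetical sketch: isolate weak prompts/responses from a full run so they can
# be grouped into focused subsets (ideally 20+ prompts per identified pattern).
# Column names and score thresholds are illustrative assumptions.
import pandas as pd

def build_weakness_subsets(
    results: pd.DataFrame,
    sim_threshold: float = 0.7,  # assumed cosine-similarity cutoff
    judge_threshold: int = 2,    # LLM judge ranks 1-4; keep the bottom two
) -> dict[str, pd.DataFrame]:
    """Return per-use-case subsets of poorly performing prompts."""
    weak = results[
        (results["cosine_similarity"] < sim_threshold)
        | (results["llm_judge_rank"] <= judge_threshold)
    ]
    # Subsets are most useful when drawn from a single use case/dataset,
    # since a shared failure pattern is easiest to spot there.
    return {
        use_case: group
        for use_case, group in weak.groupby("use_case")
        if len(group) >= 20
    }
```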
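Likewise, a minimal sketch of pushing a full run to the custom_models dataset so it surfaces on the dashboard, assuming results are held in a pandas DataFrame; only the project and dataset names come from this page, while the table name and schema are assumptions.

```python
# Hypothetical sketch: load full-run results into
# dev-ai-research-0e2f897 > custom_models so the dashboard can pick them up.
# The table name and result schema are illustrative assumptions.
import pandas as pd
from google.cloud import bigquery

def push_run_to_bigquery(results: pd.DataFrame, table: str = "baseline_runs") -> None:
    """Append a run's results to the shared custom_models dataset."""
    client = bigquery.Client(project="dev-ai-research-0e2f897")
    table_id = f"dev-ai-research-0e2f897.custom_models.{table}"
    job = client.load_table_from_dataframe(results, table_id)
    job.result()  # block until the load job completes
```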
Genesis
Moving a discussion from #g_custom_models
Summary
- The Custom Models group should be running the entire dataset when adding new prompts
- The self-hosted models are not included in the daily runs (only GitLab-managed models are)
Background
Messages from Slack
when setting up our prompting and validation for the OS models, in some instances we have only been basing our quality assessment on the subsets of data within the CEF for each use case. As we move forward, we need to ensure that we are using both the subsets AND larger representative samples of the data (to include periodic runs of the whole use case/feature dataset) to ensure that we have a thorough understanding of the model's performance with the prompt. This will allow us to dig down on and identify any patterns of weakness that we can shore up with our prompting. I realize this is contrary to AI Model Validation documentation, as the team usually discourages feature teams from running the entire set locally, but that is for features that already have a complete daily run, in which case that work would be entirely redundant. For our features we will need to run the full sets locally for at least the initial full baseline, and then periodically to validate any changes we make, so we have a sense of the quality based on Mistral, Code Gemma, etc.
..
- If the intent is to monitor production, I don’t understand the point of doing daily runs directly against the model providers and not against our product’s APIs (Code Suggestions or Duo Chat). I think each individual team should set up production monitoring for their own use case and not depend on one single team. In my opinion, such an approach is error-prone and creates an unnecessary bottleneck.
- With custom models, we have total control over which model is being used. There are no surprise updates to the model being served, which can happen with the API-based LLMs (Anthropic). The model, once deployed, is not going to change its weights, so running the same dataset against it over and over makes even less sense.
- If the intent of the daily runs is to catch hiccups with the LLMs being served, the current approach with the daily runs will only test the provider during the length of the daily run, so for the rest of the day, if anything happens on the provider side we won’t really know until the next day.
- If the daily runs are correctly set to run against our APIs, the rate limit of those APIs might be the main reason why they take so long. For example, a local run against a branch is actually much faster than running against gitlab.com, since with the latter you will be rate-limited to 60 requests/s. When running locally, one can set this rate limit much higher.
Action: Moving forward, it is recommended to integrate both subsets and larger, representative samples of data into the evaluation process for Custom Models. This includes periodic runs of the entire use case or feature dataset to ensure a thorough understanding of the model's performance with prompts. This adjustment will enable the team to identify and address any recurring weaknesses effectively.
we should be experimenting and validating with the whole available datasets when making our determination about prompt/feature readiness. Bruno, I agree with you that the test sets as they are created are not particularly relevant to our application, as they are based off of another model's/feature's problem areas. The daily runs currently cover features supported by GitLab-managed models, not OS self-hosted models, and are therefore irrelevant to us. That is why we have reason to run the full datasets locally while other teams do not. We would only run the whole dataset to validate prompt changes that we have made.