Research: OS Models for Conversational Handling
Background
When users engage with Duo Chat, a zero-shot agent determines how the user's input is routed: it directs the query, as appropriate, to one or more of a set of specialized prompts (also known as tools).
- Zero-Shot Agent: this prompt defines for the LLM the tools it may choose from (defined below), directs a ReAct chain-of-thought, parses the final LLM response, defines default error messages, and sets resource limits (10% of the total limit). A minimal sketch of this pattern follows.
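To make the routing pattern concrete, here is a minimal sketch in Python. The tool names, prompt wording, and parsing regex are hypothetical illustrations only, not the actual Duo Chat agent prompt or parser.

```python
import re

# Hypothetical tool registry; the real Duo Chat tool list differs.
TOOLS = {
    "gitlab_documentation": "Answer questions about GitLab features and usage.",
    "issue_reader": "Fetch and answer questions about a specific issue.",
    "code_explainer": "Explain a selected snippet of code.",
}

def build_react_prompt(question: str) -> str:
    """Render a simplified ReAct prompt listing the available tools."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        "Answer the question using the ReAct format.\n"
        f"Available tools:\n{tool_lines}\n\n"
        "Thought: reason about which tool to use\n"
        "Action: <tool name>\n"
        "Action Input: <input for the tool>\n\n"
        f"Question: {question}\n"
    )

def parse_action(llm_output: str) -> str | None:
    """Extract the chosen tool from the model's ReAct response."""
    match = re.search(r"Action:\s*(\w+)", llm_output)
    if match and match.group(1) in TOOLS:
        return match.group(1)
    return None  # the real agent would fall back to a default error message
```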
Proposal
Using the procedure outlined in Document how to identify Open Source Models for... (#461070), identify one or more OS models with strong performance in conversational handling. The HuggingFace Chatbot Arena Leaderboard may be a good starting point for this research.
- Baseline the identified OS models, as well as the already-supported models (currently Mistral and Code Gemma), on existing chat datasets and pipelines within the CEF
- As necessary, use tracing to drill down and determine whether the OS model(s) can correctly choose the relevant tool based on the ReAct prompt
- Document baseline performances for tool selection (sketched below), and select models for GitLab to support for self-hosted Chat
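As a rough illustration of the tool-selection baselining step, the sketch below scores a model by comparing the parsed Action against a labeled expected tool. It reuses the hypothetical `build_react_prompt` and `parse_action` helpers from the Background sketch; the dataset schema and the `model_generate` wrapper are assumptions, not the CEF's actual pipeline API.

```python
from collections import defaultdict

def baseline_tool_selection(model_generate, dataset):
    """Score a model on tool selection: per-tool fraction of prompts
    where the parsed Action matches the labeled expected tool.

    model_generate: callable(prompt) -> str, wrapping the model under test.
    dataset: iterable of {"question": str, "expected_tool": str} records.
    """
    per_tool = defaultdict(lambda: {"correct": 0, "total": 0})
    for record in dataset:
        prompt = build_react_prompt(record["question"])
        chosen = parse_action(model_generate(prompt))
        stats = per_tool[record["expected_tool"]]
        stats["total"] += 1
        stats["correct"] += int(chosen == record["expected_tool"])
    return {tool: s["correct"] / s["total"] for tool, s in per_tool.items()}
```

Tracing individual failing records (question, raw LLM output, parsed Action) is what lets you distinguish a model that reasons to the wrong tool from one that merely breaks the ReAct output format.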
Definition of Done
One or more OS models have been identified to support conversational handling within Chat, and we have baselines of those models' performance documented.
- Baseline those models' performance on all existing Chat datasets/pipelines within the CEF
- Document baseline performances within the issue description and BigQuery
  - Model baseline runs and any full feature runs through prompt iteration should be pushed to BigQuery > dev-ai-research-0e2f897 > custom_models to enable a dashboard view of model/feature performance (see the sketch after this list)
- Identify models with strong performance on the task and generate an issue for prompt creation/iteration
- To be considered for support, the baseline model should have high quality scores across all Chat tasks, reflecting that the correct tool was chosen per task. The minimum baseline is a 3.6 quality score on a scale of 1-4. After prompt iteration, the feature is expected to reach at least 3.8.
  - To understand the basis for this minimum quality threshold, reference foundational model performances for the task on the Chat dashboard
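Below is a minimal sketch of pushing baseline scores to BigQuery with the `google-cloud-bigquery` client. The project and dataset come from this issue; the table name, row schema, and field names are placeholders, not an established convention.

```python
from datetime import datetime, timezone

from google.cloud import bigquery  # pip install google-cloud-bigquery

# Project and dataset from this issue; the table name is a placeholder.
TABLE_ID = "dev-ai-research-0e2f897.custom_models.chat_baselines"

def push_baseline(model_name: str, scores: dict[str, float]) -> None:
    """Append one row per (model, tool) baseline score to BigQuery."""
    client = bigquery.Client(project="dev-ai-research-0e2f897")
    rows = [
        {
            "model": model_name,
            "tool": tool,
            "accuracy": accuracy,
            "run_at": datetime.now(timezone.utc).isoformat(),
        }
        for tool, accuracy in scores.items()
    ]
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```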