Research: OS Models for Conversational Handling
Background
When users engage with Duo Chat, a zero-shot agent determines how the user's input is routed: it directs the query, as appropriate, to one or more of a set of specialized prompts (also known as tools).
- Zero-Shot Agent: this prompt defines for the LLM the tools it may choose from (defined below), directs a ReAct chain-of-thought, parses the final LLM response, defines default error messages, and sets resource limits (10% of the total limit). A minimal sketch of this pattern follows.
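To make the routing pattern concrete, here is a minimal sketch in Python. The tool names, prompt wording, and parsing regex are hypothetical illustrations only, not the actual Duo Chat agent prompt or parser.

```python
import re

# Hypothetical tool registry; the real Duo Chat tool list differs.
TOOLS = {
    "gitlab_documentation": "Answer questions about GitLab features and usage.",
    "issue_reader": "Fetch and answer questions about a specific issue.",
    "code_explainer": "Explain a selected snippet of code.",
}

def build_react_prompt(question: str) -> str:
    """Render a simplified ReAct prompt listing the available tools."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return (
        "Answer the question using the ReAct format.\n"
        f"Available tools:\n{tool_lines}\n\n"
        "Thought: reason about which tool to use\n"
        "Action: <tool name>\n"
        "Action Input: <input for the tool>\n\n"
        f"Question: {question}\n"
    )

def parse_action(llm_output: str) -> str | None:
    """Extract the chosen tool from the model's ReAct response."""
    match = re.search(r"Action:\s*(\w+)", llm_output)
    if match and match.group(1) in TOOLS:
        return match.group(1)
    return None  # the real agent would fall back to a default error message
```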
Proposal
Using the procedure outlined in Document how to identify Open Source Models for... (#461070), identify one or more OS models with strong performance in conversational handling. The HuggingFace Chatbot Arena Leaderboard may be a good starting point for this research.
- Baseline the identified OS models, as well as the already-supported models (currently Mistral and Code Gemma), on existing chat datasets and pipelines within the CEF
- As necessary, use tracing to drill down and determine whether the OS model(s) can correctly choose the relevant tool based on the ReAct prompt
- Document baseline performances for tool selection (sketched below), and select models for GitLab to support for self-hosted Chat
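As a rough illustration of the tool-selection baselining step, the sketch below scores a model by comparing the parsed Action against a labeled expected tool. It reuses the hypothetical `build_react_prompt` and `parse_action` helpers from the Background sketch; the dataset schema and the `model_generate` wrapper are assumptions, not the CEF's actual pipeline API.

```python
from collections import defaultdict

def baseline_tool_selection(model_generate, dataset):
    """Score a model on tool selection: per-tool fraction of prompts
    where the parsed Action matches the labeled expected tool.

    model_generate: callable(prompt) -> str, wrapping the model under test.
    dataset: iterable of {"question": str, "expected_tool": str} records.
    """
    per_tool = defaultdict(lambda: {"correct": 0, "total": 0})
    for record in dataset:
        prompt = build_react_prompt(record["question"])
        chosen = parse_action(model_generate(prompt))
        stats = per_tool[record["expected_tool"]]
        stats["total"] += 1
        stats["correct"] += int(chosen == record["expected_tool"])
    return {tool: s["correct"] / s["total"] for tool, s in per_tool.items()}
```

Tracing individual failing records (question, raw LLM output, parsed Action) is what lets you distinguish a model that reasons to the wrong tool from one that merely breaks the ReAct output format.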
Definition of Done
One or more OS models have been identified to support conversational handling within Chat, and we have baselines of those models' performance documented.
- Baseline those models' performance on all existing Chat datasets/pipelines within the CEF
- Document baseline performances within the issue description and BigQuery
  - Model baseline runs and any full feature runs through prompt iteration should be pushed to BigQuery > dev-ai-research-0e2f897 > custom_models to enable a dashboard view of model/feature performance (see the sketch after this list)
- Identify models with strong performance on the task and generate an issue for prompt creation/iteration
- To be considered for support, the baseline model should have high quality scores across all Chat tasks, reflecting that the correct tool was chosen per task. The minimum baseline is a 3.6 quality score on a scale of 1-4. After prompt iteration, the feature is expected to reach at least 3.8.
  - To understand the basis for this minimum quality threshold, reference foundational model performances for the task on the Chat dashboard
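Below is a minimal sketch of pushing baseline scores to BigQuery with the `google-cloud-bigquery` client. The project and dataset come from this issue; the table name, row schema, and field names are placeholders, not an established convention.

```python
from datetime import datetime, timezone

from google.cloud import bigquery  # pip install google-cloud-bigquery

# Project and dataset from this issue; the table name is a placeholder.
TABLE_ID = "dev-ai-research-0e2f897.custom_models.chat_baselines"

def push_baseline(model_name: str, scores: dict[str, float]) -> None:
    """Append one row per (model, tool) baseline score to BigQuery."""
    client = bigquery.Client(project="dev-ai-research-0e2f897")
    rows = [
        {
            "model": model_name,
            "tool": tool,
            "accuracy": accuracy,
            "run_at": datetime.now(timezone.utc).isoformat(),
        }
        for tool, accuracy in scores.items()
    ]
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```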