Test follow-up questions in Chat to ensure it holds conversational context and deals well with context switches

Problem to solve

One of the main values of an AI chat is its ability to handle follow-up questions (contextual conversation handling), allowing users to iterate towards a goal with the help of AI. Chat currently records a history of the conversation, which is sent with each new message. The efficacy of this strategy is unvalidated and unknown, and user feedback suggests that Duo Chat performs poorly on follow-up questions.

At the same time, we don't yet systematically test how Chat performs on follow-up questions: the CEF is set up to evaluate single query/response exchanges, rather than multiple exchanges over time.
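
To make the failure mode concrete, here is a minimal sketch of the history-with-every-message pattern described above, using a generic messages-array shape. The client call and field names are illustrative assumptions, not Duo Chat's actual request format:

```python
def call_model(messages):
    """Stand-in for the real LLM call (hypothetical)."""
    raise NotImplementedError

def ask(history, new_question):
    """Append the follow-up and send the *complete* history to the model."""
    history.append({"role": "user", "content": new_question})
    answer = call_model(history)
    history.append({"role": "assistant", "content": answer})
    return answer

# Earlier turns that a follow-up depends on:
history = [
    {"role": "user", "content": "How do I create a merge request?"},
    {"role": "assistant", "content": "Push a branch, then open an MR from it ..."},
]
# A follow-up such as "Can you show that as git commands?" only works
# if the model receives, and actually uses, the earlier turns above.
```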

Goals

  1. Does Duo Chat hold conversational context? If it doesn't,
    • Understand why conversational context gets lost.
    • Fix Duo Chat so it does not lose conversational context.
  2. Does Duo Chat deal well with conversational context switching by the user? If it doesn't,
    • Understand why context switching does not work in Duo Chat.
    • Fix Duo Chat so it deals well with conversational context switches.

Non-goals

Not a goal in the first iteration: Does Duo Chat contribute its part to making the conversation a good one, achieving goals such as:

  • Did Chat solve my problem?
  • Did Chat solve it efficiently?
  • Was Chat kind?

Why is this not a goal in the first iteration?

From our own experience with the bare-bones LLM, we observe that it almost never loses the historical context of a conversation. So it seems our chat system somehow breaks this, and that appears to be the biggest problem to solve.

Proposal

Part 1: Data collection & local evaluation

When this part is done, developers will have a dataset containing conversations that they can replay locally and automatically evaluate with an LLM judge. Then they can dig into the failures.
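
As a sketch only, a replayable conversation record could look like the following; the field names are hypothetical, not a finalized schema:

```python
# Hypothetical shape of one record in a conversation dataset
# (e.g. one JSON object per line of a JSONL file). Field names are
# illustrative, not a finalized schema.
conversation_record = {
    "conversation_id": "example-123",
    "turns": [
        {"role": "user", "content": "What does this pipeline error mean?"},
        {"role": "assistant", "content": "The job failed because ..."},
        # Follow-up that only makes sense given the earlier turns:
        {"role": "user", "content": "How do I fix it?"},
    ],
    # Optional label from the feedback modal described below:
    "feedback": "Problem with conversational context",
}
```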

  • Use existing conversation datasets & make them replayable locally (see the record sketch above and the replay sketch after this list)
  • Collecting more data:
    • Change the feedback modal to include “Problem with conversational context (either lost or context switch not understood)” as a selectable option. (who: @shinya.maeda; when: @shinya.maeda will add an ETA)
    • Find a way to produce conversation datasets from this feedback if the user has enabled logging. (who: @shinya.maeda; when: @shinya.maeda will add an ETA)
    • Ask internal users to turn on logging and to give feedback when the conversation fails
  • Build an LLM judge for evaluating whole conversations (see the sketch after this list)
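
A minimal sketch of how local replay and an LLM judge could fit together, assuming the record shape sketched above; `chat.send` and `judge.evaluate` are hypothetical clients, not the CEF's actual API:

```python
import json

def replay_and_judge(dataset_path, chat, judge):
    """Replay each recorded conversation turn by turn against the chat
    system, then ask an LLM judge whether context was held."""
    results = []
    with open(dataset_path) as f:
        for line in f:
            record = json.loads(line)
            history = []
            # Re-send only the user turns; collect fresh assistant replies.
            for turn in record["turns"]:
                if turn["role"] != "user":
                    continue
                history.append({"role": "user", "content": turn["content"]})
                reply = chat.send(history)  # hypothetical chat client
                history.append({"role": "assistant", "content": reply})
            verdict = judge.evaluate(  # hypothetical judge client
                transcript=history,
                question=(
                    "Did the assistant use earlier turns correctly when "
                    "answering follow-up questions, and did it handle any "
                    "topic switch by the user? Answer PASS or FAIL with a reason."
                ),
            )
            results.append(
                {"conversation_id": record["conversation_id"], "verdict": verdict}
            )
    return results
```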

Part 2: Understanding where it fails and fixing these issues

  • We will do the forensic work to figure out why the chat failed to hold context
  • We will adjust prompts, or whatever else needs adjusting, to address the problem
  • We may be able to use an LLM both to analyze the failure and to change the prompt (see the sketch after this list)
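
One possible shape for the LLM-assisted analysis, as a hedged sketch (the prompt wording and the `llm.complete` client are hypothetical):

```python
FAILURE_ANALYSIS_PROMPT = """\
You are debugging a chat system that failed to hold conversational context.
Below are the full transcript and the system prompt that was used.

Transcript:
{transcript}

System prompt:
{system_prompt}

1. Identify the turn at which context was lost and the most likely cause
   (e.g. history truncated, history not sent, instructions that make the
   model ignore prior turns).
2. Propose a concrete revision of the system prompt that would fix it.
"""

def analyze_failure(llm, transcript, system_prompt):
    """Ask an LLM (hypothetical client) to diagnose a failed conversation
    and to suggest a prompt fix; a human reviews the suggestion."""
    return llm.complete(
        FAILURE_ANALYSIS_PROMPT.format(
            transcript=transcript, system_prompt=system_prompt
        )
    )
```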

Further details

Links / references
