Test follow-up questions in Chat to ensure it holds conversational context and handles context switches well
Problem to solve
One of the main values of an AI chat is the ability to ask follow-up questions (contextual conversation handling), allowing users to iterate towards a goal with the help of AI. Chat currently records a history of the conversation, which is sent with each new message. The efficacy of this strategy is unvalidated and unknown, and user feedback suggests that Duo Chat performs poorly on follow-up questions.
- Customer quote: "I feel Duo Chat is on par with competition except possibly the conversation aspect; Duo Chat often can't reference the entire conversation the way others do."
- UX heuristic quote: "sometimes I ask it follow-up questions about previous things I've asked and it doesn't seem to remember the context. In one example, we went from forking a project to dependency proxy for packages. Another example was when I asked it to write unit tests for code it just wrote, and it said it didn't have any code to write unit tests about. It seemed completely oblivious to the conversation we were having."
- Trial customer quote: "When I ask follow-up questions, Duo tended to reply with 'Sorry Dave, I can't do that.'"
- Trial customer quote: "It doesn’t seem to remember context, so you have to give it the same context over and over again (7 out of 25)"
- Manual experiments by @nicollem also show that context seems to get lost randomly.
However, we currently don't systematically test how Chat performs on follow-up questions. The CEF is currently set up to evaluate single query/response exchanges rather than multiple exchanges over time.
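For illustration, here is a minimal sketch of the kind of flow a multi-turn evaluation has to cover: the recorded history travels with every new message, so judging a single query/response pair misses the accumulation. The payload shape and class name below are assumptions, not the actual Duo Chat API.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical sketch of how a chat could bundle history with each message."""

    # (role, content) pairs, e.g. ("user", "..."), ("assistant", "...")
    history: list[tuple[str, str]] = field(default_factory=list)

    def build_request(self, new_question: str) -> dict:
        """Send the whole recorded history alongside the new user message."""
        return {
            "messages": [
                {"role": role, "content": content} for role, content in self.history
            ]
            + [{"role": "user", "content": new_question}]
        }

    def record(self, question: str, answer: str) -> None:
        """Append the latest exchange so the next request carries it too."""
        self.history.append(("user", question))
        self.history.append(("assistant", answer))
```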
Goals
- Does Duo Chat hold conversational context? If it doesn't:
  - Understand why conversational context gets lost.
  - Fix Duo Chat so it does not lose conversational context.
- Does Duo Chat deal well with conversational context switching by the user? If it doesn't:
  - Understand why context switching does not work in Duo Chat.
  - Fix Duo Chat so it deals well with conversational context switches.
Non-goals
Not a goal in the first iteration: evaluating whether Duo Chat does its part to make the conversation a good one, achieving these and other goals:
- Did Chat solve my problem?
- Did Chat solve it efficiently?
- Was Chat kind?
Why is this not a goal in the first iteration?
From our own experience with the barebones LLM, we observe that it almost never loses the historical context of the conversation. So it seems our chat system is somehow losing that context along the way, and that appears to be the biggest problem to solve.
Proposal
Part 1: Data collection & local evaluation
When this part is done, developers will have a dataset of conversations that they can replay locally and automatically evaluate with an LLM judge (a sketch of such a replay-and-judge loop follows the list below). Then they can dig into the failures.
- Use existing conversation datasets & make them replay-able locally:
  - Existing conversations are in the chat bash dataset from July, which includes a few conversations, and in Nicolle's conversation in Figma.
  - The goal is to turn those into a dataset and make them replay-able so that we can test locally where Chat fails. (who: @shinya.maeda; when: @shinya.maeda will add an ETA)
- Collecting more data:
  - Change the feedback modal to include “Problem with conversational context (either lost or context switch not understood)” as an option to choose. (who: @shinya.maeda; when: @shinya.maeda will add an ETA)
  - Find a way to produce conversation datasets from this feedback if the user has enabled logging. (who: @shinya.maeda; when: @shinya.maeda will add an ETA)
  - Ask internal users to turn on logging and to give feedback when a conversation fails.
- Judge for evaluating conversations:
  - Build an LLM judge to evaluate conversations. (who: @HongtaoYang; when: @HongtaoYang will add an ETA)
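The sketch below shows what the replay-and-judge loop could look like once the dataset exists. The JSONL layout and the `send_to_chat` / `ask_judge` helpers are assumptions standing in for the tooling this part will produce, not existing functions.

```python
import json


def replay_conversation(turns: list[dict], send_to_chat, ask_judge) -> list[dict]:
    """Replay a recorded conversation turn by turn and judge each follow-up.

    `send_to_chat` and `ask_judge` are hypothetical callables: the first sends a
    question plus the accumulated history to a locally running Chat, the second
    asks a judge LLM whether the reply used the prior turns correctly.
    """
    history: list[dict] = []
    verdicts = []
    for turn in turns:
        reply = send_to_chat(history=history, question=turn["question"])
        verdicts.append(
            ask_judge(
                history=history,
                question=turn["question"],
                reply=reply,
                # e.g. "should reference the code written in turn 2"
                expectation=turn.get("expectation"),
            )
        )
        history.append({"role": "user", "content": turn["question"]})
        history.append({"role": "assistant", "content": reply})
    return verdicts


def replay_dataset(path: str, send_to_chat, ask_judge) -> None:
    """Run every conversation in a JSONL dataset and print turns that lost context."""
    with open(path) as f:
        for line in f:
            conversation = json.loads(line)
            verdicts = replay_conversation(conversation["turns"], send_to_chat, ask_judge)
            for i, verdict in enumerate(verdicts):
                if not verdict.get("held_context", True):
                    print(f"{conversation['id']} turn {i}: {verdict.get('reason')}")
```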
Part 2: Understanding where it fails and fixing these issues
- We will do the forensic work to figure out why Chat failed to hold context.
- We will adjust prompts, or whatever else needs to be adjusted, to address the problem.
- We may be able to use an LLM for both the analysis of the failure and the prompt change (see the sketch below).
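As a starting point for that LLM-assisted analysis, something like the following could work. The prompt wording and the `complete` wrapper are assumptions for illustration, not an existing part of the codebase.

```python
ANALYSIS_PROMPT = """\
You are reviewing a failed Duo Chat conversation.

Conversation (system prompt plus all turns):
{transcript}

The assistant's last reply lost the conversational context. Explain, step by step:
1. Which earlier turn(s) the reply should have used but did not.
2. The most likely cause (history truncated, history missing from the prompt,
   history ignored by the model, context switch misread, other).
3. A concrete change to the system prompt or history formatting that could fix it.
"""


def analyse_failure(transcript: str, complete) -> str:
    """Ask an LLM to diagnose a lost-context failure and suggest a prompt change.

    `complete` is a hypothetical callable wrapping whichever LLM we use for the
    analysis; it takes a prompt string and returns the model's text response.
    """
    return complete(ANALYSIS_PROMPT.format(transcript=transcript))
```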