Evaluate input/output of each prompt individually
Objective:
The prompt library gives us an overall estimate of Duo Chat quality. However, during development we found that individual prompts can become bottlenecks, causing the full system to return low-quality responses. Thus, to accelerate development and catch regressions early, we also need to evaluate each Duo Chat prompt individually against a given input/output dataset.
Metric:
Compare the output generated by the given prompt+model pair with the desired (expected) output from the dataset.
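A minimal sketch of such a per-prompt comparison, assuming dataset records with hypothetical `generated` and `expected` fields. The `SequenceMatcher` similarity here is only a placeholder scorer; a real evaluation could swap in embedding similarity or an LLM-as-judge instead:

```python
from difflib import SequenceMatcher


def score_response(generated: str, expected: str) -> float:
    """Return a 0..1 similarity score between generated and expected output.

    SequenceMatcher is a stand-in metric; the actual scoring method is TBD.
    """
    return SequenceMatcher(None, generated.strip(), expected.strip()).ratio()


def evaluate_prompt(dataset: list[dict]) -> float:
    """Average the per-record scores for a single prompt's dataset.

    Each record is assumed to hold the model's `generated` output and the
    desired `expected` output for one input.
    """
    scores = [score_response(r["generated"], r["expected"]) for r in dataset]
    return sum(scores) / len(scores) if scores else 0.0
```

Running this per prompt would surface which prompt's score drops, pinpointing the bottleneck before it degrades the full pipeline.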
Dataset:
TBD; to be generated with a Claude 3 model for each prompt used by Duo Chat.
Edited by Alexander Chueshev