Evaluate input/output of each prompt individually
Objective:
The prompt library gives us an overall estimate of Duo Chat quality. However, during development we found that individual prompts can become bottlenecks, causing the full system to return low-quality responses. Thus, to accelerate development and catch regressions early, we also need to evaluate each Duo Chat prompt individually against a given input/output dataset.
Metric:
Compare the output generated by the given prompt+model pair with the desired (expected) output from the dataset.
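A minimal sketch of such a per-prompt comparison, assuming dataset records with hypothetical `generated` and `expected` fields. The `SequenceMatcher` similarity here is only a placeholder scorer; a real evaluation could swap in embedding similarity or an LLM-as-judge instead:

```python
from difflib import SequenceMatcher


def score_response(generated: str, expected: str) -> float:
    """Return a 0..1 similarity score between generated and expected output.

    SequenceMatcher is a stand-in metric; the actual scoring method is TBD.
    """
    return SequenceMatcher(None, generated.strip(), expected.strip()).ratio()


def evaluate_prompt(dataset: list[dict]) -> float:
    """Average the per-record scores for a single prompt's dataset.

    Each record is assumed to hold the model's `generated` output and the
    desired `expected` output for one input.
    """
    scores = [score_response(r["generated"], r["expected"]) for r in dataset]
    return sum(scores) / len(scores) if scores else 0.0
```

Running this per prompt would surface which prompt's score drops, pinpointing the bottleneck before it degrades the full pipeline.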
Dataset:
TBD; to be generated with a Claude 3 model for each prompt used by Duo Chat.
Edited by Alexander Chueshev