Docs: How do I know if my Duo Chat prompt change makes Chat better?
Summary
We have several methods available for locally evaluating prompt changes to Duo Chat. This is separate from the Centralized Evaluation Framework (CEF) for Chat, which runs once daily on gitlab.com. This issue is about evaluating prompt changes before they reach gitlab.com.
The definition of done for this issue is that the various methods are documented and explained in the GitLab developer docs.
One side effect of this documentation should be a more streamlined, standardized approach to prompt change evaluation for Duo Chat.
General advice for prompt changes
- Do a local evaluation of any prompt change so that you have confidence it works well. The evaluation should build confidence but does not need to be comprehensive, nor should it take very long. If it is taking longer than a few hours, ask for help.
- Always introduce changes behind a feature flag.
- Enable the feature flag for @AndrasHerczegso so that the change is reflected in the CEF run (the run currently uses his PAT).
- If the numbers from CEF look good after the daily run, roll the change out to everyone.
- Remove the feature flag.
Local eval methods
Existing documentation: https://docs.gitlab.com/ee/development/ai_features/duo_chat.html#testing-gitlab-duo-chat
RSpec
chat_real_requests
- Good for:
- Evaluating results for zero-shot prompt changes (looks at tool selection only)
- Low effort (can be run via CI, so it doesn't even require an Anthropic API token)
- Not good for:
- Tool prompt changes (only looks at tool selection)
- Qualitative evaluation of responses (only gives a true/false result on whether the expected tool is selected)
qa_evaluation
- Good for:
- Evaluating the quality of results for Epic and Issue questions (uses two LLMs to evaluate answer correctness; see the sketch after this list)
- Low effort (can be run via CI, so it doesn't even require an Anthropic API token)
- Not good for:
- A quick test of one change (runs 93 test questions; running a smaller set would require manually commenting out some of the test code)
- Evaluating tools other than Epic/Issue (only asks issue and epic questions)
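For intuition, the approach behind qa_evaluation is LLM-as-judge: a second model is shown the question, the issue or epic context, and Duo Chat's answer, and asked whether the answer is correct. The real implementation lives in RSpec; the Python sketch below only illustrates the pattern, and the model choice and judge prompt wording are assumptions rather than the actual code.

```python
# Illustrative LLM-as-judge correctness check (not the real qa_evaluation code, which is RSpec).
# Assumes ANTHROPIC_API_KEY is set in the environment.
import anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = """You are grading an AI assistant's answer about a GitLab issue or epic.

Question: {question}
Context: {context}
Assistant answer: {answer}

Reply with exactly CORRECT or INCORRECT, followed by a one-sentence reason."""


def judge_answer(question: str, context: str, answer: str) -> bool:
    """Ask a second LLM whether Duo Chat's answer is correct for the given context."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model choice
        max_tokens=200,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return response.content[0].text.strip().upper().startswith("CORRECT")
```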
Python notebook
Existing documentation: (Internal only) Slack demo video from Lesley and (Internal only) Notebook template for tool selection
- Good for:
- Zero-shot changes (testing tool selection)
- Any evaluation that has a true/false answer (zero-shot is an example of this: we know which tool we expect for each question type; see the sketch below)
- Not good for:
- More complex evaluations, such as correctness or readability of answers.
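The notebook pattern is easy to reproduce: for each question we already know which tool Duo Chat should pick, so the evaluation is a pass/fail comparison. A minimal sketch follows; `ask_duo_chat_tool()` is a hypothetical helper that sends a question to your local Duo Chat (for example through the GDK GraphQL API) and returns the selected tool's name, and the questions and tool names are illustrative.

```python
# Minimal true/false tool-selection evaluation, in the spirit of the notebook template.
# Each case pairs a question with the single tool we expect Duo Chat to select.
CASES = [
    {"question": "Summarize issue #123 in gitlab-org/gitlab", "expected_tool": "IssueReader"},
    {"question": "What is the goal of epic &42 in gitlab-org?", "expected_tool": "EpicReader"},
    {"question": "How do I configure SAML SSO?", "expected_tool": "GitlabDocumentation"},
]


def evaluate_tool_selection(ask_duo_chat_tool):
    """Return overall accuracy plus a per-question breakdown of tool selection."""
    rows = []
    for case in CASES:
        selected = ask_duo_chat_tool(case["question"])  # hypothetical helper
        rows.append({**case, "selected_tool": selected, "correct": selected == case["expected_tool"]})
    accuracy = sum(row["correct"] for row in rows) / len(rows)
    return accuracy, rows
```

Run it once against the current prompt and once with your feature flag enabled, then compare the two accuracy numbers.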
Langsmith
Existing documentation: (Internal only) Google Drive demo video from Tim.
- Good for:
- Evaluating prompt versions/changes over time, seeing Langsmith traces for each run
- Running multiple types of evaluators (Langsmith has built-in evaluators, but you can also create your own; see the sketch after this list)
- Running tests against multiple datasets (datasets can be created and re-used).
- Not good for:
- Any evaluation type where we don't already have a Langsmith dataset, because one needs to be created first.
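As a concrete example, the sketch below creates a small Langsmith dataset and runs a custom evaluator against it with the `langsmith` Python SDK. It assumes `LANGCHAIN_API_KEY` is set, reuses the hypothetical `ask_duo_chat_tool()` helper from the notebook sketch above, and the dataset and experiment names are made up.

```python
# Sketch: create a reusable Langsmith dataset and run a custom tool-selection evaluator.
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# One-time setup: a dataset of questions with the tool we expect for each.
dataset = client.create_dataset("duo-chat-tool-selection", description="Zero-shot tool selection")
client.create_examples(
    inputs=[{"question": "Summarize issue #123 in gitlab-org/gitlab"}],
    outputs=[{"expected_tool": "IssueReader"}],
    dataset_id=dataset.id,
)


def target(inputs: dict) -> dict:
    # System under test: ask Duo Chat and record which tool it picked (hypothetical helper).
    return {"selected_tool": ask_duo_chat_tool(inputs["question"])}


def correct_tool(run, example) -> dict:
    # Custom evaluator: score 1 when the selected tool matches the expected tool.
    return {"key": "correct_tool",
            "score": int(run.outputs["selected_tool"] == example.outputs["expected_tool"])}


evaluate(
    target,
    data="duo-chat-tool-selection",
    evaluators=[correct_tool],
    experiment_prefix="zero-shot-prompt-change",
)
```

Each run shows up in Langsmith as an experiment on the dataset, so the same dataset can be reused to compare prompt versions over time.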
Prompt library local eval
Existing documentation: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/blob/main/doc/how-to/run_duo_chat_eval.md?ref_type=heads#configuring-duo-chat-with-local-gdk
More docs to be created via Documenting a detailed instruction guide on usi... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#298 - closed)
- Good for:
- Comprehensive evaluation (can use a large dataset)
- Qualitative evaluation (uses other LLMs to evaluate the quality of Chat responses)
- Not good for:
- Low-effort setup (requires cloning datasets, configuring the prompt library, multiple API keys, and importing epic and issue data into your GDK)
- Quick evaluation (takes several hours to run)
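Whichever method you use, the question in the title comes down to comparing two runs: one with the current prompt and one with your feature flag enabled. A small sketch of that comparison is below; the CSV layout (one row per test question with a numeric `score` column) is an assumption about how you exported your results, not a format any of the tools above guarantees.

```python
# Compare mean evaluation scores between a baseline run and a prompt-change run.
# Assumes each CSV has one row per test question with a numeric "score" column;
# adjust the paths and column name to match your exported results.
import csv
from statistics import mean


def mean_score(path: str, column: str = "score") -> float:
    with open(path, newline="") as f:
        return mean(float(row[column]) for row in csv.DictReader(f))


baseline = mean_score("results_baseline.csv")
candidate = mean_score("results_prompt_change.csv")
print(f"baseline={baseline:.3f} candidate={candidate:.3f} delta={candidate - baseline:+.3f}")
```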
Related Links
- Code Review process for Prompt Engineering (#461783 - closed)
- Run Duo Chat Evaluation (Prompt Library) on CI (#454406 - closed)
- How to prompt YouTube video from internal presentation (not specific to Duo Chat)