Split duochat documentation eval

Problem

eli5 implements an end-to-end approach to evaluating documentation search. This captures the user experience as a whole, but for a team trying to improve the score it is hard to decide where to focus, because a single score does not show which step of the request went wrong.

Proposed Solution

A request for documentation has a few different steps:

1. Identify from the input that documentation is needed
2. Format action input
3. Retrieve the correct documents based on the action input
4. Generate response based on results

Each of these steps needs to be logged so that it can be evaluated independently.
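
As a rough illustration of what per-step logging and evaluation could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `DocRequestTrace` and `Expected` records, their field names, the document IDs, and the choice of metrics (exact match for step 1, recall for step 3) are hypothetical and not part of eli5 or the existing Duo Chat code. Steps 2 and 4 are logged but not scored here, since they would likely need their own metrics.

```python
from dataclasses import dataclass, field


@dataclass
class DocRequestTrace:
    """Hypothetical log record: one field per step of a documentation request."""
    user_input: str
    needs_docs: bool          # step 1: was documentation identified as needed?
    action_input: str         # step 2: formatted input passed to retrieval
    retrieved_ids: list[str] = field(default_factory=list)  # step 3: documents returned
    response: str = ""        # step 4: generated answer


@dataclass
class Expected:
    """Hypothetical ground truth for one evaluation example."""
    needs_docs: bool
    relevant_ids: set[str]


def eval_identification(trace: DocRequestTrace, expected: Expected) -> bool:
    # Step 1 in isolation: did the system correctly decide docs were needed?
    return trace.needs_docs == expected.needs_docs


def eval_retrieval_recall(trace: DocRequestTrace, expected: Expected) -> float:
    # Step 3 in isolation: fraction of the relevant documents that were retrieved.
    if not expected.relevant_ids:
        return 1.0
    hits = expected.relevant_ids & set(trace.retrieved_ids)
    return len(hits) / len(expected.relevant_ids)


# Example with made-up data: identification succeeds, retrieval finds 1 of 2 documents.
trace = DocRequestTrace(
    user_input="How do I create a merge request?",
    needs_docs=True,
    action_input="create merge request",
    retrieved_ids=["doc-12", "doc-40"],
    response="(generated answer)",
)
expected = Expected(needs_docs=True, relevant_ids={"doc-12", "doc-7"})

scores = {
    "identification": eval_identification(trace, expected),
    "retrieval_recall": eval_retrieval_recall(trace, expected),
}
print(scores)
```

With a breakdown like this, a low end-to-end score can be traced to a specific step. In the made-up example, identification is correct but retrieval misses one of the two relevant documents, which would point the team at retrieval rather than at response generation.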