Integrate LangSmith Evaluation Framework with AIGW's Prompt Registry
Problem Statement
We need a streamlined way to connect our evaluation frameworks (the Centralized Evaluation Framework (CEF) and ELI5) to the prompt registry in the AI Gateway (AIGW). Today, our evaluation tooling lives separately from the prompt registry, which makes it hard to test prompts as part of the development process, especially when evaluating model upgrades such as the recent Claude 3.7 Sonnet work.
Proposed Solution
Implement the LangSmith evaluation approach using pytest/vitest integrations as demonstrated in the shared blog post (https://blog.langchain.dev/pytest-and-vitest-for-langsmith-evals/). This would allow us to:
- Convert ELI5 into a Python library that can be imported as a test dependency in AIGW
- Write standard unit tests against prompts in the registry (see the sketch after this list)
- Integrate evaluation directly into our CI/CD pipeline
- Better evaluate model upgrades and prompt changes with standardized metrics
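As a rough illustration, here is a minimal sketch of a LangSmith-tracked pytest test against a registry prompt. The `@pytest.mark.langsmith` decorator and the `langsmith.testing` helpers come from the langsmith SDK (installed with `pip install "langsmith[pytest]"`), as shown in the linked blog post; the `get_prompt` accessor, the prompt ID, and the `invoke` interface are hypothetical stand-ins for however AIGW's registry ends up exposing prompts:

```python
import pytest
from langsmith import testing as t

from ai_gateway.prompts import get_prompt  # hypothetical registry accessor


@pytest.mark.langsmith
def test_explain_code_prompt_names_the_language():
    question = "Explain this Ruby snippet: puts 1 + 1"
    t.log_inputs({"question": question})

    prompt = get_prompt("duo_chat/explain_code")  # hypothetical prompt ID
    answer = prompt.invoke({"question": question}).content
    t.log_outputs({"answer": answer})

    # Record a score in LangSmith in addition to the plain pytest pass/fail.
    mentions_language = "Ruby" in answer
    t.log_feedback(key="mentions_language", score=int(mentions_language))
    assert mentions_language
```

Because these are ordinary pytest tests, they would run in the existing CI/CD pipeline while still recording inputs, outputs, and feedback scores to LangSmith for comparison across model or prompt versions.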
Potential Approaches
- ELI5 as Python Package: Convert ELI5 into a slimmed-down Python package that can be installed as a test dependency in AIGW, enabling classic unit tests against prompts (a hypothetical sketch follows this list).
- LangSmith Integration: Explore the pytest/vitest integration for LangSmith evaluations to run more granular tests on prompts.
- Hybrid Approach: Continue the package approach (already in progress in MR !2030) while exploring the new LangSmith testing-framework integration in parallel.
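To make the package approach concrete, here is a hedged sketch of what a classic unit test could look like with ELI5 installed as a test-only dependency. The `eli5.evaluators` module, `load_evaluator`, the prompt accessor, and the 0.7 threshold are all assumptions about the eventual package API, not its current shape:

```python
import pytest

from eli5.evaluators import load_evaluator  # hypothetical package entry point
from ai_gateway.prompts import get_prompt   # hypothetical registry accessor


@pytest.mark.parametrize("question", [
    "What does this regex do: ^a+$",
    "Summarize this merge request diff.",
])
def test_duo_chat_prompt_meets_correctness_threshold(question):
    prompt = get_prompt("duo_chat/base")       # hypothetical prompt ID
    answer = prompt.invoke({"question": question}).content

    evaluator = load_evaluator("correctness")  # hypothetical evaluator name
    score = evaluator.score(question=question, answer=answer)
    assert score >= 0.7, f"correctness {score:.2f} below threshold for: {question}"
```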
Evaluation Metrics to Consider
Based on the Claude 3.7 Sonnet evaluation work:
- Quality metrics: helpfulness, comprehensiveness, correctness, readability, accuracy
- Performance metrics: p50 latency, p95/p99 latency, time to first token (see the measurement sketch after this list)
- Response characteristics: token count, conciseness
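A minimal sketch of how the latency metrics above could be collected, assuming a hypothetical streaming client; the percentile summary uses only the standard library:

```python
import time
from statistics import quantiles


def measure_latency(stream_completion, prompt: str) -> dict:
    """Time one streamed completion. `stream_completion` is a hypothetical
    client that yields response chunks as they arrive."""
    start = time.perf_counter()
    first_token_at = None
    for _chunk in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
    total = time.perf_counter() - start
    ttft = (first_token_at - start) if first_token_at is not None else None
    return {"time_to_first_token": ttft, "total": total}


def summarize(latencies: list[float]) -> dict:
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    pct = quantiles(latencies, n=100)
    return {"p50": pct[49], "p95": pct[94], "p99": pct[98]}
```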
Evaluation Improvement Needs
- Add first token latency measurements to evaluation UI (reference: issue #646)
- Add conciseness evaluators to Duo Chat evaluations (a simple heuristic is sketched below)
- Improve the post-processing/analysis steps in CEF to better understand which question types produce different response characteristics
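For illustration, one very simple conciseness heuristic could score answers against a word budget. The budget and the linear decay are assumptions made for this sketch; a production evaluator would more likely be token-based or LLM-judged:

```python
def conciseness_score(answer: str, word_budget: int = 150) -> float:
    """Return 1.0 for answers within the budget, decaying linearly to 0.0
    once the answer reaches twice the budget."""
    words = len(answer.split())
    if words <= word_budget:
        return 1.0
    overflow = (words - word_budget) / word_budget
    return max(0.0, 1.0 - overflow)
```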
Related Work
- Epic: Consolidate AI Evaluation Tooling #37
- Issue: Prompt Library #622
- MR: feat: integrate eli5 and add eval command !2030
- Issue (Duo Workflow): #244
- Issue: Show the first token latency in LangSmith #646
- Issue: Claude 3.7 Sonnet Duo Chat Rollout Plan #521058
Next Steps
- Review and discuss the approaches with the team to determine the best path forward
- Determine if we should pursue both approaches in parallel (ELI5 as a package and LangSmith integration)
- Create implementation plan with specific tasks
- Set up proof-of-concept integration to demonstrate viability
- Add conciseness and first-token latency metrics to evaluations