Integrate LangSmith Evaluation Framework with AIGW's Prompt Registry

Problem Statement

We need a streamlined way to connect our evaluation frameworks (the Centralized Evaluation Framework/CEF and ELI5) to the prompt registry in the AI Gateway (AIGW). Today our evaluation tooling lives separately from the prompt registry, which makes it difficult to test prompts as part of the development process, especially when evaluating model upgrades such as the recent Claude 3.7 Sonnet work.

Proposed Solution

Implement the LangSmith evaluation approach using pytest/vitest integrations as demonstrated in the shared blog post (https://blog.langchain.dev/pytest-and-vitest-for-langsmith-evals/). This would allow us to:

  1. Convert ELI5 into a Python library that can be imported as a test dependency in AIGW
  2. Write standard unit tests against prompts in the registry (a test sketch follows this list)
  3. Integrate evaluation directly into our CI/CD pipeline
  4. Better evaluate model upgrades and prompt changes with standardized metrics
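As a rough illustration, here is a minimal sketch of such a test using the LangSmith pytest plugin described in the blog post. The registry lookup and model call (`load_prompt`, `run_prompt`) are placeholder stand-ins, since a test-facing API for the AIGW prompt registry is exactly what this proposal would define.

```python
import pytest
from langsmith import testing as t


def load_prompt(name: str) -> str:
    """Placeholder: fetch a prompt template from the AIGW prompt registry."""
    return "Explain the following code to the user:\n\n{code}"


def run_prompt(template: str, code: str) -> str:
    """Placeholder: render the template and call the model behind it."""
    return f"This Ruby method adds two numbers. (prompt: {template.format(code=code)!r})"


@pytest.mark.langsmith  # records this test run as a LangSmith experiment
def test_code_explanation_prompt_names_the_language():
    code = "def sum(a, b) = a + b"
    t.log_inputs({"code": code})

    response = run_prompt(load_prompt("code_explanation/base"), code)
    t.log_outputs({"response": response})

    # Cheap deterministic assertion; LLM-judge feedback can be logged as well.
    assert "Ruby" in response
```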

Potential Approaches

  1. ELI5 as Python Package: Convert ELI5 into a trimmed-down Python package that can be installed as a test dependency in AIGW, enabling classic unit tests against prompts (a hypothetical sketch follows this list).
  2. LangSmith Integration: Explore the pytest/vitest integration for LangSmith evaluations to run more granular tests on prompts.
  3. Hybrid Approach: Continue the package approach (already in progress in MR !2030) while exploring the LangSmith testing framework integration in parallel.
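For the package approach specifically, its use inside AIGW's test suite could look roughly like the following. Everything here is hypothetical (the `eli5_evals` package name, the `CorrectnessEvaluator` class, and its scoring interface); the real API will come out of the MR !2030 work. The point is only the shape: ELI5 installed as a test dependency and driven from ordinary pytest tests.

```python
import pytest
from eli5_evals import CorrectnessEvaluator  # hypothetical package and class


@pytest.fixture
def evaluator() -> CorrectnessEvaluator:
    # Hypothetical constructor; thresholds would be defined by the real package.
    return CorrectnessEvaluator(threshold=0.8)


def test_duo_chat_prompt_correctness(evaluator):
    response = "GitLab CI pipelines are configured in .gitlab-ci.yml."
    reference = "Pipelines are defined in the .gitlab-ci.yml file."

    # Hypothetical scoring call; the assertion style is the point, not the API.
    score = evaluator.score(response=response, reference=reference)
    assert score >= 0.8
```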

Evaluation Metrics to Consider

Based on the Claude 3.7 Sonnet evaluation work:

  • Quality metrics: helpfulness, comprehensiveness, correctness, readability, accuracy
  • Performance metrics: p50 latency, p95/p99 latency, time to first token (a measurement sketch follows this list)
  • Response characteristics: token count, conciseness
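For the performance metrics, a minimal measurement sketch (standard library only) could look like this; `stream_completion` is a stand-in for whichever streaming client the evaluation harness actually uses.

```python
import time
from statistics import quantiles
from typing import Iterable, Iterator


def stream_completion(prompt: str) -> Iterator[str]:
    """Placeholder streaming generator; a real client yields tokens/chunks."""
    for token in ["Pipelines", " are", " defined", " in", " .gitlab-ci.yml"]:
        time.sleep(0.01)
        yield token


def measure_request(prompt: str) -> tuple[float, float]:
    """Return (time_to_first_token, total_latency) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    for _chunk in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
    total = time.perf_counter() - start
    return first_token_at - start, total


def latency_percentiles(samples: Iterable[float]) -> dict[str, float]:
    """p50/p95/p99 over a set of latency samples."""
    cuts = quantiles(sorted(samples), n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    runs = [measure_request("Explain CI pipelines") for _ in range(20)]
    ttft, total = zip(*runs)
    print("time to first token:", latency_percentiles(ttft))
    print("total latency:      ", latency_percentiles(total))
```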

Evaluation Improvement Needs

  1. Add first-token latency measurements to the evaluation UI (reference: issue #646)
  2. Add conciseness evaluators to Duo Chat evaluations (a scoring sketch follows this list)
  3. Improve the post-processing/analysis steps in CEF to better understand which types of questions produce different response characteristics
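As a starting point for item 2, a conciseness score could be as simple as the word-budget heuristic below; the budget and linear penalty are illustrative assumptions, not CEF's actual metric definition, and an LLM-judge evaluator could replace them.

```python
def conciseness_score(response: str, word_budget: int = 150) -> float:
    """1.0 when the response fits the word budget, decaying linearly toward 0.0;
    a response at twice the budget (or longer) scores 0.0."""
    words = len(response.split())
    if words <= word_budget:
        return 1.0
    return max(0.0, 1.0 - (words - word_budget) / word_budget)


if __name__ == "__main__":
    print(conciseness_score("Use `git rebase -i` to squash commits."))  # 1.0
    print(conciseness_score("word " * 300))                             # 0.0
```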

Related Work

  • MR !2030: ELI5 as an installable Python package (in progress)
  • Issue #646: add first-token latency measurements to the evaluation UI
  • Claude 3.7 Sonnet model upgrade evaluation (source of the metrics listed above)

Next Steps

  1. Review and discuss the approaches with the team to determine the best path forward
  2. Determine if we should pursue both approaches in parallel (ELI5 as a package and LangSmith integration)
  3. Create implementation plan with specific tasks
  4. Set up proof-of-concept integration to demonstrate viability
  5. Add conciseness and first-token latency metrics to evaluations