Integrate LangSmith Evaluation Framework with AIGW's Prompt Registry

Problem Statement

We need a streamlined way to connect our evaluation frameworks (the Centralized Evaluation Framework/CEF and ELI5) to the prompt registry in the AI Gateway (AIGW). Today our evaluation tooling lives separately from the prompt registry, which makes it difficult to test prompts as part of the development process, especially when evaluating model upgrades such as the recent Claude 3.7 Sonnet work.

Proposed Solution

Implement the LangSmith evaluation approach using pytest/vitest integrations as demonstrated in the shared blog post (https://blog.langchain.dev/pytest-and-vitest-for-langsmith-evals/). This would allow us to:

  1. Convert ELI5 into a Python library that can be imported as a test dependency in AIGW
  2. Write standard unit tests against prompts in the registry (a test sketch follows this list)
  3. Integrate evaluation directly into our CI/CD pipeline
  4. Better evaluate model upgrades and prompt changes with standardized metrics
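As a rough illustration, here is a minimal sketch of such a test using the LangSmith pytest plugin described in the blog post. The registry lookup and model call (`load_prompt`, `run_prompt`) are placeholder stand-ins, since a test-facing API for the AIGW prompt registry is exactly what this proposal would define.

```python
import pytest
from langsmith import testing as t


def load_prompt(name: str) -> str:
    """Placeholder: fetch a prompt template from the AIGW prompt registry."""
    return "Explain the following code to the user:\n\n{code}"


def run_prompt(template: str, code: str) -> str:
    """Placeholder: render the template and call the model behind it."""
    return f"This Ruby method adds two numbers. (prompt: {template.format(code=code)!r})"


@pytest.mark.langsmith  # records this test run as a LangSmith experiment
def test_code_explanation_prompt_names_the_language():
    code = "def sum(a, b) = a + b"
    t.log_inputs({"code": code})

    response = run_prompt(load_prompt("code_explanation/base"), code)
    t.log_outputs({"response": response})

    # Cheap deterministic assertion; LLM-judge feedback can be logged as well.
    assert "Ruby" in response
```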

Potential Approaches

  1. ELI5 as Python Package: Convert ELI5 into a trimmed-down Python package that can be installed as a test dependency in AIGW, enabling classic unit tests against prompts (a hypothetical sketch follows this list).
  2. LangSmith Integration: Explore the pytest/vitest integration for LangSmith evaluations to run more granular tests on prompts.
  3. Hybrid Approach: Continue the package approach (already in progress in MR !2030) while exploring the LangSmith testing framework integration in parallel.
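For the package approach specifically, its use inside AIGW's test suite could look roughly like the following. Everything here is hypothetical (the `eli5_evals` package name, the `CorrectnessEvaluator` class, and its scoring interface); the real API will come out of the MR !2030 work. The point is only the shape: ELI5 installed as a test dependency and driven from ordinary pytest tests.

```python
import pytest
from eli5_evals import CorrectnessEvaluator  # hypothetical package and class


@pytest.fixture
def evaluator() -> CorrectnessEvaluator:
    # Hypothetical constructor; thresholds would be defined by the real package.
    return CorrectnessEvaluator(threshold=0.8)


def test_duo_chat_prompt_correctness(evaluator):
    response = "GitLab CI pipelines are configured in .gitlab-ci.yml."
    reference = "Pipelines are defined in the .gitlab-ci.yml file."

    # Hypothetical scoring call; the assertion style is the point, not the API.
    score = evaluator.score(response=response, reference=reference)
    assert score >= 0.8
```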

Evaluation Metrics to Consider

Based on the Claude 3.7 Sonnet evaluation work:

  • Quality metrics: helpfulness, comprehensiveness, correctness, readability, accuracy
  • Performance metrics: p50 latency, p95/p99 latency, time to first token (a measurement sketch follows this list)
  • Response characteristics: token count, conciseness
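For the performance metrics, a minimal measurement sketch (standard library only) could look like this; `stream_completion` is a stand-in for whichever streaming client the evaluation harness actually uses.

```python
import time
from statistics import quantiles
from typing import Iterable, Iterator


def stream_completion(prompt: str) -> Iterator[str]:
    """Placeholder streaming generator; a real client yields tokens/chunks."""
    for token in ["Pipelines", " are", " defined", " in", " .gitlab-ci.yml"]:
        time.sleep(0.01)
        yield token


def measure_request(prompt: str) -> tuple[float, float]:
    """Return (time_to_first_token, total_latency) for one streamed request."""
    start = time.perf_counter()
    first_token_at = None
    for _chunk in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
    total = time.perf_counter() - start
    return first_token_at - start, total


def latency_percentiles(samples: Iterable[float]) -> dict[str, float]:
    """p50/p95/p99 over a set of latency samples."""
    cuts = quantiles(sorted(samples), n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    runs = [measure_request("Explain CI pipelines") for _ in range(20)]
    ttft, total = zip(*runs)
    print("time to first token:", latency_percentiles(ttft))
    print("total latency:      ", latency_percentiles(total))
```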

Evaluation Improvement Needs

  1. Add first-token latency measurements to the evaluation UI (reference: issue #646)
  2. Add conciseness evaluators to Duo Chat evaluations (a scoring sketch follows this list)
  3. Improve the post-processing/analysis steps in CEF to better understand which types of questions produce different response characteristics
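As a starting point for item 2, a conciseness score could be as simple as the word-budget heuristic below; the budget and linear penalty are illustrative assumptions, not CEF's actual metric definition, and an LLM-judge evaluator could replace them.

```python
def conciseness_score(response: str, word_budget: int = 150) -> float:
    """1.0 when the response fits the word budget, decaying linearly toward 0.0;
    a response at twice the budget (or longer) scores 0.0."""
    words = len(response.split())
    if words <= word_budget:
        return 1.0
    return max(0.0, 1.0 - (words - word_budget) / word_budget)


if __name__ == "__main__":
    print(conciseness_score("Use `git rebase -i` to squash commits."))  # 1.0
    print(conciseness_score("word " * 300))                             # 0.0
```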

Related Work

  • MR !2030: ELI5 as an installable Python package (in progress)
  • Issue #646: add first-token latency measurements to the evaluation UI
  • Claude 3.7 Sonnet model upgrade evaluation (source of the metrics listed above)

Next Steps

  1. Review and discuss the approaches with the team to determine the best path forward
  2. Determine if we should pursue both approaches in parallel (ELI5 as a package and LangSmith integration)
  3. Create implementation plan with specific tasks
  4. Set up proof-of-concept integration to demonstrate viability
  5. Add conciseness and first-token latency metrics to evaluations