Draft blueprint to consolidate evaluation tooling (CEF, ELI5, LangSmith)

Today, we have various tooling for evaluations without a clear strategy on when to use which, or what the differences are. In this issue, we aim to clarify that, whether by consolidating tooling into a single strategy, by defining a clear separation of responsibilities, or by some other approach. We will do this in the form of an architectural blueprint, because any proposed changes will need to be planned for implementation; we are therefore not limited by what exists today, but by what we want to see in the future.

Tooling

Centralized Evaluation Framework:

  • Used by feature teams for evaluating features at scale; representative of production.
  • Used by stakeholders to understand how quality is tracking over time (trending up or trending down).
  • Generally most relevant at the end of a feature's development cycle, due to its alignment with production and its scale.
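To make the stakeholder use case concrete, here is a minimal sketch of how a quality trend could be derived from per-run results. All names (`EvalRun`, `trend`) and the schema are illustrative assumptions, not the actual CEF data model.

```python
# Hypothetical sketch: deriving a stakeholder-facing quality trend from
# per-run evaluation results. The EvalRun shape is an assumption; the
# real CEF schema may differ.
from dataclasses import dataclass

@dataclass
class EvalRun:
    date: str    # ISO date of the evaluation run
    passed: int  # examples that met the quality bar
    total: int   # examples evaluated

def pass_rate(run: EvalRun) -> float:
    return run.passed / run.total

def trend(runs: list[EvalRun]) -> str:
    """Compare the latest run's pass rate to the previous one."""
    if len(runs) < 2:
        return "insufficient data"
    delta = pass_rate(runs[-1]) - pass_rate(runs[-2])
    if delta > 0:
        return "trending up"
    if delta < 0:
        return "trending down"
    return "flat"

runs = [
    EvalRun("2024-05-01", passed=70, total=100),
    EvalRun("2024-06-01", passed=82, total=100),
]
print(trend(runs))  # trending up
```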

LangSmith:

  • Used by feature teams to experiment on prompts, whether for new features or for existing features with rapid iteration cycles.
  • Used by feature teams to start their datasets, and by stakeholders and counterparts to add new failure examples to datasets.
  • Generally most relevant on Day 1 of feature development and when making targeted changes to a prompt (running datasets against individual prompt changes), due to its ease of use and quick feedback cycle.
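The Day-1 loop described above can be sketched as: seed a small dataset of examples (including failure examples as they are reported), then score each prompt revision against it. This is an illustrative stand-in, not the LangSmith SDK itself; `run_prompt` and the exact-match evaluator are assumptions, though the `inputs`/`expected` example shape mirrors how LangSmith datasets are organized.

```python
# Illustrative sketch of the rapid prompt-iteration loop. run_prompt is
# a stand-in for a real LLM call; the dataset shape is modeled loosely
# on LangSmith examples (inputs plus an expected output).

failure_examples = [
    {"inputs": {"question": "2+2?"}, "expected": "4"},
    {"inputs": {"question": "Capital of France?"}, "expected": "Paris"},
]

def run_prompt(prompt: str, inputs: dict) -> str:
    # Stand-in for invoking the model with the candidate prompt.
    lookup = {"2+2?": "4", "Capital of France?": "Paris"}
    return lookup.get(inputs["question"], "")

def evaluate(prompt: str, dataset: list[dict]) -> float:
    """Exact-match score across the dataset; returns fraction correct."""
    correct = sum(
        run_prompt(prompt, ex["inputs"]) == ex["expected"] for ex in dataset
    )
    return correct / len(dataset)

print(evaluate("v1: answer concisely", failure_examples))  # 1.0
```

In practice the dataset would live in LangSmith so that stakeholders and counterparts can append new failure examples, and each prompt revision would be scored against the same dataset for a quick feedback cycle.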

ELI5:

  • Used by feature teams as a wrapper around LangSmith to automate the creation of their datasets, evaluation scripts, and CI/CD pipelines, and the collection of their results.
  • Generally most relevant once a feature is one we know we want to invest in further, by offering automations for the common tasks required while advancing a feature to GA.
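One of the common tasks worth automating on the road to GA is a CI/CD quality gate: run the feature's dataset on every change and fail the pipeline if the score regresses. A minimal sketch, assuming a hypothetical `ci_gate` helper and threshold (neither is part of ELI5 today):

```python
# Hypothetical sketch of a CI/CD quality gate of the kind ELI5 could
# generate: fail the pipeline when the mean evaluation score drops
# below a fixed bar. THRESHOLD and ci_gate are illustrative names.

THRESHOLD = 0.9

def ci_gate(scores: list[float], threshold: float = THRESHOLD) -> bool:
    """Return True (pipeline passes) iff the mean score meets the bar."""
    mean = sum(scores) / len(scores)
    return mean >= threshold

# A regression drops one example's score; the gate catches it.
print(ci_gate([1.0, 1.0, 0.5]))    # False
print(ci_gate([1.0, 0.95, 0.95]))  # True
```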

Edited by Michelle Gill