Decouple LLM generation from evaluation command
## Problem to solve
The evaluation command processes test cases sequentially:
```mermaid
flowchart LR
    A[Read from LangSmith] --> B[Request LLM generation] --> C[Evaluate LLM response with LLM judge]
```
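For concreteness, here is a minimal sketch of that sequential flow using the LangSmith and OpenAI Python SDKs. The dataset field names ("question", "answer"), the model, and the judge prompt are illustrative assumptions, not the command's actual implementation.

```python
# Minimal sketch of the current sequential flow. The dataset field names,
# model, and judge prompt are placeholder assumptions, not the evaluation
# command's real configuration.
from langsmith import Client
from openai import OpenAI

langsmith_client = Client()
llm = OpenAI()


def run_evaluation_sequentially(dataset_name: str) -> list[dict]:
    results = []
    for example in langsmith_client.list_examples(dataset_name=dataset_name):
        # First API call per test case: generate the candidate response.
        generation = llm.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": example.inputs["question"]}],
        ).choices[0].message.content

        # Second API call per test case: score the response with an LLM judge.
        verdict = llm.chat.completions.create(
            model="gpt-4o-mini",  # placeholder judge model
            messages=[{
                "role": "user",
                "content": (
                    "Rate the answer from 1 to 5 given the question and reference.\n"
                    f"Question: {example.inputs['question']}\n"
                    f"Reference: {(example.outputs or {}).get('answer')}\n"
                    f"Answer: {generation}"
                ),
            }],
        ).choices[0].message.content

        results.append({"example_id": str(example.id), "generation": generation, "score": verdict})
    return results
```

Every iteration blocks on two network round trips, which is where the latency and throughput problems below come from.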
This approach has several drawbacks:
- Higher latency
  - Each test case requires two separate API calls (generation and evaluation).
  - Network round-trip time compounds with each sequential request.
  - Total evaluation time scales linearly with the number of test cases.
- Limited throughput
  - The number of test cases processed per hour is severely constrained by standard API rate limits.
  - Sequential processing makes it impossible to take advantage of horizontal scaling opportunities.
- Higher API costs
  - Many providers charge a per-request fee in addition to token costs.
  - Providers such as OpenAI and Anthropic offer discounted pricing for batch requests, which the sequential flow cannot use (see the sketch after this list).
- Error handling complexity
  - A failure in a single test case can halt the entire evaluation pipeline.
  - Restart logic needs to track which test cases have already been processed.
- Difficult to iterate on evaluation metrics
  - If we want to experiment with a new metric or a different LLM-judge model, we have to re-run all of the generations as well.
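To make the batch-pricing and re-evaluation points concrete, below is a minimal sketch of what routing the generation step through OpenAI's Batch API could look like. The `custom_id` join key, the "question" field name, the model, and the file path are illustrative assumptions; the point is that generations are submitted in bulk and their outputs land in a file that a later evaluation step can read.

```python
# Illustrative sketch only: submit every generation request as one batch job.
# The "question" field name, model, and file path are assumptions.
import json

from langsmith import Client
from openai import OpenAI

langsmith_client = Client()
openai_client = OpenAI()


def submit_generation_batch(dataset_name: str, requests_path: str = "generation_requests.jsonl") -> str:
    # 1. Write one request per test case in the Batch API's JSONL input format.
    with open(requests_path, "w") as f:
        for example in langsmith_client.list_examples(dataset_name=dataset_name):
            f.write(json.dumps({
                "custom_id": str(example.id),  # lets us join outputs back to test cases later
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",  # placeholder model
                    "messages": [{"role": "user", "content": example.inputs["question"]}],
                },
            }) + "\n")

    # 2. Upload the file and create the batch job; results arrive asynchronously.
    input_file = openai_client.files.create(file=open(requests_path, "rb"), purpose="batch")
    batch = openai_client.batches.create(
        input_file_id=input_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return batch.id  # poll later, e.g. openai_client.batches.retrieve(batch.id)
```

Once the job completes, its output file holds every generation keyed by `custom_id`, so the evaluation step can score those responses with any judge or metric, and re-score them later, without paying for generation again.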
## Proposal
## Further details
TBA
## Links / references