
Decouple LLM generation from evaluation command

Problem to solve

The evaluation command processes test cases sequentially.

flowchart LR
  A[Read from LangSmith] --> B[Request for LLM generation] --> C[Evaluate LLM response with LLM judge]
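
For illustration, the coupled flow looks roughly like the sketch below. This is a minimal sketch only, assuming a Python evaluation command built on the LangSmith SDK and an OpenAI client; the dataset name "eval-cases", the input/output field names, the model choices, and the judge prompt are assumptions, not the real implementation.

```python
# Minimal sketch of the current coupled flow (illustrative only).
# Assumptions: a LangSmith dataset named "eval-cases" with "question"/"answer"
# fields, an OpenAI client, and the model names shown here.
from langsmith import Client
from openai import OpenAI

langsmith = Client()
llm = OpenAI()

for example in langsmith.list_examples(dataset_name="eval-cases"):
    # 1) Request LLM generation (first API round-trip per test case).
    generation = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": example.inputs["question"]}],
    ).choices[0].message.content

    # 2) Evaluate the response with an LLM judge (second API round-trip).
    verdict = llm.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Reference answer: {example.outputs['answer']}\n"
                f"Candidate answer: {generation}\n"
                "Does the candidate match the reference? Answer yes or no."
            ),
        }],
    ).choices[0].message.content

    print(example.id, verdict)
```

Each iteration blocks on two network round-trips and a single failure stops the loop, which is what the drawbacks below describe.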

This approach has several drawbacks.

  • Higher latency
    • Each test case requires two separate API calls (generation and evaluation).
    • Network round-trip time compounds with each sequential request.
    • Total evaluation time scales linearly with the number of test cases.
  • Limited throughput
    • The maximum number of test cases processed per hour is severely constrained by standard (synchronous) API rate limits.
    • The sequential design cannot take advantage of horizontal scaling opportunities.
  • Higher API costs
    • Many providers charge a per-request fee in addition to token costs.
    • Providers such as OpenAI and Anthropic often offer discounted pricing for batch requests.
  • Error handling complexity
    • A failure in a single test case can halt the entire evaluation pipeline.
    • Restart logic needs to track which test cases have already been processed.
  • Difficult to iterate on evaluation metrics
    • Experimenting with a new metric, LLM-judge model, or judge prompt currently requires re-running all generations, because responses are not persisted independently of their evaluation (see the sketch after this list).
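
To make the decoupling concrete, below is a hedged sketch of one possible two-stage flow: stage one submits all generation requests through OpenAI's Batch API (priced at a discount relative to synchronous calls) and persists them as a batch job, and stage two evaluates the stored responses with the LLM judge, so a new metric or judge model only requires re-running stage two. This is one possible shape, not the proposal itself; the dataset name, file names, and models are assumptions.

```python
# Stage 1: write all generation requests to a JSONL file, submit them as one
# discounted batch job, and keep the batch ID so generations are persisted
# independently of any evaluation run. Illustrative sketch only; the dataset
# name, file names, and models are assumptions.
import json
from langsmith import Client
from openai import OpenAI

langsmith = Client()
llm = OpenAI()

with open("generation_requests.jsonl", "w") as f:
    for example in langsmith.list_examples(dataset_name="eval-cases"):
        f.write(json.dumps({
            "custom_id": str(example.id),
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "user", "content": example.inputs["question"]},
                ],
            },
        }) + "\n")

batch_input = llm.files.create(
    file=open("generation_requests.jsonl", "rb"), purpose="batch"
)
batch = llm.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print("Submitted generation batch:", batch.id)

# Stage 2 (run separately, and re-run freely when the metric or judge changes):
# once batch.status == "completed", download llm.files.content(batch.output_file_id),
# map each result back to its custom_id, and score it with the LLM judge.
```

Because generations live in the batch output file rather than inside the evaluation loop, swapping the judge model or adding a metric only re-runs stage two, and a failed test case only needs its own custom_id re-submitted.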

Proposal

Further details

TBA

Links / references
