
Decouple LLM generation from evaluation command

Problem to solve

The evaluation command processes test cases sequentially.

flowchart LR
  A[Read from LangSmith] --> B[Request for LLM generation] --> C[Evaluate LLM response with LLM judge]
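
For illustration, the coupled flow looks roughly like the sketch below. This is a minimal sketch only, assuming a Python evaluation command built on the LangSmith SDK and an OpenAI client; the dataset name "eval-cases", the input/output field names, the model choices, and the judge prompt are assumptions, not the real implementation.

```python
# Minimal sketch of the current coupled flow (illustrative only).
# Assumptions: a LangSmith dataset named "eval-cases" with "question"/"answer"
# fields, an OpenAI client, and the model names shown here.
from langsmith import Client
from openai import OpenAI

langsmith = Client()
llm = OpenAI()

for example in langsmith.list_examples(dataset_name="eval-cases"):
    # 1) Request LLM generation (first API round-trip per test case).
    generation = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": example.inputs["question"]}],
    ).choices[0].message.content

    # 2) Evaluate the response with an LLM judge (second API round-trip).
    verdict = llm.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Reference answer: {example.outputs['answer']}\n"
                f"Candidate answer: {generation}\n"
                "Does the candidate match the reference? Answer yes or no."
            ),
        }],
    ).choices[0].message.content

    print(example.id, verdict)
```

Each iteration blocks on two network round-trips and a single failure stops the loop, which is what the drawbacks below describe.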

This approach has several drawbacks.

  • Higher latency
    • Each test case requires two separate API calls (generation and evaluation).
    • Network round-trip time compounds with each sequential request.
    • Total evaluation time scales linearly with the number of test cases.
  • Limited throughput
    • The maximum number of test cases processed per hour is severely constrained by standard (synchronous) API rate limits.
    • The sequential design cannot take advantage of horizontal scaling opportunities.
  • Higher API costs
    • Many providers charge a per-request fee in addition to token costs.
    • Providers such as OpenAI and Anthropic often offer discounted pricing for batch requests.
  • Error handling complexity
    • A failure in a single test case can halt the entire evaluation pipeline.
    • Restart logic needs to track which test cases have already been processed.
  • Difficult to iterate on evaluation metrics
    • Experimenting with a new metric, LLM-judge model, or judge prompt currently requires re-running all generations, because responses are not persisted independently of their evaluation (see the sketch after this list).
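
To make the decoupling concrete, below is a hedged sketch of one possible two-stage flow: stage one submits all generation requests through OpenAI's Batch API (priced at a discount relative to synchronous calls) and persists them as a batch job, and stage two evaluates the stored responses with the LLM judge, so a new metric or judge model only requires re-running stage two. This is one possible shape, not the proposal itself; the dataset name, file names, and models are assumptions.

```python
# Stage 1: write all generation requests to a JSONL file, submit them as one
# discounted batch job, and keep the batch ID so generations are persisted
# independently of any evaluation run. Illustrative sketch only; the dataset
# name, file names, and models are assumptions.
import json
from langsmith import Client
from openai import OpenAI

langsmith = Client()
llm = OpenAI()

with open("generation_requests.jsonl", "w") as f:
    for example in langsmith.list_examples(dataset_name="eval-cases"):
        f.write(json.dumps({
            "custom_id": str(example.id),
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "user", "content": example.inputs["question"]},
                ],
            },
        }) + "\n")

batch_input = llm.files.create(
    file=open("generation_requests.jsonl", "rb"), purpose="batch"
)
batch = llm.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print("Submitted generation batch:", batch.id)

# Stage 2 (run separately, and re-run freely when the metric or judge changes):
# once batch.status == "completed", download llm.files.content(batch.output_file_id),
# map each result back to its custom_id, and score it with the LLM judge.
```

Because generations live in the batch output file rather than inside the evaluation loop, swapping the judge model or adding a metric only re-runs stage two, and a failed test case only needs its own custom_id re-submitted.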

Proposal

Further details

TBA

Links / references
