POC -Implement a LiteLLM load balancer

Problem to solve

Since we plan to integrate multiple LLM providers (e.g., Vertex AI, Bedrock), we need to build a request distribution system that routes identical feature requests across different providers and models to optimize our capacity utilization.

Proposal

Let's use LangChain’s LiteLLM routing integration (which wraps LiteLLM's Router) to distribute requests across provider/model combinations. For the first iteration, we will keep routing simple (e.g., static weighting or round-robin) without token/latency/error-rate–based strategies. In later iterations, we can introduce Redis as the state store to enable advanced routing strategies based on token consumption, latency, and error rates.
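To illustrate the first iteration, here is a minimal sketch of static-weight selection over a LiteLLM-style `model_list` (LiteLLM calls this strategy "simple-shuffle"). The deployment names and weights below are hypothetical placeholders, not our actual configuration, and the pool is kept as plain Python so no shared state store is needed:

```python
import random

# Illustrative provider/model pool in a LiteLLM-style model_list shape.
# Model identifiers and weights are hypothetical examples.
MODEL_LIST = [
    {"model_name": "gitlab-default",
     "litellm_params": {"model": "vertex_ai/example-model", "weight": 2}},
    {"model_name": "gitlab-default",
     "litellm_params": {"model": "bedrock/example-model", "weight": 1}},
]

def pick_deployment(model_name: str, model_list=MODEL_LIST, rng=random):
    """Static weighted selection: no token, latency, or error-rate
    accounting, so no Redis state store is required yet."""
    candidates = [d for d in model_list if d["model_name"] == model_name]
    weights = [d["litellm_params"].get("weight", 1) for d in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A later iteration would replace this stateless pick with a Router backed by Redis, so that per-deployment token consumption, latency, and error rates can drive the choice.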

Further details

  • We need to modify our unit primitives' default model data structure from its current format to an array, then transform our existing data structure to match LiteLLM's expected format.
  • Load balancing only applies to the “GitLab default” model, i.e., any models that users have selected via model selection will not be affected by the load balancing mechanism.
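The first bullet above could be sketched as a transformation like the following. The field names (`feature`, `default_model`) are assumptions about the unit primitive's current shape, not the real schema; the output mirrors LiteLLM's `model_list` format of `model_name`/`litellm_params` entries:

```python
def to_model_list(unit_primitive: dict) -> list[dict]:
    """Turn a unit primitive's default model (currently a single value,
    future: an array) into a LiteLLM-style model_list."""
    default = unit_primitive["default_model"]
    # Accept both the current scalar shape and the proposed array shape.
    models = default if isinstance(default, list) else [default]
    return [
        {"model_name": unit_primitive["feature"],  # shared alias the router balances over
         "litellm_params": {"model": m}}
        for m in models
    ]
```

Entries that share a `model_name` are what the router load-balances across, which is why the default model must become an array.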

Links / references

Edited by Martin Wortschack