POC - Implement a LiteLLM load balancer
Problem to solve
Since we plan to integrate multiple LLM providers (Vertex, Bedrock), we need to build a request distribution system that routes identical feature requests across different providers and models to optimize our capacity utilization.
Proposal
Let's use LiteLLM's routing implementation (`litellm.Router`) to distribute requests across provider/model combinations. For the first iteration, we will keep routing simple (e.g., static weighting or round-robin) without token-, latency-, or error-rate-based strategies. In later iterations, we can introduce Redis as the state store to enable advanced routing strategies based on token consumption, latency, and error rates.
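As a rough sketch of the first iteration, the configuration below builds a LiteLLM `model_list` that maps one shared alias onto two provider/model deployments with static weights. The alias name, model IDs, and weights are placeholders, not decided values; the `Router` call itself is shown commented out since it requires provider credentials.

```python
# Sketch: two deployments behind one alias, weighted statically.
# Model IDs ("vertex_ai/…", "bedrock/…") and weights are placeholders.
model_list = [
    {
        "model_name": "gitlab-default",  # shared alias that features request
        "litellm_params": {
            "model": "vertex_ai/gemini-1.5-pro",
            "weight": 2,  # static weighting: roughly 2/3 of traffic
        },
    },
    {
        "model_name": "gitlab-default",
        "litellm_params": {
            "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
            "weight": 1,  # roughly 1/3 of traffic
        },
    },
]

# With credentials configured, routing would look like:
#
#   from litellm import Router
#   router = Router(model_list=model_list, routing_strategy="simple-shuffle")
#   response = router.completion(
#       model="gitlab-default",
#       messages=[{"role": "user", "content": "Hello"}],
#   )

aliases = {entry["model_name"] for entry in model_list}
print(aliases)
```

Later iterations could swap `routing_strategy` for a stateful one (e.g., usage- or latency-based) once Redis is available as the shared state store.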
Further details
- We need to modify our unit primitives' default model data structure from its current format to an array, then transform our existing data structure to match LiteLLM's expected format.
- Load balancing only applies to the “GitLab default” model, i.e., any models that users have explicitly selected via model selection will not be affected by the load balancing mechanism.
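The data-structure change above can be sketched as a two-step transform: widen the primitive's scalar default-model field into an array, then map that array onto LiteLLM's expected `model_list` entries. The primitive's field names (`default_model`, `default_models`) and the example values are assumptions, not the actual schema.

```python
def to_model_array(primitive: dict) -> dict:
    """Widen a scalar `default_model` field into a `default_models` array.

    Field names are hypothetical stand-ins for the real primitive schema.
    """
    out = dict(primitive)
    default = out.pop("default_model", None)
    out["default_models"] = [default] if default else []
    return out


def to_litellm_model_list(primitive: dict, alias: str = "gitlab-default") -> list:
    """Map the default-model array onto LiteLLM's model_list shape."""
    return [
        {"model_name": alias, "litellm_params": {"model": model}}
        for model in primitive.get("default_models", [])
    ]


# Example with a placeholder primitive:
primitive = {"name": "duo_chat", "default_model": "vertex_ai/gemini-1.5-pro"}
converted = to_model_array(primitive)
model_list = to_litellm_model_list(converted)
print(model_list)
```

User-selected models would bypass this transform entirely and keep resolving to a single fixed model, so only the “GitLab default” path ever produces a multi-entry `model_list`.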
Links / references
Edited by Martin Wortschack