Document Mistral model serving
Choose a serving framework, document how to serve Mistral and Mixtral models with it, and make them available to AI Gateway. Identify the interface to this serving framework.
Mistral recommends either:
- vLLM (Apache-2.0 license)
  - "A high-throughput and memory-efficient inference and serving engine for LLMs"
  - One of our design collab customers is using vLLM for their inferencing.
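As a sketch of what serving with vLLM could look like: vLLM ships an OpenAI-compatible HTTP server entrypoint, so one plausible setup is to launch it directly against a Mistral checkpoint. The model name and port below are illustrative assumptions, and this assumes vLLM is installed on a GPU host.

```shell
# Sketch only: start vLLM's OpenAI-compatible API server.
# Assumes vLLM is installed and a suitable GPU is available;
# the model ID and port are example values, not a decided config.
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --port 8000
```

If we go this route, AI Gateway would talk to the server over the OpenAI-style REST interface rather than a framework-specific API.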
- TensorRT-LLM (Apache-2.0 license)
  - "TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines."
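Whichever framework we pick, the interface AI Gateway would see is most likely an OpenAI-style chat-completions request (vLLM exposes this natively; TensorRT-LLM deployments typically front it with Triton or a similar shim). A minimal sketch of building such a request body, assuming an OpenAI-compatible endpoint and using an example Mixtral model ID:

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Build a JSON body for an OpenAI-compatible /v1/chat/completions
    endpoint. Sketch only: field values here are illustrative defaults,
    not a decided gateway contract."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return json.dumps(body)

# Example model ID is an assumption, not a chosen deployment target.
payload = build_chat_request("mistralai/Mixtral-8x7B-Instruct-v0.1", "Hello")
print(payload)
```

AI Gateway would POST this body to the serving framework's `/v1/chat/completions` route and parse the standard completion response.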
Edited by Sean Carroll