vLLM x Model Size Requirements: Iteration I

Customers for self-hosted models are requesting system requirements/reference architectures to use as recommended starting points when deploying self-hosted models in their environments. They are seeking reasonable ballpark estimates to help them provision adequate resources before bringing self-hosted models to production. Custom Models customers frequently request information on the system requirements for self-hosted models, including:

  • machine specs
  • CPU/GPU requirements

In this first iteration of meeting this need, GitLab will provide baseline system requirements for supporting open-source (OS) models on vLLM. For example, Mixtral 8x22B takes at least 8 GPUs just to run. Providing customers with a ballpark of baseline requirements (enough for the feature to operate at all) will let them procure the necessary prerequisites and set up their self-hosted models environment with more confidence. This first iteration would be a one-time effort to support the self-hosted models GA.
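
As a rough sanity check on that figure: fp16/bf16 weights occupy about 2 bytes per parameter, so the minimum GPU count needed just to hold the weights can be estimated with the back-of-envelope sketch below. This is an illustrative assumption, not a validated requirement; real deployments also need headroom for KV cache, activations, and CUDA overhead.

```python
import math

def min_gpus_for_weights(params_billions: float, gpu_mem_gb: float) -> int:
    """Minimum GPUs just to hold fp16/bf16 weights (~2 bytes/parameter).

    Ignores KV cache, activations, and CUDA overhead, so this is a floor
    ("enough to even run"), not a serving recommendation.
    """
    weights_gb = params_billions * 2  # ~2 bytes per parameter in fp16/bf16
    return math.ceil(weights_gb / gpu_mem_gb)

# Mixtral 8x22B has ~141B total parameters; on A100 40GB the floor is 8 GPUs.
print(min_gpus_for_weights(141, 40))  # -> 8
```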

Proposal

Establish baseline system requirements by:

  • Stand up an AI Gateway and GitLab environment; this could be a reference environment (with support from @grantyoung)
  • Choose 3-4 representative OS models and set them up in our GCP area
    • The machines to be tested are:
      • a2-highgpu-2g (24 vCPU + 170 GB memory, 2 NVIDIA A100 40GB)
      • a2-highgpu-4g (48 vCPU + 340 GB memory, 4 NVIDIA A100 40GB)
      • a2-highgpu-8g (96 vCPU + 680 GB memory, 8 NVIDIA A100 40GB)
      • g2-standard-24 (24 vCPU + 96 GB memory, 2 NVIDIA L4)
      • g2-standard-96 (96 vCPU + 384 GB memory, 8 NVIDIA L4)
    • Divide the representative OS models into small (such as Mistral 7B-it), medium (such as Mixtral 8x7B-it, roughly 12B active parameters), and large (such as Mixtral 8x22B)
    • Compute the tokens per second and response time on each machine (see the throughput sketch after this list)
  • Implement a load test harness; as a first iteration, we can select one of our evals, crank the number of requests per minute (RPM) or requests per second (RPS) as high as possible, and record results per model and RPM/RPS (see the load test sketch after this list)
    • Example:
      model    RPS  Failure rate  GPUs
      mistral  20   90%           1xA100
      mistral  10   50%           1xA100
      mistral  5    1%            1xA100
  • Document a starting point for baseline self-hosted model functionality
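
For the tokens-per-second measurement, a minimal sketch using vLLM's offline LLM API could look like the following. The model name, prompt batch, and tensor_parallel_size are placeholder assumptions; tensor_parallel_size should match the GPU count of the machine under test (e.g. 2 for a2-highgpu-2g, 8 for a2-highgpu-8g).

```python
import time
from vllm import LLM, SamplingParams

# Placeholder model; swap in each representative OS model under test.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=2)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Placeholder prompt batch; in practice, reuse prompts from one of our evals.
prompts = ["Write a Python function that reverses a string."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s over the batch")
print(f"{elapsed:.2f} s wall-clock for {len(prompts)} requests")
```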
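
For the load test, a minimal sketch is below, assuming the model is served through vLLM's OpenAI-compatible server (e.g. started with `vllm serve <model>`). The URL, payload, RPS, and duration are illustrative assumptions; the request body would come from the chosen eval.

```python
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
PAYLOAD = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    "prompt": "def reverse_string(s):",
    "max_tokens": 128,
}

async def one_request(session: aiohttp.ClientSession) -> bool:
    # A request counts as a failure on any non-200 response, timeout, or error.
    try:
        async with session.post(URL, json=PAYLOAD,
                                timeout=aiohttp.ClientTimeout(total=30)) as resp:
            return resp.status == 200
    except Exception:
        return False

async def run(rps: int, duration_s: int = 60) -> None:
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(duration_s):
            # Fire `rps` requests at the start of each one-second window.
            tasks += [asyncio.create_task(one_request(session)) for _ in range(rps)]
            await asyncio.sleep(1)
        results = await asyncio.gather(*tasks)
        failures = results.count(False)
        print(f"RPS={rps}: failure rate {failures / len(results):.0%}")

if __name__ == "__main__":
    asyncio.run(run(rps=10))
```

Sweeping `rps` upward per model and machine type would produce rows like the example table above.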

Definition of Done

  • We have documented the results of empirical baseline system requirement testing and honed them into recommendations.
  • Recommendations have been published to the custom models documentation.
  • Customers have a reference point for system requirements when choosing among supported inference platforms and models.