Deploy Mixtral 8x22B-it v0.1 on Google Cloud Run with GPUs and use them on evaluation-runner

Following our successful POC with Mistral-7B-it #515198 (closed), this issue focuses on deploying Mixtral 8x22B-it v0.1 on Cloud Run with GPU acceleration. This larger model requires specific configuration for optimal performance.

Tasks:

Set up Cloud Run service with appropriate GPU allocation (likely 8 L4 GPUs)
Configure vLLM parameters for Mixtral 8x22B-it v0.1
Test performance and optimize settings
Update evaluation-runner to use this endpoint
Document specific requirements and usage patterns

This deployment will provide a serverless, on-demand solution for Mixtral 8x22B-it v0.1 evaluations, eliminating the need for dedicated infrastructure.

PS: Due to its size, special attention will need to be paid to cost management while ensuring the model remains accessible for evaluations.

Edited Feb 27, 2025 by Manoj M J