Deploy Mixtral 8x22B-it v0.1 on Google Cloud Run with GPUs and use them on evaluation-runner
Following our successful POC with Mistral-7B-it #515198 (closed), this issue focuses on deploying Mixtral 8x22B-it v0.1 on Cloud Run with GPU acceleration. This larger model requires specific configuration for optimal performance.
Tasks:
- Set up Cloud Run service with appropriate GPU allocation (likely 8 L4 GPUs)
- Configure vLLM parameters for Mixtral 8x22B-it v0.1
- Test performance and optimize settings
- Update evaluation-runner to use this endpoint
- Document specific requirements and usage patterns
This deployment will provide a serverless, on-demand solution for Mixtral 8x22B-it v0.1 evaluations, eliminating the need for dedicated infrastructure.
PS: Due to its size, special attention will need to be paid to cost management while ensuring the model remains accessible for evaluations.
Edited by Manoj M J