Deploy Mixtral 8x22B-it v0.1 on Google Cloud Run with GPUs and use them on evaluation-runner

Following our successful POC with Mistral-7B-it #515198 (closed), this issue focuses on deploying Mixtral 8x22B-it v0.1 on Cloud Run with GPU acceleration. This larger model requires specific configuration for optimal performance.

Tasks:

  • Set up Cloud Run service with appropriate GPU allocation (likely 8 L4 GPUs)
  • Configure vLLM parameters for Mixtral 8x22B-it v0.1
  • Test performance and optimize settings
  • Update evaluation-runner to use this endpoint
  • Document specific requirements and usage patterns

This deployment will provide a serverless, on-demand solution for Mixtral 8x22B-it v0.1 evaluations, eliminating the need for dedicated infrastructure.

PS: Due to its size, special attention will need to be paid to cost management while ensuring the model remains accessible for evaluations.

Edited by Manoj M J