Spike: Examine if Google Cloud Run with GPUs is a fit for running vLLM models and create a POC

Google Cloud Run now supports using GPUs for deployment. This can possibly be used to ease deployments of vLLM models for evaluation runner (and in the future, also serve as a platform for devs to spin up models on-demand for local development work)

Link: https://cloud.google.com/run/docs/configuring/services/gpu#before-you-begin
Blog post: https://cloud.google.com/blog/products/application-development/run-your-ai-inference-applications-on-cloud-run-with-nvidia-gpus

Tasks:

Setup CloudRun with GPU, make sure it works
Deploy a model like Mistral 7b with vLLM, make sure it works
In evaluation runner, try to use this setup and talk to the model successfully
Create a POC where the evaluation runner can talk to the model running on vLLM on CloudRun, and execute an evaluation successfully.

Edited Feb 27, 2025 by Manoj M J