Spike: Examine if Google Cloud Run with GPUs is a fit for running vLLM models and create a POC
Google Cloud Run now supports using GPUs for deployment. This can possibly be used to ease deployments of vLLM models for evaluation runner (and in the future, also serve as a platform for devs to spin up models on-demand for local development work)
- Link: https://cloud.google.com/run/docs/configuring/services/gpu#before-you-begin
- Blog post: https://cloud.google.com/blog/products/application-development/run-your-ai-inference-applications-on-cloud-run-with-nvidia-gpus
Tasks:
-
Setup CloudRun with GPU, make sure it works -
Deploy a model like Mistral 7b with vLLM, make sure it works -
In evaluation runner, try to use this setup and talk to the model successfully -
Create a POC where the evaluation runner can talk to the model running on vLLM on CloudRun, and execute an evaluation successfully.
Edited by Manoj M J