Spike: Examine if Google Cloud Run with GPUs is a fit for running vLLM models and create a POC

Google Cloud Run now supports using GPUs for deployment. This can possibly be used to ease deployments of vLLM models for evaluation runner (and in the future, also serve as a platform for devs to spin up models on-demand for local development work)

Tasks:

  • Setup CloudRun with GPU, make sure it works
  • Deploy a model like Mistral 7b with vLLM, make sure it works
  • In evaluation runner, try to use this setup and talk to the model successfully
  • Create a POC where the evaluation runner can talk to the model running on vLLM on CloudRun, and execute an evaluation successfully.
Edited by Manoj M J