monitor inference service with Prometheus

Problem / Opportunity Statement

  • services can stop working; we should find out quickly, before users do. Services of concern:
    • DeepSeek R1 API
    • Llama 3.3 API
    • Qwen 2.5-VL API
    • Llama 4 Scout API
    • Open WebUI
    • Keycloak, or whatever we end up using to allow self-service generation of API tokens (#219 (closed))
  • a simple HEAD or GET request doesn't prove that an LLM is working; a POST request that actually generates something does (example request below).
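
For example, a probe against an OpenAI-compatible chat endpoint could look roughly like this (a minimal sketch; the URL, model name, and token are placeholders, not our real values):

```sh
# Liveness check that exercises the full generation path.
# A low max_tokens keeps the probe cheap and fast while still proving the model can generate.
curl -s https://llm.example.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_TOKEN" \
  -d '{"model": "llama-3.3-70b", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 5}'
```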

Resolution

  • Prometheus blackbox exporter to monitor all of the above (module sketch after this list)
    • for the actual LLM endpoints, do a POST with a real prompt and a low max_tokens limit so the request returns within a few seconds
    • DeepSeek R1
    • Llama 3.3
    • Llama 4 Scout
    • Open WebUI
    • Keycloak UI
  • Prometheus node exporters to monitor the following hosts (scrape job sketched after this list)
    • gh01 (presently doing nothing because it kept crashing)
    • gh02 (presently serving Llama 3.3)
    • the MI300x node (presently serving DeepSeek R1)
    • the Jetstream2 H100 instance (serving Llama 4 Scout)
    • the Rescloud instance for the web UI
  • Also scrape the vLLM/SGLang Prometheus exporters for all models above (scrape job sketched after this list)
    • DeepSeek R1
    • Llama 3.3
    • Llama 4 Scout
  • Little embedded Grafana dashboard on the docs site showing real-time availability of the UI and each API (embed snippet sketched after this list)
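
A possible shape for the blackbox exporter config, as a minimal sketch: the module names, prompt, model name, and token header are assumptions, and in practice each LLM endpoint probably needs its own POST module because the request body names the model.

```yaml
# blackbox.yml (sketch)
modules:
  llm_generate:            # POST probe for an OpenAI-compatible LLM endpoint
    prober: http
    timeout: 20s
    http:
      method: POST
      headers:
        Content-Type: application/json
        # Authorization: Bearer <token>   # if the endpoint requires an API token
      body: '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 5}'
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - '"choices"'       # a successful completion response contains a choices array
  http_2xx:                 # plain GET probe for Open WebUI and the Keycloak UI
    prober: http
    timeout: 10s
    http:
      method: GET
```

Prometheus then hits the exporter's /probe endpoint with the usual relabeling, roughly like this (the exporter address and target URLs are placeholders):

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: llm_api_probe
    metrics_path: /probe
    params:
      module: [llm_generate]
    static_configs:
      - targets:
          - https://deepseek-r1.example.org/v1/chat/completions
          - https://llama-3-3.example.org/v1/chat/completions
          - https://llama-4-scout.example.org/v1/chat/completions
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # where the blackbox exporter itself runs
```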
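
For the node exporters, a scrape job along these lines would cover the hosts; 9100 is the exporter's default port, and the hostnames are placeholders, not the real addresses:

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - gh01.example.org:9100            # idle (kept crashing)
          - gh02.example.org:9100            # Llama 3.3
          - mi300x.example.org:9100          # DeepSeek R1
          - js2-h100.example.org:9100        # Llama 4 Scout (Jetstream2)
          - rescloud-webui.example.org:9100  # Open WebUI host
```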
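
vLLM exposes Prometheus metrics on the serving port at /metrics (SGLang does too when launched with metrics enabled), so no separate exporter is needed; a plain scrape job should do. Hostnames and ports below are placeholders:

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: llm_runtime
    metrics_path: /metrics
    static_configs:
      - targets:
          - mi300x.example.org:8000      # DeepSeek R1
          - gh02.example.org:8000        # Llama 3.3
          - js2-h100.example.org:8000    # Llama 4 Scout
```

This adds engine-level stats (running/queued requests, token throughput) on top of the up/down signal from the blackbox probes.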
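
For the docs-site dashboard, individual Grafana panels can be embedded as iframes using the d-solo URL form, provided allow_embedding is enabled in grafana.ini ([security] section). The dashboard UID, slug, and panel id below are placeholders:

```html
<!-- requires: [security] allow_embedding = true in grafana.ini -->
<iframe
  src="https://grafana.example.org/d-solo/abc123/llm-availability?orgId=1&panelId=2&refresh=30s"
  width="600" height="200" frameborder="0">
</iframe>
```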