monitor inference service with Prometheus
Problem / Opportunity Statement
- services can stop working; we should find out quickly, before users do:
  - DeepSeek R1 API
  - Llama 3.3 API
  - Qwen 2.5-VL API
  - Llama 4 Scout API
  - Open WebUI
  - Keycloak, or whatever we end up using to allow self-service generation of API tokens (#219 (closed))
- a simple HEAD or GET request doesn't actually test that an LLM is working, but a POST request (to actually generate something) does.
Resolution
- Prometheus blackbox exporter to monitor all the above
  - when it's an actual LLM endpoint, do a POST with a real prompt and a low `max_tokens` limit so it returns within a few seconds (config sketch after this list)
    - DeepSeek R1
    - Llama 3.3
    - Llama 4 Scout
  - Open WebUI
  - Keycloak UI
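A minimal sketch of what one such blackbox module could look like, assuming an OpenAI-compatible chat completions endpoint; the module name, model id, prompt, and token file path below are placeholders, and we'd likely need one module per model since the request body differs:

```yaml
# blackbox.yml (sketch) -- one module per LLM endpoint; only the body's "model"
# field changes between them. Names and paths here are placeholders.
modules:
  llm_generate_deepseek_r1:
    prober: http
    timeout: 15s
    http:
      method: POST
      headers:
        Content-Type: application/json
      # if the endpoint needs a token, the http prober takes the usual
      # Prometheus HTTP client auth options, e.g.:
      #   authorization:
      #     type: Bearer
      #     credentials_file: /etc/blackbox_exporter/llm_api_token
      body: '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 4}'
      valid_status_codes: [200]
      # fail the probe unless the response actually contains a completion
      fail_if_body_not_matches_regexp:
        - '"finish_reason"'
```

Open WebUI and the Keycloak UI can probably stick with the stock `http_2xx` GET module, since a 200 from those pages is already a meaningful health signal.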
- Prometheus node exporters to monitor (scrape config sketch after this list)
  - gh01 (presently doing nothing because it kept crashing)
  - gh02 (presently ~~serving Llama 3.3~~ doing nothing)
  - the MI300x node (presently serving DeepSeek R1)
  - the Jetstream2 H100 instance (serving Llama 4 Scout)
  - the Rescloud instance for the web UI
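A sketch of the corresponding scrape job, assuming node exporter on its default port 9100 on each machine; the hostnames are placeholders for however these nodes are actually reachable from the Prometheus server:

```yaml
# prometheus.yml excerpt (sketch) -- hostnames are placeholders
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - gh01.example.edu:9100
          - gh02.example.edu:9100
          - mi300x.example.edu:9100      # DeepSeek R1 host
          - h100-js2.example.edu:9100    # Jetstream2 instance (Llama 4 Scout)
          - webui.example.edu:9100       # Rescloud instance for Open WebUI
```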
- Also scrape vLLM/SGLang exporters for all models above (scrape config sketch below)
  - DeepSeek R1
  - Llama 3.3
  - Llama 4 Scout
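These engines expose Prometheus metrics themselves (vLLM serves them at `/metrics` on its API port; SGLang does the same when metrics are enabled), so this should just be another scrape job; the hosts and ports below are placeholders:

```yaml
# prometheus.yml excerpt (sketch) -- targets are placeholders; point them at
# whichever host/port each engine's OpenAI-compatible server listens on
scrape_configs:
  - job_name: llm_engines
    metrics_path: /metrics
    static_configs:
      - targets:
          - mi300x.example.edu:8000        # DeepSeek R1
          - llama33-host.example.edu:8000  # Llama 3.3
          - h100-js2.example.edu:8000      # Llama 4 Scout
```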
- Little embedded Grafana dashboard on docs site showing real-time availability of UI and each API
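If we go the iframe route for the docs site, the embed could be as small as the sketch below; the Grafana URL, dashboard UID, and panel id are placeholders, and Grafana would need `allow_embedding` (plus anonymous or public-dashboard access) enabled for it to render for logged-out visitors:

```html
<!-- docs-site snippet (sketch) -- URL, dashboard UID, and panelId are placeholders -->
<iframe
  src="https://grafana.example.edu/d-solo/llm-availability/llm-status?orgId=1&panelId=1&refresh=30s"
  width="100%" height="220" frameborder="0">
</iframe>
```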