monitor inference service with Prometheus

Problem / Opportunity Statement

  • services can stop working; we should find out quickly, before users do. Services of concern:
    • DeepSeek R1 API
    • Llama 3.3 API
    • Qwen 2.5-VL API
    • Llama 4 Scout API
    • Open WebUI
    • Keycloak, or whatever we end up using to allow self-service generation of API tokens (#219 (closed))
  • a simple HEAD or GET request doesn't prove that an LLM is working; a POST request that actually generates something does (example request below).
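
For example, a probe against an OpenAI-compatible chat endpoint could look roughly like this (a minimal sketch; the URL, model name, and token are placeholders, not our real values):

```sh
# Liveness check that exercises the full generation path.
# A low max_tokens keeps the probe cheap and fast while still proving the model can generate.
curl -s https://llm.example.org/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_TOKEN" \
  -d '{"model": "llama-3.3-70b", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 5}'
```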

Resolution

  • Prometheus blackbox exporter to monitor all of the above (module sketch after this list)
    • for the actual LLM endpoints, do a POST with a real prompt and a low max_tokens limit so the request returns within a few seconds
    • DeepSeek R1
    • Llama 3.3
    • Llama 4 Scout
    • Open WebUI
    • Keycloak UI
  • Prometheus node exporters to monitor the following hosts (scrape job sketched after this list)
    • gh01 (presently doing nothing because it kept crashing)
    • gh02 (presently serving Llama 3.3)
    • the MI300x node (presently serving DeepSeek R1)
    • the Jetstream2 H100 instance (serving Llama 4 Scout)
    • the Rescloud instance for the web UI
  • Also scrape the vLLM/SGLang Prometheus exporters for all models above (scrape job sketched after this list)
    • DeepSeek R1
    • Llama 3.3
    • Llama 4 Scout
  • Little embedded Grafana dashboard on the docs site showing real-time availability of the UI and each API (embed snippet sketched after this list)
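
A possible shape for the blackbox exporter config, as a minimal sketch: the module names, prompt, model name, and token header are assumptions, and in practice each LLM endpoint probably needs its own POST module because the request body names the model.

```yaml
# blackbox.yml (sketch)
modules:
  llm_generate:            # POST probe for an OpenAI-compatible LLM endpoint
    prober: http
    timeout: 20s
    http:
      method: POST
      headers:
        Content-Type: application/json
        # Authorization: Bearer <token>   # if the endpoint requires an API token
      body: '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 5}'
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - '"choices"'       # a successful completion response contains a choices array
  http_2xx:                 # plain GET probe for Open WebUI and the Keycloak UI
    prober: http
    timeout: 10s
    http:
      method: GET
```

Prometheus then hits the exporter's /probe endpoint with the usual relabeling, roughly like this (the exporter address and target URLs are placeholders):

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: llm_api_probe
    metrics_path: /probe
    params:
      module: [llm_generate]
    static_configs:
      - targets:
          - https://deepseek-r1.example.org/v1/chat/completions
          - https://llama-3-3.example.org/v1/chat/completions
          - https://llama-4-scout.example.org/v1/chat/completions
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # where the blackbox exporter itself runs
```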
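
For the node exporters, a scrape job along these lines would cover the hosts; 9100 is the exporter's default port, and the hostnames are placeholders, not the real addresses:

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - gh01.example.org:9100            # idle (kept crashing)
          - gh02.example.org:9100            # Llama 3.3
          - mi300x.example.org:9100          # DeepSeek R1
          - js2-h100.example.org:9100        # Llama 4 Scout (Jetstream2)
          - rescloud-webui.example.org:9100  # Open WebUI host
```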
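
vLLM exposes Prometheus metrics on the serving port at /metrics (SGLang does too when launched with metrics enabled), so no separate exporter is needed; a plain scrape job should do. Hostnames and ports below are placeholders:

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: llm_runtime
    metrics_path: /metrics
    static_configs:
      - targets:
          - mi300x.example.org:8000      # DeepSeek R1
          - gh02.example.org:8000        # Llama 3.3
          - js2-h100.example.org:8000    # Llama 4 Scout
```

This adds engine-level stats (running/queued requests, token throughput) on top of the up/down signal from the blackbox probes.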
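
For the docs-site dashboard, individual Grafana panels can be embedded as iframes using the d-solo URL form, provided allow_embedding is enabled in grafana.ini ([security] section). The dashboard UID, slug, and panel id below are placeholders:

```html
<!-- requires: [security] allow_embedding = true in grafana.ini -->
<iframe
  src="https://grafana.example.org/d-solo/abc123/llm-availability?orgId=1&panelId=2&refresh=30s"
  width="600" height="200" frameborder="0">
</iframe>
```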