Self-healing for the inference service
Problem / Opportunity Statement
This takes #233 (closed) a step further.
SGLang has a built-in watchdog that restarts the server if it hangs. We've already seen this in action: a Prometheus alert fires because the model has stopped responding, and the server comes back on its own a few minutes later.
vLLM doesn't appear to have an equivalent. If it hangs, it stays hung until one of us restarts the container, which means sporadic, time-sensitive administrator intervention. It doesn't have to be that way.
Ditto for Open WebUI, which now appears to have a memory leak.
Resolution
Figure out some kind of self-healing for the Llama 4 Scout vLLM backend and for Open WebUI. There are many watchdog services for Docker containers; alternatively, we could switch from Docker to k3s and use a liveness probe. A rough sketch of the watchdog approach is below.
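As a minimal sketch of the watchdog idea (not a recommendation of any particular tool), assuming the vLLM container is named `vllm` and its OpenAI-compatible server exposes `/health` on port 8000, something like this could poll the endpoint and restart the container through the Docker SDK when it stops responding:

```python
"""Minimal watchdog sketch: restart a Docker container when its health
endpoint stops responding. Container name, URL, and thresholds are
assumptions for illustration, not settled values."""
import time

import docker    # pip install docker
import requests

CONTAINER_NAME = "vllm"                      # assumed container name
HEALTH_URL = "http://localhost:8000/health"  # assumed vLLM health endpoint
CHECK_INTERVAL_S = 30
MAX_FAILURES = 4                             # restart after ~2 minutes of silence


def healthy() -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        return requests.get(HEALTH_URL, timeout=10).status_code == 200
    except requests.RequestException:
        return False


def main() -> None:
    client = docker.from_env()
    failures = 0
    while True:
        if healthy():
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILURES:
                print(f"{CONTAINER_NAME} unresponsive, restarting")
                client.containers.get(CONTAINER_NAME).restart()
                failures = 0
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()
```

The k3s route would replace this custom script with a standard livenessProbe against the same endpoint. Open WebUI would need whatever health check it exposes (or a scheduled restart) to cover the memory-leak case.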