Self-healing for the inference service
Problem / Opportunity Statement
This takes #233 (closed) a step further.
SGLang has a built-in watchdog that restarts the server if it hangs. We've already seen this in action: a Prometheus alert fires because the model has stopped responding, and the server comes back on its own a few minutes later.
vLLM doesn't appear to have an equivalent. If it hangs, it stays hung until one of us restarts the container, which means sporadic, time-sensitive administrator intervention. It doesn't have to be that way.
Ditto for Open WebUI, which now appears to have a memory leak.
Resolution
Figure out some kind of self-healing for the Llama 4 Scout vLLM backend and for Open WebUI. There are many watchdog services for Docker containers; alternatively, we could switch from Docker to k3s and use a liveness probe. A rough sketch of the watchdog approach is below.
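As a minimal sketch of the watchdog idea (not a recommendation of any particular tool), assuming the vLLM container is named `vllm` and its OpenAI-compatible server exposes `/health` on port 8000, something like this could poll the endpoint and restart the container through the Docker SDK when it stops responding:

```python
"""Minimal watchdog sketch: restart a Docker container when its health
endpoint stops responding. Container name, URL, and thresholds are
assumptions for illustration, not settled values."""
import time

import docker    # pip install docker
import requests

CONTAINER_NAME = "vllm"                      # assumed container name
HEALTH_URL = "http://localhost:8000/health"  # assumed vLLM health endpoint
CHECK_INTERVAL_S = 30
MAX_FAILURES = 4                             # restart after ~2 minutes of silence


def healthy() -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        return requests.get(HEALTH_URL, timeout=10).status_code == 200
    except requests.RequestException:
        return False


def main() -> None:
    client = docker.from_env()
    failures = 0
    while True:
        if healthy():
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILURES:
                print(f"{CONTAINER_NAME} unresponsive, restarting")
                client.containers.get(CONTAINER_NAME).restart()
                failures = 0
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()
```

The k3s route would replace this custom script with a standard livenessProbe against the same endpoint. Open WebUI would need whatever health check it exposes (or a scheduled restart) to cover the memory-leak case.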