Metrics endpoint gets progressively slower with rising resource usage
This is an issue with Prometheus' multiprocess collector. I'm not sure yet how to address it (or whether it really needs addressing), but I want to make some notes here for future reference as I work through it.
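For reference, the way this works (as far as I understand it) is that each gunicorn worker writes its metric values to .db files in prometheus_multiproc_dir, and the /metrics handler builds a fresh registry and merges all of those files on every scrape. A rough sketch of such a handler, using the client's documented multiprocess API (the Flask-style return value is an assumption about my wiring, not the actual code):

```python
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    generate_latest,
    multiprocess,
)


def metrics():
    # Build a fresh registry per scrape. MultiProcessCollector reads every
    # .db file under prometheus_multiproc_dir and merges them, so thousands
    # of stale files from dead workers make each scrape slower and slower.
    registry = CollectorRegistry()
    multiprocess.MultiProcessCollector(registry)
    return generate_latest(registry), 200, {"Content-Type": CONTENT_TYPE_LATEST}
```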
Background: Recently I started getting some weird results in the site metrics. Some metrics seemed to have incomplete data, and I'd occasionally get false alerts from Grafana reporting a service as down because a scrape had failed. This got especially bad today, when I noticed that the monitoring server's scrapes of /metrics were taking a consistent 10 seconds. There seems to be a timeout involved somewhere (not sure where offhand), which was probably causing incomplete data to be returned.
I found some issues on the Prometheus Python client repo describing similar problems with gunicorn, and after reading this recent one, I found there were about 8000 .db files in my gunicorn prometheus_multiproc_dir. Deleting these files made the metrics endpoint start returning instantly again, and overall CPU load on the server even dropped from ~0.4 to ~0.1 (from this single endpoint getting hit once every 30 seconds!).
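For next time, a quick way to check whether the directory is filling up again (the fallback path here is just a placeholder; the real one is wherever prometheus_multiproc_dir points):

```python
import glob
import os

# prometheus_multiproc_dir is whatever the gunicorn environment sets it to;
# the fallback path below is only a placeholder for this sketch.
mp_dir = os.environ.get("prometheus_multiproc_dir", "/tmp/prometheus")
db_files = glob.glob(os.path.join(mp_dir, "*.db"))
print(f"{len(db_files)} .db files in {mp_dir}")
```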
As far as I can tell from a quick look through the various issues filed about this, there isn't really a "proper" solution at this point. Fully restarting gunicorn occasionally (not just reloading) seems like it will probably be fine as a workaround, since a full restart creates a fresh temp directory.
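One partial mitigation the client's docs do describe for gunicorn is a child_exit server hook that calls mark_process_dead() when a worker dies, something like this in the gunicorn config file (I haven't actually tried it yet):

```python
# gunicorn.conf.py
from prometheus_client import multiprocess


def child_exit(server, worker):
    # Called by the gunicorn master when a worker exits. This removes the
    # dead worker's live-gauge .db files; counter and histogram files are
    # deliberately kept so their values aren't lost, which is why the
    # directory can still grow and a periodic full restart still helps.
    multiprocess.mark_process_dead(worker.pid)
```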
So overall, this is annoying, but not really a major problem.