Improve runner's cache servers monitoring
A follow-up to #2059 (moved). Alerting for the cache servers should be repaired (the current alerts are not working properly) and improved.
What should be done:
- fix the `RunnersNginxDockerRegistryCacheDown` alert - it's not working properly at the moment
- add monitoring of the number of connections present on cache servers
- add alert for a high number of connections present on cache servers
- prepare a script for cache servers that will: stop nginx, restart all running services (`minio_minio` and `registry`) and start nginx again, to make sure that on restart all hanging connections are terminated and services are started in a "clean" state
- periodically upload and download files to cache and push/pull images from registry cache and measure the times of such operations; add monitoring and alerting based on such metrics
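
A rough sketch of the restart ordering from the script item above (stop nginx, restart the services, start nginx again). It makes two assumptions this issue does not confirm: that nginx is managed through `systemctl`, and that `minio_minio` and `registry` are Docker containers restartable with `docker restart`. The helper names `build_restart_plan` and `run_plan` are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import Callable, List, Sequence

# Service names taken from this issue; how they are actually managed is an assumption.
SERVICES = ("minio_minio", "registry")


def build_restart_plan(services: Sequence[str] = SERVICES) -> List[List[str]]:
    """Return the commands in the order described in the issue."""
    # 1. Stop nginx first so no new connections reach the backends.
    #    ASSUMPTION: nginx runs as a systemd unit named "nginx".
    plan = [["systemctl", "stop", "nginx"]]
    # 2. Restart every backend service, dropping any hanging connections.
    #    ASSUMPTION: the services are Docker containers.
    plan += [["docker", "restart", name] for name in services]
    # 3. Start nginx again once the backends are back in a "clean" state.
    plan.append(["systemctl", "start", "nginx"])
    return plan


def run_plan(
    plan: Sequence[Sequence[str]],
    runner: Callable[..., object] = subprocess.run,
    dry_run: bool = True,
) -> List[str]:
    """Execute the plan in order and return the commands that were processed.

    With dry_run=True nothing is executed. With dry_run=False the first
    failing command raises CalledProcessError and the remaining steps are
    skipped, so nginx may be left stopped and need manual attention.
    """
    processed = []
    for cmd in plan:
        if not dry_run:
            runner(list(cmd), check=True)
        processed.append(" ".join(cmd))
    return processed


if __name__ == "__main__":
    # Print the planned commands without touching the host.
    for line in run_plan(build_restart_plan(), dry_run=True):
        print(line)
```

Keeping the plan separate from its execution makes the ordering easy to review and to test without touching a real cache server.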
Edited by Tomasz Maczukin