Performance of auto-scaled runners
Recently we have been hit by long build queues. It started happening when we merged Knapsack, which put a lot of pressure on our infrastructure. We see that our runners infrastructure needs to be improved:
1. On `docker-ci-1/2` and `shared-runners-manager-1` we have `rsyslog` configured. The Runner generates a lot of logs, and `rsyslog` quickly locks itself up, unable to send all log messages to the logging server over its TCP connection. This slows down processing of the build queue and slows down the builds themselves, since the runner process is unable to post logs to rsyslog (a minimal sketch of this back-pressure follows the list).
2. Docker-machine locks the main queue on the Runner, allowing it to process only about 500-600 builds per hour: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/1390 (a toy model of this bottleneck follows the list).
3. We are hit by rate limiting on Digital Ocean. A single API token can be used only 5,000 times per hour. A docker-machine creation eats about 13 requests from that budget, we need another 2 requests to remove a machine, and we spend 1 request every 6 s checking whether the VM is running. This leaves something like 300 builds processable by a single GitLab Runner (see the back-of-the-envelope calculation after the list). We are thinking with @tmaczukin about ways of working around this limitation and getting a bigger limit.
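To make the first problem concrete, here is a minimal Go sketch of the same back-pressure mechanism: a process logging over a TCP syslog connection stalls as soon as the receiver stops draining the connection, just as the Runner stalls behind `rsyslog`. The address and tag are hypothetical.

```go
package main

import (
	"fmt"
	"log/syslog"
	"time"
)

func main() {
	// Hypothetical address; stands in for the central logging server.
	w, err := syslog.Dial("tcp", "logs.example.com:514", syslog.LOG_INFO, "gitlab-runner")
	if err != nil {
		panic(err)
	}
	defer w.Close()

	for i := 0; i < 1000000; i++ {
		start := time.Now()
		// Once the receiver stops draining the TCP connection, this write
		// blocks instead of dropping the message -- and so does the caller.
		fmt.Fprintf(w, "build log line %d", i)
		if d := time.Since(start); d > time.Second {
			fmt.Printf("log write stalled for %v\n", d)
		}
	}
}
```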
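For the second problem, this toy model (all numbers are illustrative, not taken from the Runner's code) shows why a single lock around docker-machine calls caps throughput no matter how many builds run concurrently: a ~6 s serialized section yields at most 3600 / 6 = 600 builds per hour, which matches the ceiling we observe.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var machineLock sync.Mutex // stands in for the Runner's global docker-machine lock

	// Assumed cost of one docker-machine call; 6 s is chosen so the model
	// reproduces the observed ceiling (3600 s / 6 s = 600 builds/hour).
	provision := func() {
		machineLock.Lock()
		defer machineLock.Unlock()
		time.Sleep(6 * time.Second)
	}

	const builds = 20
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < builds; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			provision() // serialized: extra concurrency buys nothing here
		}()
	}
	wg.Wait()
	rate := float64(builds) / time.Since(start).Hours()
	fmt.Printf("~%.0f builds/hour despite %d concurrent workers\n", rate, builds)
}
```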
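And for the third problem, the "something like 300" figure checks out on the back of an envelope, assuming the 6 s status check is a single global poll loop (that part is our reading of the numbers above):

```go
package main

import "fmt"

func main() {
	const (
		budget       = 5000.0 // DO API requests allowed per token per hour
		createCost   = 13.0   // requests for one docker-machine create
		removeCost   = 2.0    // requests to remove a machine
		pollInterval = 6.0    // seconds between "is the VM running?" checks
	)
	polling := 3600.0 / pollInterval    // 600 requests/hour spent on polling alone
	perBuild := createCost + removeCost // 15 requests per machine lifecycle
	builds := (budget - polling) / perBuild
	fmt.Printf("(%.0f - %.0f) / %.0f ≈ %.0f builds/hour per token\n",
		budget, polling, perBuild, builds)
}
```

Running it prints `(5000 - 600) / 15 ≈ 293 builds/hour per token`.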
This all leads to long build queues, sometimes up to 40 minutes: https://checkmk.gitlap.com/gitlab/check_mk/index.py?start_url=%2Fgitlab%2Fcheck_mk%2Fview.py%3Fview_name%3Dservice%26service%3DCI_RUNNERS%26host%3Ddb4.cluster.gitlab.com.
Hopefully we can solve 1., improve 2., and work around 3. this week to have a good experience again.