Introduce long polling for CI requests
GitLab Runner uses active polling to ask GitLab for new builds.
How it works now:
- GitLab Runner asks GitLab for new builds roughly every 3 seconds (the interval is configurable, with a default of 3s),
- GitLab looks up the runner and updates its information,
- GitLab looks at the list of builds and decides if a build should be processed by this runner (this is done using a single SQL query),
- GitLab returns immediately, either with information about a build or with an empty response if we do not have one,
- Runner asks for a new build after 3 seconds if no build was received last time,
- If Runner receives a build, it will ask again immediately after receiving it,
- If Runner receives a `40x` response, it will back off for one hour,
- If Runner receives a `50x` response, it will retry again in a few seconds,
- Runner uses HTTP keep-alive and reuses existing TCP connections as much as possible: shared runners which process up to 200 builds at a time require about 6-7 active connections to GitLab to deliver information and update the status of builds,
- Shared Runners with `concurrent=200` make about one request every second, so if there are no builds this generates about 3600 requests per hour asking for new builds.
On GitLab.com we have:
- about 2787 runners active within the last hour: `Ci::Runner.where('contacted_at > ?', 1.hour.ago).count`,
- CI traffic generates about 20-40 million requests per day (reference),
- on average it takes 32ms to process a request (reference),
- currently we are not entirely sure that the large amount of CI requests is the culprit of GitLab.com's problems,
- currently we don't have exact information about the cost of running CI on GitLab.com, since we run a uniform infrastructure,
- CI runners can use the full capacity of our uniform infrastructure, because the load comes from different IPs.
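As a sanity check on those numbers, 20-40 million requests per day at 32ms each amounts to roughly 180-360 worker-hours of processing time spent every day just on CI requests. This is a back-of-the-envelope calculation from the figures above, not a measurement of actual infrastructure cost:

```go
package main

import "fmt"

// Back-of-the-envelope arithmetic from the stats above: requests per day
// times average processing time, expressed as worker-hours per day.
func main() {
	const avgMs = 32.0 // average processing time per request, in ms
	for _, reqPerDay := range []float64{20e6, 40e6} {
		seconds := reqPerDay * avgMs / 1000
		fmt.Printf("%.0fM req/day -> %.0f worker-hours/day\n",
			reqPerDay/1e6, seconds/3600)
	}
}
```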
The reasons for the improvement:
- the large amount of CI requests seems to put a lot of pressure on workers in case of downtime,
- we have effectively built our own DDoS service that may hammer our workers,
- active polling is not a good or scalable way to handle a big number of runners,
- we want to reduce the database pressure needed to update the state of runners, since we see a lot of vacuuming on the Postgres database.
Ideas for how to improve this:
- rate limit CI requests by dropping them: this is already done, we limit up to 10 TCP connections per IP,
- handle CI requests with a dedicated part of the infrastructure that only processes CI requests: https://gitlab.com/gitlab-com/infrastructure/issues/320,
- change the active polling interval from 3s to 6s or another value, and introduce back-off if there are no builds,
- queue CI requests in Workhorse to slow down Runners when we can't process them in time: https://gitlab.com/gitlab-com/infrastructure/issues/320#note_13739444
- implement long polling: the Unicorn server that we use doesn't work well with long-polling requests, but we could implement this in GitLab Workhorse by introducing some form of queueing on Redis/Workhorse,
- better handle `50x` responses by implementing exponential backoff,
- find a way to change how GitLab Runner connects to GitLab to receive new builds,
- make events run in `build/register` fully asynchronous: right now it's possible that an event will trigger a file system read while a build is being picked for processing; this can lead to problems when disk access is slow, and can show up misleadingly as slow SQL queries dumped from the database: `SELECT FOR UPDATE`.