Introduce long polling for CI requests

GitLab Runner uses active polling to ask GitLab for new builds.

How it works now:

  • GitLab Runner asks GitLab roughly every 3 seconds for new builds (the interval is configurable, with the default set to 3s),
  • GitLab looks up the runner's information and updates contacted_at and version in the database,
  • GitLab looks at the list of builds and decides whether a build should be processed by this runner (this is done using a single SQL query),
  • GitLab returns immediately, answering either "have a build" or "no build",
  • the Runner asks for a new build after 3 seconds if no build was received last time,
  • if the Runner receives a build, it asks again immediately afterwards,
  • if the Runner receives a 40x response, it backs off for one hour,
  • if the Runner receives a 50x response, it retries again in a few seconds,
  • the Runner uses HTTP keep-alive and reuses existing TCP connections as much as possible: shared runners that process up to 200 builds at a time require only about 6-7 active connections to GitLab to deliver information and update the status of builds,
  • shared Runners with concurrent=200 make one request every second, so if there are no builds they generate about 3600 requests per hour asking for new builds (a sketch of this polling loop follows the list).
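To make the cycle above concrete, here is a minimal sketch of such a polling loop in Go. The endpoint URL, status codes, and names are illustrative assumptions, not the actual GitLab Runner source:

```go
// Sketch of an active-polling loop as described above (illustrative only).
package main

import (
	"net/http"
	"time"
)

const pollInterval = 3 * time.Second // configurable; default is 3s

func pollForBuilds(client *http.Client, url string) {
	for {
		// HTTP keep-alive on the shared client reuses TCP connections.
		resp, err := client.Post(url, "application/json", nil)
		if err != nil {
			time.Sleep(pollInterval)
			continue
		}
		resp.Body.Close()

		switch {
		case resp.StatusCode == http.StatusCreated:
			// Received a build: process it, then ask again immediately.
		case resp.StatusCode >= 500:
			// 50x: retry again in a few seconds.
			time.Sleep(3 * time.Second)
		case resp.StatusCode >= 400:
			// 40x: back off for one hour.
			time.Sleep(time.Hour)
		default:
			// No build this time: wait the polling interval and re-ask.
			time.Sleep(pollInterval)
		}
	}
}

func main() {
	// Hypothetical registration endpoint, for illustration only.
	pollForBuilds(http.DefaultClient, "https://gitlab.example.com/ci/builds/register")
}
```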

On GitLab.com we have:

  • about 2787 runners active within the last hour: Ci::Runner.where('contacted_at > ?', 1.hour.ago).count,
  • CI requests generate about 20-40 million requests per day (reference),
  • on average it takes 32ms to process a request (reference).
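As a back-of-the-envelope check (our own arithmetic from the figures above, not from the referenced dashboards): at ~30 million requests per day and ~32ms each, answering CI polls costs about 30,000,000 × 0.032s ≈ 960,000 CPU-seconds per day, i.e. roughly 11 workers kept busy around the clock doing nothing but answering polls.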

Problems:

  • currently we are not entirely sure that the large number of CI requests is the culprit of GitLab.com's problems,
  • currently we don't have exact information about the cost of running CI on GitLab.com, since we run uniform infrastructure,
  • CI runners can use the full capacity of our uniform infrastructure, because the load comes from different IPs.

The reasons for the improvement:

  • the large number of CI requests seems to put a lot of pressure on workers in case of downtime,
  • we have effectively built our own DDoS service that may hammer our workers,
  • active polling is not a good, scalable way to handle a large number of runners,
  • we want to reduce the database pressure needed to update the state of runners, since we see a lot of vacuuming on the Postgres database.

Ideas for how to improve this:

  • rate limit CI requests by dropping them: this is already done, we limit up to 10 TCP connections per IP,
  • handle CI requests with a dedicated part of the infrastructure that only processes CI requests: https://gitlab.com/gitlab-com/infrastructure/issues/320,
  • change the active polling interval from 3s to 6s or another value, and introduce back-off if there are no builds,
  • queue CI requests in Workhorse to slow down Runners when we can't process them in time: https://gitlab.com/gitlab-com/infrastructure/issues/320#note_13739444,
  • implement long polling: the Unicorn server we use doesn't handle long-running requests well, but we could implement this in GitLab Workhorse by introducing some form of queueing on Redis/Workhorse (see the sketch after this list),
  • handle 50x responses better by implementing exponential backoff (also sketched below),
  • find a way to change how GitLab Runner connects to GitLab to receive new builds,
  • make the events run in build/register fully asynchronous: right now it's possible that an event triggers a file system read while a build is being picked for processing; this can lead to problems when disk access is slow, and can surface misleading symptoms such as slow SQL queries dumped from the database: SELECT FOR UPDATE.
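To make the long polling idea more concrete, here is a minimal sketch of how a Workhorse-like Go proxy could hold a build request open until GitLab signals that new builds exist, instead of bouncing every poll off Unicorn. Everything here is an assumption for illustration: the Redis channel name, the 50-second window, and the handler names are invented, not the real Workhorse or GitLab internals.

```go
// Sketch: park Runner requests in the proxy until a Redis notification
// arrives, assuming GitLab publishes to "runner:notify" on new builds.
package main

import (
	"net/http"
	"time"

	"github.com/gomodule/redigo/redis"
)

// waitForBuild blocks up to maxWait for a "new build" notification, so an
// idle Runner holds one parked HTTP connection instead of re-polling every 3s.
func waitForBuild(pool *redis.Pool, maxWait time.Duration) bool {
	conn := pool.Get()
	defer conn.Close()

	psc := redis.PubSubConn{Conn: conn}
	if err := psc.Subscribe("runner:notify"); err != nil {
		return false
	}
	defer psc.Unsubscribe("runner:notify")

	deadline := time.Now().Add(maxWait)
	for time.Now().Before(deadline) {
		switch psc.ReceiveWithTimeout(time.Until(deadline)).(type) {
		case redis.Message:
			return true // a build may be available; let the request through
		case error:
			return false // timeout or connection error: give up this window
		}
		// redis.Subscription confirmations fall through; keep waiting.
	}
	return false
}

func buildRequestHandler(pool *redis.Pool, upstream http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Only hit Unicorn/Rails once we believe a build exists; otherwise
		// answer "no build" when the long-poll window expires.
		if waitForBuild(pool, 50*time.Second) {
			upstream.ServeHTTP(w, r)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	})
}

func main() {
	pool := &redis.Pool{Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", "localhost:6379")
	}}
	upstream := http.NotFoundHandler() // stand-in for the Rails upstream
	http.ListenAndServe(":8181", buildRequestHandler(pool, upstream))
}
```

With something like this in front of Unicorn, an idle Runner costs one parked connection in Workhorse instead of roughly 1200 Rails requests per hour. The improved 50x handling could similarly be a small exponential backoff on the Runner side (again an illustrative sketch, not the actual Runner code):

```go
package main

import "time"

// backoffFor returns how long to wait before retrying after a number of
// consecutive 50x responses: 1s, 2s, 4s, ... capped at about one minute.
func backoffFor(failures int) time.Duration {
	if failures > 6 {
		return time.Minute // cap: 2^6 s = 64s ≈ 1 min
	}
	return time.Second << uint(failures)
}
```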