CI's distributed heavy polling, combined with exclusive row locks held for seconds, takes GitLab.com down.
Timeline
Over the last few months we have been suffering from high load on the DB.
As some of you may know, we use PostgreSQL 9.2 (PG from now on) as our backend DB. PG uses a technique called vacuum
to keep table and index bloat under control and to avoid transaction ID wraparound. It is essentially a garbage collection process: a cleanup that the database triggers automatically (or that you can run manually) once a table accumulates enough dead rows.
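For a sense of what that looks like in practice, the dead-row counters and the manual command are plain PostgreSQL. This is just an illustrative sketch (the ci_builds table is simply the one this post cares about), not our exact tooling:

```sql
-- How much garbage each table has accumulated, and when autovacuum last ran on it.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Trigger the cleanup by hand for a single table.
VACUUM ANALYZE ci_builds;
```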
For a couple of months we blamed vacuum as the source of the trouble, simply because it was always running whenever we had problems, but it is not entirely at fault. The pattern is that at some point in the day the system load on the database climbs to ~300 and stays there for a while before eventually coming back down.
This can last a minute, or five, or more. While it does, GitLab.com is slow, and it can even go down with 502 errors.
We worked hard to start using pg_repack
, a PostgreSQL extension that allows us to run an almost lock-free vacuum.
This helped a lot, but didn't make the problem go away completely - now we see a load of 300 for about 30 seconds, which only slows GitLab.com down rather than taking it out entirely. Sadly Pingdom disagrees, and as you can see our uptime is being punished because of this.
That may be because Pingdom has a lower tolerance for latency than we do, but it doesn't really matter: our users were complaining, and we were suffering from it ourselves.
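For reference, pg_repack hooks into PostgreSQL as a regular extension; the repacking itself is driven by its client program, which rebuilds a table online and only takes short locks at the beginning and end. A minimal sketch of enabling it (not our exact setup):

```sql
-- Install the extension in the database; the repack runs themselves are
-- started from the pg_repack client program, not from SQL.
CREATE EXTENSION IF NOT EXISTS pg_repack;

-- Confirm it is available.
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_repack';
```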
Our monitoring was also telling us that there was a huge spike in locks whenever the high load showed up.
So we started investigating down that path. These investigations showed that CI was misbehaving.
The issue is that every runner polls for a new build once every 3 seconds (by default; some poll even more often). GitLab.com is growing, so every day more runners are polling for builds to run. This translates into ~3.3M polls per hour, roughly 900 requests per second, reaching the database only to get nothing out of it. In total we only hand out about 2K builds to run.
But that's not the main issue. The big problem comes when a build is granted: the API endpoint performs a select for update
, taking an exclusive lock on the row, and then goes down to the filesystem to fetch a commit from the git layer, which can take several seconds, so the exclusive lock is held for all that time. (Check the code here, here and here )
If more than one runner tries to lock the same build, as the shared runners do, we get contention on exclusive locks at the database level. Hence the high load: processes end up waiting on each other, not allowing the database to do anything else, so effectively no other query can be executed. And if vacuum also happens to want a lock at that moment to reorganize the table and, most importantly, the indexes, GitLab.com goes down.
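To make the failure mode concrete, here is a sketch in plain SQL of what the endpoint effectively does, with the slow git access replaced by a pg_sleep and a made-up build id; the exact statements GitLab runs differ, but the locking pattern is the same:

```sql
-- Session A: a runner is granted build 1234.
BEGIN;
SELECT "ci_builds".* FROM "ci_builds"
WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."id" = 1234
LIMIT 1 FOR UPDATE;   -- exclusive row lock taken here

-- The application now goes to the filesystem to fetch the commit,
-- which can take several seconds; simulate that delay:
SELECT pg_sleep(5);

UPDATE ci_builds SET status = 'running' WHERE id = 1234;  -- illustrative update
COMMIT;               -- only now is the row lock released

-- Session B: another runner (e.g. a shared runner) asking for the same build
-- issues the same SELECT ... FOR UPDATE and blocks until session A commits.
-- Multiply this by hundreds of runners, add a vacuum that needs a conflicting
-- lock on the table or its indexes, and the pile-up takes the site down.
```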
This Sunday we finally reached the tipping point: the high load on the database was constant for one hour. We reacted by blocking API access to GitLab.com completely, and the load went away. We then narrowed the block down to just the builds/register
API path, and the load stayed down.
With this data we decided to start throttling this API endpoint, trying to reach the sweet spot where we push back on the waste of resources while still serving CI. Clearly we have not reached that point yet and are still working on it.
In fact, we created a situation in which the problem became much more evident and clear:
When this happened we were in the DB and managed to catch CI with a smoking gun:
679 exclusive locks from queries like `SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."id" = $1 LIMIT 1 FOR UPDATE`
Check for yourself if you want in this file: ci-exclusive-locks
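That snapshot comes straight out of the database's own bookkeeping; on PostgreSQL 9.2 something along these lines (standard catalog views, not our exact query) shows who is waiting and which locks are held on ci_builds:

```sql
-- Backends currently waiting on a lock, grouped by the query they are running.
SELECT count(*) AS n_waiting, query
FROM pg_stat_activity
WHERE waiting
GROUP BY query
ORDER BY count(*) DESC;

-- The locks themselves for the ci_builds table.
SELECT l.pid, l.mode, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'ci_builds'::regclass;
```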
So what are we doing about it?
Now
At this point we are just defending GitLab.com from CI.
To do this we are rate limiting requests, allowing only 10 concurrent requests to the CI endpoint at any given time.
The reasoning behind this is that these operations should be fast (sometimes, when they reach the filesystem, they are not), so even with the limit all the runners should still be attended to. Apparently we are not quite there yet and still need to work on improving the rate limiting, but so far we have noticed that our shared runners are working fine in general (not a lot of builds sitting unattended in the queue).
We sincerely apologize for any trouble this is causing, but we think that keeping GitLab.com up and available is a better user experience than a really fast CI build start; your builds may just need to wait a few more seconds.
Short term
- Remove the git access from the API request by performing an async call: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/6650
- Queue API requests by delaying them at the workhorse level instead of at HAProxy: gitlab-org/gitlab-workhorse!65 (merged)
- Add a caching layer in front of GitLab.com so that build DB queries are only performed every now and then. (can't find the issue sadly)
At that point we should be able to remove the rate limiting because it will no longer be necessary.
Mid term
Move away from polling towards callbacks: each runner registers a URL, and we call it when there is a build for it to run. The goal is to completely remove this polling.