Shared Runner Queues - 2017-01-31
NOTE: This is a recreation, in its entirety, of an issue from before the data loss event.
Our shared runners have experienced major slowdowns due to DB issues. @tmaczukin and I worked all day on identifying and fixing this. Here is what we know so far about the shared runner outage.
At around 2:20 AM, a project started over 3,000 builds within a 10-minute period (24 jobs per commit across 147 commits). This began our major slowdown. Over 1,000 of these builds will remain pending because no runner has the tags required to pick them up.
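As a quick sanity check on the figures above, here is a minimal sketch in Python. The build count uses only the numbers from this report. The `can_pick_up` helper and the tag names are hypothetical, and the matching rule it encodes (a runner needs every tag a build asks for) is an assumption made for illustration, not a description of the actual scheduler code.

```python
# Figures taken from the report above.
jobs_per_commit = 24
commits = 147

total_builds = jobs_per_commit * commits
print(total_builds)  # 3528, consistent with "over 3,000 builds"


def can_pick_up(runner_tags, build_tags):
    """Hypothetical helper. Assumed rule: a runner may take a build
    only if it carries every tag the build requires."""
    return set(build_tags) <= set(runner_tags)


# Made-up tags, for illustration only.
print(can_pick_up({"docker", "linux"}, {"docker"}))   # True
print(can_pick_up({"docker", "linux"}, {"windows"}))  # False
```

Under that assumed rule, a build whose tags no runner carries never becomes eligible for any runner, which is why such builds stay pending regardless of how fast the queue drains.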
Compounding this, we also began to have DB issues as per. This caused major throttling on the shared runners, which could then no longer pick up builds. Unfortunately, there isn't much we can do about this right now: we cannot raise the throttle, because that would just cause more DB problems. We need to get the DB issues under control before the runners can begin processing at full speed.
As we can see, yesterday the runners were able to pick up many builds at once and keep up with the queue, yet so far today the average has been only around 15 builds per runner.
Yesterday:
Today so far:
Is there anything else I've forgotten that is important, @tmaczukin?


