CI Queues performance problem since 2018-10-23
2018-10-23
We've recently seen problems with CI Queues performance on GitLab.com several times, but yesterday was the first day we had almost 12 hours of constant outage:
A few general graphs:
excluding DO-based Runner Managers:
including DO-based Runner Managers:
Looking at the graphs, it quickly became obvious that the problem was caused by us exceeding some API Rate Limits in GCP. We've seen this before, and the GCP console confirmed our suspicions:
Since we still hadn't found what causes this quota usage in our configuration, we decided to quickly enable the Digital Ocean based Runners to handle the growth of the queue. This would allow us to investigate what was up with GCP while users' jobs were handled properly. We've used this solution a few times in the past... but unfortunately this time it didn't work.
On the graphs above that include DO-based Runner Managers, we can see a big and constant number of jobs being handled by DO managers between 12-17 UTC. Since the CI queues were still growing, we focused on making the DO Runners work. We quickly found out that the API limits were exceeded. We stopped the Runners for a while (this is the drop on the graph between 14-15:30 UTC) to restore the limits and try again. Unfortunately, after the machines were started again, the situation repeated.
At this moment we found several lines in the logs where Docker Machine binary calls were failing with memory not available errors. Looking at the Memory Usage per Runner Manager node graph, we can see that memory usage on these two machines was hitting 100%. So we decided to resize the machines and see if this helps. On the graphs, at ~17 UTC, we can see a drop in memory usage.
However, the amount of available memory was not the main cause of such big API utilization by the DO based Runners. Just before we started them, we remembered that when we had switched this environment off and left it as a backup, we had set IdleCount = 0 in the autoscaling configuration.
As a side note, here is a little background on how autoscaling works (in the context of this issue), to understand why this was a problem (a minimal sketch follows the list):

- If IdleCount is set to 0, then each time a job is received, Runner tries to provision a machine, until the limit is reached. This means that we could have even 100-200 creation operations at once, which quickly exceeds the available API Rate Limits.
- If IdleCount is set to a value bigger than 0, then Runner manages the number of available machines separately from jobs. When a new job is received, Runner asks for a free machine and waits until one is available. In another routine, the scheduler ensures that there are always IdleCount machines in the idle state, and when a job takes one of them, it requests the creation of a new machine. This means that when Runner receives 100-200 jobs in a short time, new machines are created more or less in batches of IdleCount machines, instead of 100-200 at once.
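To make the difference concrete, here is a minimal, illustrative Go model of the two strategies. This is not GitLab Runner's actual autoscaling code; the pool type and function names are made up for this example:

package main

import "fmt"

// pool is a hypothetical model of a Runner Manager's machine fleet.
type pool struct {
	idle  int // machines waiting in idle state
	inUse int // machines currently running jobs
	limit int // hard cap on the total number of machines
}

// provisionOnDemand models IdleCount = 0: every received job immediately
// triggers a machine creation (up to limit), so a burst of jobs becomes a
// burst of API-heavy create operations.
func (p *pool) provisionOnDemand(jobs int) (created int) {
	for i := 0; i < jobs && p.inUse < p.limit; i++ {
		p.inUse++
		created++
	}
	return created
}

// provisionWithIdlePool models IdleCount > 0: jobs take machines from the
// idle pool, and a separate scheduler routine tops the pool back up, so
// creations happen in batches of at most idleCount at a time.
func (p *pool) provisionWithIdlePool(jobs, idleCount int) (created int) {
	for i := 0; i < jobs && p.idle > 0; i++ {
		p.idle--
		p.inUse++
	}
	// Scheduler routine: refill the idle pool up to idleCount.
	for p.idle < idleCount && p.idle+p.inUse < p.limit {
		p.idle++
		created++
	}
	return created
}

func main() {
	a := &pool{limit: 200}
	fmt.Println("IdleCount=0, 200 jobs:", a.provisionOnDemand(200), "creations at once")

	b := &pool{idle: 15, limit: 200}
	fmt.Println("IdleCount=15, 200 jobs:", b.provisionWithIdlePool(200, 15), "creations in this batch")
}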
Noticing this, we reconfigured the IdleCount setting to 15, and next to 30, to reduce the API utilization of our DO environment. And, as we can see, the DO Runner Managers started to behave normally after ~17 UTC. Unfortunately, the speed of machine creation in DO was much slower than in GCP. So while we got some additional capacity, it didn't have much impact on the general CI Queues performance.
We got back to the GCP API usage issue and again started looking into the logs with the hope that we would finally find out what was causing the quota saturation. Unfortunately, neither the logs nor the graphs gave the answer.
At ~20 UTC the general load of CI jobs on GitLab.com went down enough to resolve the API Rate Limits problem. We monitored the system for the next hour, and since it was still OK, we updated the status of the incident to Monitoring. After several hours of working on this issue, we decided to take a break and get back to it the next morning with fresh minds.
2018-10-24
As decided the day before, we started digging into what exactly generates the Operation read requests that were exceeding the API Limit quotas.
After reading through the GCP documentation we found out that this quota is related to all operations generated by *OperationsService.Get (found at https://cloud.google.com/compute/docs/api-rate-limits).
From the documentation it's not clear what *OperationsService.Get exactly means, but Go's library for GCP names the structures, services and methods in a way that makes it possible to translate them 1:1 to the API names. So we decided to look for *OperationsService structs that have a Get() method.
We grep-ed the sources of the GCP library and indeed found three such services: GlobalOperationsService, RegionOperationsService and ZoneOperationsService.
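For illustration, this is how the three Get() calls look when used against the library's public API (a minimal sketch; client setup and error handling are omitted, and the project/region/zone/operation values are placeholders):

import compute "google.golang.org/api/compute/v1"

// Each of these calls issues one "operation read" request against the GCP
// API, and therefore counts against the quota mentioned above.
func pollOperations(svc *compute.Service, project, region, zone, opName string) error {
	if _, err := svc.GlobalOperations.Get(project, opName).Do(); err != nil {
		return err
	}
	if _, err := svc.RegionOperations.Get(project, region, opName).Do(); err != nil {
		return err
	}
	_, err := svc.ZoneOperations.Get(project, zone, opName).Do()
	return err
}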
We had three projects where this could be used:
- GitLab Runner - which has the GCP library in its dependencies, but doesn't use it explicitly. Because of this, and the fact that Docker Machine is used through system calls to its binary, it was unlikely that this was the source of the issue. However, we performed the search here as well, and it confirmed that this was not the case.
- GCP exporter - a tool that we use to track the usage of some of our GCP quotas and the number of instances. It uses the GCP library explicitly. However, there was no usage of any of the above services.
- Docker Machine - which uses the GCP library when the google driver is used. And bingo! This was the source!
Docker Machine uses GlobalOperationsService and RegionOperationsService to check whether requested operations have finished. And it's done in a loop... https://gitlab.com/gitlab-org/ci-cd/docker-machine/blob/master/drivers/google/compute_util.go#L447-459:
// waitForRegionalOp waits for the regional operation to finish.
func (c *ComputeUtil) waitForRegionalOp(name string) error {
	return c.waitForOp(func() (*raw.Operation, error) {
		return c.service.ZoneOperations.Get(c.project, c.zone, name).Do()
	})
}

// waitForGlobalOp waits for the global operation to finish.
func (c *ComputeUtil) waitForGlobalOp(name string) error {
	return c.waitForOp(func() (*raw.Operation, error) {
		return c.service.GlobalOperations.Get(c.project, name).Do()
	})
}
The waitForOp() function is a simple, infinite loop with a 1s sleep, which exits on any error or on a properly finished operation. But until the operation is finished, it generates one *OperationsService.Get() call every second.
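For illustration, a minimal sketch of such a polling loop (this is a simplification, not the exact upstream implementation; the DONE check and error handling details are assumptions):

// Simplified sketch of a waitForOp-style loop: poll the operation status
// once per second until the API reports it as DONE or a call fails.
func (c *ComputeUtil) waitForOp(opGetter func() (*raw.Operation, error)) error {
	for {
		op, err := opGetter()
		if err != nil {
			return err // any API error aborts the wait
		}
		if op.Status == "DONE" {
			return nil // operation finished
		}
		time.Sleep(1 * time.Second) // one *OperationsService.Get() per second
	}
}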
I've done manual machine create and remove calls on one of our Runner Managers in GCP, and this is the effect:
~# for file in create-machine.log remove-machine.log; do echo $file; grep "DBG | Operation" $file | wc -l; done
create-machine.log
15
remove-machine.log
55
The numbers may differ depending on the specific case of each requested operation, but from what I see, machine removal always generates a few times more checks than a machine create command.
As a side note: from what I found, what makes machine removal take so much longer is disk deletion. It takes ~45-60 seconds for each machine, which with 1-second polling alone accounts for most of the ~55 checks seen above.
Finally, if there is any slowdown in operations handling on the GCP side and/or if we start scheduling more machines, we begin exceeding the API Rate Limits. This makes operations take more time and generate more checks, which creates a death loop that brings our capacity down to almost 0. After the Rate Limits are restored, which based on the graphs takes 30-45 minutes, we're again able to start scheduling machines; but because of the backlog of jobs, we try to create a lot of them at once, which again brings us to the rate-limit-exceeding death loop. And the situation repeats for several hours, until the load of jobs goes down and the system is able to finalize all operations before exceeding the limits.
This is why we see regular up-and-down changes on the graphs when they are limited to GCP-based Runner Managers.
We found the reason at ~13:00 UTC, updated the status of the incident to Identified and started working on a fix. Fortunately, for such problems we maintain our own fork of Docker Machine, where we can quickly add the required changes, deploy them to our infrastructure, and if they work properly - contribute the changes upstream to the Docker Machine project.
At this moment (2018-10-24 19:15 UTC) we've prepared an MR with the required changes - gitlab-org/ci-cd/docker-machine!7 (merged). The binary was tested manually, and the difference in the number of *OperationsService.Get() calls is significant, especially for removal commands:
~# grep -o 'DBG | Operation "operation-' *.log | sort | uniq -c
10 create-machine.new.log:DBG | Operation "operation-
15 create-machine.old.log:DBG | Operation "operation-
11 remove-machine.new.log:DBG | Operation "operation-
55 remove-machine.old.log:DBG | Operation "operation-
- 10 vs 15 calls for a creation command
- 11 vs 55 (!!!) calls for a removal command
As mentioned before, the specific numbers of calls may differ by a few up or down, but the general tendency is visible. Using the above comparison of two manual tests - with the old and the patched version of Docker Machine - we should expect ~50 fewer requests per machine creation/removal cycle. This is a 70% reduction in generated requests!
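One way to get such a reduction - and this is a sketch of the general technique, not necessarily the exact change from the MR - is to poll with a growing interval instead of a fixed 1-second sleep:

// Sketch of polling with exponential backoff. The interval doubles after
// every check up to a cap, so a ~60 second disk deletion costs ~7-8 Get()
// calls instead of ~60 with fixed 1-second polling.
func (c *ComputeUtil) waitForOpWithBackoff(opGetter func() (*raw.Operation, error)) error {
	sleep := 1 * time.Second
	const maxSleep = 16 * time.Second
	for {
		op, err := opGetter()
		if err != nil {
			return err
		}
		if op.Status == "DONE" {
			return nil
		}
		time.Sleep(sleep)
		if sleep < maxSleep {
			sleep *= 2 // 1s, 2s, 4s, 8s, 16s, then stay at 16s
		}
	}
}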
What's left now is to merge this MR and deploy the new Docker Machine version on our Runner Managers. And then observe if it resolves the problem.