How to debug a distributed problem across the GitLab Runner autoscaler chain with Docker Machine and Apache CloudStack?
(copied from gitlab-org/gitlab-runner#4700 (closed), as it probably belongs here instead)
Summary
We have a mission-critical enterprise setup that processes a few hundred pipelines every day.
The problem we experience is recurring but quite hard to pin down. Ideally we would have extended logging in every component to understand what is happening. We may set up a test system running the latest builds or instrumented patches to narrow the problem down, but we cannot wait for upcoming releases of all components.
Maybe other community members have the same setup; it would be great to exchange findings and join forces.
System components: GitLab with autoscaling GitLab shared runners using Docker Machine with the Apache CloudStack driver; the deeper backend is Apache CloudStack running on a set of VMware clusters.
The problem we observe: Docker Machine, as managed by the combination of GitLab, the GitLab Runner autoscaling service and the driver, randomly fails to correctly configure virtual machines for pipelines and eventually throws them away. The error messages vary: it cannot connect to the API because the Docker endpoint IP is empty, there is a bad certificate, or there is a timeout. There is also a set of other symptoms and sub-problems. Yet we cannot observe any hardware overload on the GitLab Runner autoscaling machine, nor are the problems Docker Machine reports real in the sense that CloudStack does produce virtual machines that are reachable when we validate the network configuration ourselves.
Steps to reproduce
- Configure and run the autoscaling service following the documentation, using Docker Machine with the Apache CloudStack driver.
- Observe the output of `docker-machine ls` and the system journal.
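The churn is easiest to see when the captured `docker-machine ls` lines are summarized by whether the machine ever received a Docker endpoint IP. A minimal sketch (the sample lines mimic the column layout shown later in this report; in the live setup, pipe `docker-machine ls | tail -n +2` into the same awk program):

```shell
#!/bin/sh
# Summarize captured `docker-machine ls` lines by whether the machine
# ever received a Docker endpoint IP (the empty "tcp://:2376" symptom).
sample='runner-a - cloudstack Running tcp://:2376 Unknown err
runner-b - cloudstack Running tcp://172.16.1.179:2376 Unknown err'

summary=$(printf '%s\n' "$sample" | awk '
  $5 == "tcp://:2376"  { empty++ }    # endpoint IP missing entirely
  $5 ~ /tcp:\/\/[0-9]/ { withip++ }   # IP assigned, engine still unreachable
  END { printf "empty-IP=%d with-IP=%d", empty+0, withip+0 }')
echo "$summary"
```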
Example Project
n/a
What is the current bug behavior?
Docker Machine reports bogus problems and destroys virtual machines all the time, only to build them up again in a never-ending cycle. Not every runner actually gets a chance to boot up and take on a pipeline.
What is the expected correct behavior?
Docker Machine does not report bogus problems. The GitLab Runner service acts more smoothly, with more log verbosity and pre-/post-condition checks and validations. Failing fast is fine, but not for vague, unconfirmed reasons.
Relevant logs and/or screenshots
Typical output of `docker-machine ls`:

```
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
runner-nypcrwwr-gitlab-1568213119-c9c26eb9 - cloudstack Running tcp://:2376 Unknown Unable to query docker version: Cannot connect to the docker engine endpoint
runner-nypcrwwr-gitlab-1568238664-b8d84212 - cloudstack Running tcp://:2376 Unknown Unable to query docker version: Cannot connect to the docker engine endpoint
runner-nypcrwwr-gitlab-1568267146-18ef201a - cloudstack Running tcp://:2376 Unknown Unable to query docker version: Cannot connect to the docker engine endpoint
runner-nypcrwwr-gitlab-1568267305-4767cc47 - cloudstack Running tcp://172.16.1.179:2376 Unknown Unable to query docker version: Cannot connect to the docker engine endpoint
runner-nypcrwwr-gitlab-1568267308-1a2b1c71 - cloudstack Running tcp://172.16.1.202:2376 Unknown Unable to query docker version: Cannot connect to the docker engine endpoint
```
The strange Docker URL with an empty IP, `tcp://:2376`, is possibly a sync-related bug coming from the driver. On another day you would see the error message `remote error: tls: bad certificate`. On yet another day the error is "no route to host", which is strange enough because when we validate networking (port scan), it looks fine.
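To double-check the "no route to host" reports, the endpoint port can be probed independently of Docker Machine. A sketch using bash's `/dev/tcp` (in production the loop would run over the runner IPs from the `ls` output; the demo here probes a localhost port that should be closed):

```shell
#!/bin/bash
# Probe a Docker endpoint port independently of docker-machine.
# Prints "<ip>:<port> open" or "<ip>:<port> closed"; 2s timeout per probe.
check_port() {
  if timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}
# Production use: for ip in 172.16.1.179 172.16.1.202; do check_port "$ip" 2376; done
result=$(check_port 127.0.0.1 59999)  # demo: a port that is almost certainly closed
echo "$result"
```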
Errors we observe in the system journal of the autoscaling service:

```
ERROR: Error removing host "runner-nypcrwwr-gitlab-1567707443-692c3c42": CloudStack API error 431 (CSExceptionErrorCode: 4350): A key pair with name 'runner-nypcrwwr-gitlab-1567707443-692c3c42'
does not exist for account xxx in specified domain id name=runner-nypcrwwr-gitlab-1567707443-692c3c42 operation=remove
WARNING: Error while stopping machine
error=exit status 1 lifetime=16h4m45.539556761s name=runner-nypcrwwr-gitlab-1567707443-692c3c42
reason=machine is unavailable used=16h2m36.48030586s usedCount=0
```
Checks show that these errors refer to cloud assets that no longer exist because the autoscaling system itself had decided to remove them previously.
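The machine name in such an "Error removing host" line can be extracted mechanically and then grepped for in the earlier autoscaler logs to confirm the double-removal pattern. A sketch on the journal line quoted above (the `journalctl` unit name in the comment is an assumption):

```shell
#!/bin/sh
# Pull the machine name out of an "Error removing host" journal line so
# it can be cross-checked against earlier removal decisions.
line='ERROR: Error removing host "runner-nypcrwwr-gitlab-1567707443-692c3c42": CloudStack API error 431 (CSExceptionErrorCode: 4350)'
name=$(printf '%s\n' "$line" | sed -n 's/.*Error removing host "\([^"]*\)".*/\1/p')
echo "$name"
# Live: journalctl -u gitlab-runner | grep "$name"
```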
Strangely enough, there is a mismatch between what Docker Machine sees and what Apache CloudMonkey reports for the same machines, run directly after the `docker-machine ls` above:
| NAME | STATE |
| --- | --- |
| runner-nypcrwwr-gitlab-1568267308-1a2b1c71 | Running |
| runner-nypcrwwr-gitlab-1568267305-4767cc47 | Running |
| runner-nypcrwwr-gitlab-1568267146-18ef201a | Error |
| runner-nypcrwwr-gitlab-1568213119-c9c26eb9 | Error |
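The two views can also be diffed mechanically. A sketch, with `dm_view` and `cs_view` standing in for the captured name/state columns from `docker-machine ls` and from CloudMonkey (the exact CloudMonkey invocation, e.g. `cmk list virtualmachines filter=name,state`, is an assumption to adapt):

```shell
#!/bin/sh
# Compare machine states as seen by docker-machine vs CloudStack and
# print the names where the two views disagree.
dm_view='runner-18ef201a Running
runner-1a2b1c71 Running'
cs_view='runner-18ef201a Error
runner-1a2b1c71 Running'

printf '%s\n' "$dm_view" > /tmp/dm.$$
printf '%s\n' "$cs_view" > /tmp/cs.$$
# First pass loads docker-machine states; second pass flags mismatches.
mismatch=$(awk 'NR==FNR { dm[$1]=$2; next } dm[$1] != $2 { print $1 }' \
  /tmp/dm.$$ /tmp/cs.$$)
rm -f /tmp/dm.$$ /tmp/cs.$$
echo "$mismatch"
```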
Output of checks
n/a
Results of GitLab environment info
Latest versions of all named components; full output to be done.
Results of GitLab application Check
to be done
Possible fixes
We thought it would be a good idea to reduce the problem to just Docker Machine creating lots of VMs. It turns out there is no working way to pass a timeout configuration to the driver for asynchronous Apache CloudStack jobs, so we patched the driver to change the default timeout from 5 minutes to 1 hour. Strangely enough, the patch worked very well in this reduced setup, but had no effect in the complete context of the production system.
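The reduced setup amounts to a plain churn loop against Docker Machine alone, without the runner in the loop. A sketch; the `--cloudstack-*` flag names follow the driver's README, the `CS_*` environment variables are placeholders for your deployment, and `DRY_RUN=1` (the default here) only prints the commands instead of executing them:

```shell
#!/bin/sh
# Churn VMs through Docker Machine + the CloudStack driver without the
# GitLab Runner autoscaler in the loop, to isolate driver behaviour.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

log=$(for i in 1 2 3; do
  run docker-machine create -d cloudstack \
      --cloudstack-api-url "$CS_API_URL" \
      --cloudstack-api-key "$CS_API_KEY" \
      --cloudstack-secret-key "$CS_SECRET_KEY" \
      --cloudstack-template "$CS_TEMPLATE" \
      "churn-$i"
  run docker-machine rm -y "churn-$i"
done)
echo "$log"
```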
We also see a problem beyond the single-component perspective: every component's team can easily say it is another component's problem, but we think there is an issue in how the components interact that adds up to what we observe. The "fix" would be a fast turnaround for instrumenting components and debugging the system as a whole, faster than jumping from one component release to the next, perhaps by following upstream master-branch builds in a reference setup.
Further symptoms
- Sometimes a VM fails because the `apt-get update` call hits a network error, perhaps because mirrors think we are trying to DDoS them.
- Sometimes a VM goes into an "Error" state, which Docker Machine then recognizes correctly. We are clarifying this with our Apache CloudStack support team.
- If we restart the service, chances are it will work for a while, more or less, but then it gets stuck completely.
Related issues on GitHub
- https://github.com/andrestc/docker-machine-driver-cloudstack/issues/3
- https://github.com/docker/machine/issues/4750
- https://github.com/exoscale/cs/issues/110