Digital Ocean droplet being destroyed
It seems I have figured out why so few builds run on the Shared Runners.
Running ps auxf on the runner machine shows this:
root 17758 0.0 0.2 510976 19836 ? Sl 15:57 0:00 \_ docker-machine provision runner-4e4528ca-machine-1490024913-7e43d82c-digital-ocean-4gb
root 17763 0.0 0.2 215792 19052 ? Sl 15:57 0:00 | \_ /usr/bin/docker-machine
root 29849 0.0 0.0 44788 3720 ? S 16:00 0:00 | \_ /usr/bin/ssh -F /dev/null -o PasswordAuthentication=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=quiet -o ConnectionAttempts=3 -o ConnectTimeout=10 -o ControlMaster=no
root 18916 0.0 0.2 373372 18724 ? Sl 15:57 0:00 \_ docker-machine provision runner-4e4528ca-machine-1490015520-d9ea6854-digital-ocean-4gb
root 18922 0.0 0.2 215792 17056 ? Sl 15:57 0:00 | \_ /usr/bin/docker-machine
root 30377 0.0 0.0 44916 5252 ? S 16:00 0:00 | \_ /usr/bin/ssh -F /dev/null -o PasswordAuthentication=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=quiet -o ConnectionAttempts=3 -o ConnectTimeout=10 -o ControlMaster=no
This indicates that a number of machines failed to be created, so we are now trying to provision them. However, since the creation failed on Digital Ocean, the machines no longer have the "created" status; they have been removed. Previously, Digital Ocean would still create the new machine after some time.
$ docker-machine config runner-4e4528ca-machine-1490024913-7e43d82c-digital-ocean-4gb
Error running connection boilerplate: GET https://api.digitalocean.com/v2/droplets/42994147: 404 The resource you were accessing could not be found.
This basically confirms the assumption. Looking at the Digital Ocean panel:

Simply put, this machine failed to be created and is now marked as destroyed.
However, GitLab Runner takes a very conservative approach:
- run docker-machine create,
- if that fails, run docker-machine provision 3x.
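The flow above can be sketched roughly in shell. This is only an illustration of the create-then-provision sequence as described here, not the Runner's actual code; the function name create_machine and the DM variable are made up for the example, and the driver options that a real docker-machine create needs are left out.

```shell
#!/bin/bash
# DM is the command to invoke; it defaults to the real docker-machine binary.
# (Hypothetical variable, introduced only so the command can be swapped out.)
DM=${DM:-docker-machine}

# create_machine NAME: try to create the machine once; if that fails,
# fall back to provisioning it up to three times.
create_machine() {
    local name=$1 attempt
    "$DM" create "$name" && return 0
    for attempt in 1 2 3; do
        echo "create failed, provision attempt $attempt for $name" >&2
        "$DM" provision "$name" && return 0
    done
    return 1
}
```

If the machine never came into existence, every one of these four calls has to run into its own timeout before the slot is released.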
If we run docker-machine provision in such a case, docker-machine doesn't check whether the VM exists:
docker-machine provision runner-4e4528ca-machine-1490024913-7e43d82c-digital-ocean-4gb
Waiting for SSH to be available...
...
Since this VM never got created, there is most likely nothing on that IP, so we are left hanging.
Each docker-machine provision takes about 5-6 minutes to finish. This means that a "creating slot" is blocked for 6 minutes (for docker-machine create) + 3x6 minutes (for provision), about 24 minutes in total. Thus we are unable to provision enough machines in a reasonable time. This leads to only a few machines being used, as seen on this graph: https://performance.gitlab.net/dashboard/db/ci?panelId=28&fullscreen.
I have a hotfix for that, which we can put into cron:
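To make the arithmetic explicit, here is a small helper (blocked_minutes is a name invented for this example) that computes how long one failed machine holds a slot, assuming every call takes the same time:

```shell
#!/bin/bash
# blocked_minutes PER_CALL_MIN PROVISION_RETRIES
# One create attempt plus N provision attempts, each taking PER_CALL_MIN minutes.
blocked_minutes() {
    echo $(( $1 + $2 * $1 ))
}

blocked_minutes 6 3   # prints 24
```

With the lower estimate of 5 minutes per call the slot is still blocked for 20 minutes.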
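#!/bin/bash
# Find hung "docker-machine provision" processes whose droplet no longer exists.
ps h -o pid,args -C "docker-machine" | grep "docker-machine provision runner-" | while read -r MACHINE_PID MACHINE_CMD MACHINE_ARG MACHINE_NAME REST; do
    # The 404 error may be printed on stderr, so merge it into stdout for grep.
    if docker-machine config "$MACHINE_NAME" 2>&1 | grep "404 The resource you were accessing could"; then
        echo "pid: $MACHINE_PID, cmd: $MACHINE_CMD, arg: $MACHINE_ARG, machine: $MACHINE_NAME, rest: $REST"
        # Drop the stale local machine entry, then stop the hung provision process.
        docker-machine rm -y "$MACHINE_NAME"
        kill "$MACHINE_PID"
    fi
done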
It looks for all docker-machine provision processes and checks whether the machine has been destroyed. If it has, the script synchronises the status with the local system by removing the machine and kills the hung process, which also unblocks GitLab Runner.
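The script relies on read splitting each ps line on whitespace: the first field is the PID, the next two are "docker-machine" and "provision", and the fourth is the machine name. A minimal sketch of just that step, using a helper name (machine_name_from_ps_line) made up for this example:

```shell
#!/bin/bash
# Split a "PID docker-machine provision NAME ..." line the same way the
# hotfix does and print only the machine name (the fourth field).
machine_name_from_ps_line() {
    echo "$1" | { read -r _pid _cmd _arg name _rest; echo "$name"; }
}

machine_name_from_ps_line "17758 docker-machine provision runner-4e4528ca-machine-1490024913-7e43d82c-digital-ocean-4gb"
# prints runner-4e4528ca-machine-1490024913-7e43d82c-digital-ocean-4gb
```

This only works because the grep before the loop keeps lines where "provision" directly follows "docker-machine"; a line with extra options in between would shift the fields.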