Monitor provisioned machines on machine
We should continuously cross-check the state of docker-machine against the state of Digital Ocean.
In the past I have seen cases where more machines were provisioned on Digital Ocean than docker-machine accounted for locally.
This happened because of API problems on Digital Ocean's side (a failure to create new machines, followed by a failure to remove them, as the machines were in a 422 created state).
We should write a script that would:
- take the runner token,
- take the runner's Digital Ocean API token,
- make an API call to Digital Ocean to list all machines,
- check whether the machines returned by the API still exist locally,
- remove machines that are not in use.
We should probably run this script every hour on each of the managers, and generate a report whenever it detects a situation in which machines had to be removed.
We should probably do the same for all failing machines (machines without a DockerID assigned in /root/.docker/machines/machines/<machine-name>/config.json), which are currently removed by the /root/machines-operation.sh remove-failing script. If we ran that every hour, it would basically delete the problematic entries. For remove-failing we should make sure to delete only entries that are more than 1 hour old, as it is still possible that a newer machine is being created by docker-machine right now.
To solve problem 1, we could use doctl with something like this:
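# Names of the machines that docker-machine knows about locally.
ls -1 /root/.docker/machines/machines > machines.txt
# Read "ID Name" pairs for droplets that belong to this runner.
while read -r DID DNAME; do
  # Skip droplets that still have a matching local machine (exact name match).
  if grep -qxF "$DNAME" machines.txt; then
    continue
  fi
  echo "Removing $DNAME of $DID..."
  # --force skips the interactive confirmation, which would block a background job.
  doctl compute droplet delete --force "$DID" &
done < <(doctl compute droplet list --format ID,Name --no-header | grep -E "[[:space:]]runner-${runner_token_stripped_to_8_characters_from_config_toml}")
# Wait for all background deletions to finish.
wait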
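Since we also want a report, a dry-run variant that only lists the orphaned droplets could be useful before anything gets deleted. This is a rough sketch, not a finished script: the function name find_orphaned_droplets is made up for illustration, it assumes the droplet listing arrives on stdin as "ID Name" lines (for example from doctl with --format ID,Name --no-header), and the prefix in the usage comment is a placeholder token.

```shell
#!/usr/bin/env bash

# find_orphaned_droplets PREFIX MACHINES_DIR   (hypothetical helper)
# Reads "ID NAME" lines on stdin and prints the ones whose name starts with
# PREFIX but has no matching directory under MACHINES_DIR.
find_orphaned_droplets() {
  local prefix="$1" machines_dir="$2" id name _rest
  while read -r id name _rest; do
    # Ignore droplets that do not belong to this runner.
    case "$name" in
      "$prefix"*) ;;
      *) continue ;;
    esac
    # A local machine directory means docker-machine still tracks it.
    [ -d "$machines_dir/$name" ] && continue
    printf '%s %s\n' "$id" "$name"
  done
}

# Possible usage (placeholder token, report path is an assumption):
#   doctl compute droplet list --format ID,Name --no-header \
#     | find_orphaned_droplets "runner-abcd1234" /root/.docker/machines/machines \
#     > orphaned-droplets.txt
```

The output could be attached to the hourly report, and the same lines could later be fed to the delete loop once we trust the results.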
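To solve problem 2, we could probably use a script built around find with an age test such as -mmin +60 (-mtime counts in whole days, so it cannot express one hour) to look for files that were last modified more than 1 hour ago.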
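The age check for failing machines might look roughly like the sketch below. It uses find's -mmin test rather than -mtime, because -mtime works in 24-hour units. Several things here are assumptions: the function name list_stale_failing_machines is invented for illustration, and I am guessing that "no DockerID assigned" means the DockerID key in config.json is either missing or empty, so the grep pattern would need checking against a real config.json.

```shell
#!/usr/bin/env bash

# list_stale_failing_machines MACHINES_DIR [MIN_AGE_MINUTES]   (hypothetical helper)
# Prints the names of machines whose config.json was last modified more than
# MIN_AGE_MINUTES ago (default 60) and carries no DockerID value.
list_stale_failing_machines() {
  local machines_dir="$1"
  local min_age_minutes="${2:-60}"
  local config
  # -mmin +N matches files modified more than N minutes ago.
  find "$machines_dir" -mindepth 2 -maxdepth 2 -name config.json \
       -mmin +"$min_age_minutes" -print |
    while IFS= read -r config; do
      # Assumption: an assigned DockerID is a non-empty string or a non-zero number.
      if ! grep -Eq '"DockerID"[[:space:]]*:[[:space:]]*("[^"]+"|[1-9][0-9]*)' "$config"; then
        basename "$(dirname "$config")"
      fi
    done
}

# Possible usage: list candidates only, without removing anything.
#   list_stale_failing_machines /root/.docker/machines/machines 60
```

This only prints candidate names; the actual removal would still go through whatever remove-failing does today.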
@maratkalibek @tmaczukin What do you think?