Replace takeoff with a different ssh orchestration tool
Over time, the takeoff repo has been adapted into a complex orchestration tool for deploying to different services in the GitLab fleet. To do this, it requires the following features:
- pre- and post-deploy actions
- fleet discovery
- load balancer draining
- metric collection and reporting
- configurable concurrency
- configurable retries
- resumes for failures
- run from GitLab chatops (non-interactive)
Because takeoff is a home-grown tool, some of these features are only partially implemented or not implemented at all. As we move the deploy logic to GitLab chatops, these features will become even more important, because there will no longer be a human driving the deploy process who can reason about failures and work around problems.
This issue is primarily to discuss moving the deploy logic to Ansible. A simple PoC was built to show what this might look like with Ansible as the orchestration tool - https://gitlab.com/gitlab-com/gl-infra/takeoff-ansible-poc
- Speed and simplicity - DSL on top of simple YAML
- We frequently see SSH connection timeouts with takeoff; Ansible gives us fine-grained control over retries.
- Easier to create pre and post checks for changes, avoid unnecessary restarts.
- Task retries for failures
- Built in support for apt and many modules
- Excellent support for rolling updates, much more configurable than knife ssh ... - https://docs.ansible.com/ansible/latest/user_guide/playbooks_delegation.html#rolling-update-batch-size
- Plenty of options for integrating with Slack, and with Pushgateway for metrics, throughout the deployment process
- If we continue to need to apply patch files it makes much more sense to use the same tool.
- This will be a very fast drop-in replacement
- takeoff is a known quantity
- Ansible is Python-based
- it may not be something we use at all for k8s deploys
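
To make the rolling-update, retry, and pre/post-check points above more concrete, here is a minimal sketch of what such a playbook might look like. It is not taken from the PoC repository. The inventory group `web`, the variables `package_name`, `package_version`, and `health_url`, and the `lb-drain` / `lb-enable` helper scripts are all hypothetical placeholders; only the Ansible keywords and builtin modules are real.

```yaml
# Illustrative sketch only: group, variables, and helper scripts are assumptions.
- name: Rolling deploy with drain, retries, and health check
  hosts: web                     # assumed inventory group
  become: true
  serial: "25%"                  # rolling-update batch size: a quarter of the hosts per batch
  max_fail_percentage: 10        # abort the play if more than 10% of a batch fails

  pre_tasks:
    - name: Drain the node from the load balancer
      # lb-drain is a hypothetical helper script, run from the control host
      ansible.builtin.command: /usr/local/bin/lb-drain {{ inventory_hostname }}
      delegate_to: localhost
      become: false

  tasks:
    - name: Install the requested package version
      ansible.builtin.apt:
        name: "{{ package_name }}={{ package_version }}"   # assumed variables
        update_cache: true
      register: apt_result
      retries: 3                 # retry the task up to 3 times
      delay: 10                  # wait 10 seconds between attempts
      until: apt_result is succeeded

  post_tasks:
    - name: Wait until the service passes its health check
      ansible.builtin.uri:
        url: "{{ health_url }}"  # assumed variable pointing at a health endpoint
        status_code: 200
      register: health
      retries: 12
      delay: 5
      until: health.status == 200
      delegate_to: localhost
      become: false

    - name: Put the node back into the load balancer
      # lb-enable is a hypothetical helper script, run from the control host
      ansible.builtin.command: /usr/local/bin/lb-enable {{ inventory_hostname }}
      delegate_to: localhost
      become: false
```

Assuming the file is saved as `deploy.yml`, it could be run non-interactively with something like `ansible-playbook -i inventory deploy.yml -e package_version=<version>`. Parallelism within a batch can be adjusted with `--forks`, and a partially failed run can be narrowed to specific hosts with `--limit` or resumed from a given task with `--start-at-task`.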