Deployer leaves servers in DRAIN during some deployment failure scenarios

Problem Statement

Deployer sometimes fails to install a package when a server suffers an issue (lately, disk space problems). Deployer operates on 10% of the fleet at a time, which is roughly 3-4 servers per batch. If the install fails on one server in a batch but completes successfully on the other three, Ansible stops immediately. This fails the deployment and notifies the appropriate people, which is precisely what we want.

The problem is that the package install succeeded on the other three nodes, but Ansible had not yet put them back into rotation. On the next run, deployer sees the target package is already installed and skips those nodes entirely, so it never returns them to rotation. As a result, the nodes are accidentally left in DRAIN indefinitely, until something trips an alert (usually apdex or CPU saturation). At that point the GitLab services have not been restarted or HUP'd, and the nodes are still in state DRAIN, requiring manual remediation. While disk space issues are the root cause here, we should invest some time in making deployer a bit smarter about this scenario.
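To illustrate the failure mode, here is a minimal sketch of the per-node decision deployer effectively makes. The function name and version strings are hypothetical; the point is that the "already installed" check gates the entire play, so the step that returns the node to READY is skipped along with the install:

```shell
# Sketch only: the version check gates the whole play, so a node whose
# install already succeeded never gets its un-drain step on a retry.
node_action() {
  installed="$1"
  target="$2"
  if [ "$installed" = "$target" ]; then
    echo "skip"                 # node never leaves DRAIN on a retry
  else
    echo "install-and-ready"    # install, HUP, then set HAProxy state READY
  fi
}

node_action "14.1.202106280321" "14.1.202106280321"   # already upgraded: skip
node_action "14.0.9" "14.1.202106280321"              # needs upgrade: install-and-ready
```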

Reference:

  1. During install of 14.1.202106280321-fbba57a146e.3cb341774a8 (https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/669459)
  2. Web deployer was running on nodes 13-16: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/4208663
    • We failed on node 13 due to disk space.
    • We succeeded on nodes 14-16
    • This left all 4 nodes in state DRAIN
  3. Retrying a deploy after node 13 is remediated: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/4209366
    • We installed the new package on node 13 and moved it to READY
    • Nodes 14-16 were skipped because their package was already upgraded, so they remained in DRAIN
    • Deployer was running on nodes 21-24
    • Node 22 ran out of disk space
    • The install succeeded on nodes 21, 23, and 24
    • Disk space was manually fixed on node 22
  4. We then retried the deploy yet again: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/4209494
    • Note that nodes 21, 23, and 24 were essentially skipped because they successfully upgraded, so they remained in DRAIN
  5. In the end, nodes 14, 15, 16, 21, 23, and 24 were left in state DRAIN (15% of the fleet)
    • This requires a HUP and an HAProxy READY state change to bring these servers back into rotation
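The manual remediation above can be sketched as a small helper that prints the commands an operator would run per affected node. The backend name, node names, and admin-socket path below are illustrative assumptions, not the production values:

```shell
# Print the manual remediation commands for nodes stuck in DRAIN.
# Backend name, node names, and socket path are hypothetical examples.
drain_remediation_cmds() {
  backend="$1"; shift
  for node in "$@"; do
    # HUP the GitLab services on the node so they pick up the new package
    echo "ssh $node 'sudo gitlab-ctl hup puma'"
    # Return the server to rotation via the HAProxy runtime API
    echo "echo 'set server $backend/$node state ready' | socat stdio /run/haproxy/admin.sock"
  done
}

drain_remediation_cmds https_git web-14 web-15 web-16
```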

Solutions

The proposed solution uses @jarv's idea of simply removing the check for whether the package is already installed. More details are in the MR: https://ops.gitlab.net/gitlab-com/gl-infra/deploy-tooling/-/merge_requests/352
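Conceptually, removing the check makes the install step unconditional (the package install is effectively idempotent), so the HUP and return-to-READY steps always run on a retry. A minimal sketch of the resulting flow, with all names hypothetical:

```shell
# Sketch of the proposed flow: no version check gating the play, so a
# retry re-runs the (idempotent) install and, crucially, the un-drain steps.
deploy_node() {
  target="$1"
  echo "install $target"        # idempotent: effectively a no-op if already installed
  echo "hup services"           # reload GitLab services under the new package
  echo "set state READY"        # always returns the node to rotation
}

deploy_node "14.1.202106280321"
```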

Edited by Amy Phillips