Detect nodes left in drain after deploy
Occasionally we attempt to deploy to production and there's an haproxy node which has some backends left in state DRAIN
. While the root cause needs to be determined, this currently slows down the deployment process as we must stop and investigate during the intent to deploy, the status of the haproxy node.
Utilize this issue to consider adding a job to the completion state that outputs the state of all backends for all haproxy nodes. Do not let the job fail, but use this as historical determination if deployer may be assisting in causing this issue.
The last occurrence of this was an attempt to deploy to canary, where fe-05 had many nodes left in DRAIN
:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8293#note_255240100
- https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/775357
I suspect that this behavior is different than prior where haproxy had long running processes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6959 This has since been resolved, and I'm not seeing this same behavior of fe-05.
/cc @jarv