Deploy failure due to patch rollback caused excessive node draining
Corrective action for production#2330 (closed)
Summary
There is a bug in the deployer logic that was triggered by a fatal error related to patch handling. As a result, the deploy left more servers in the DRAIN state than it should have.
The problem started with an empty directory that had been prepared for a demo of the hotpatch tool in the staging environment. The directory contained no patches, but it caused an unexpected error in the production deployment:
The fatal error was triggered by the following task, which looks for patches in <empty-dir>:
with_items: "{{ lookup('fileglob', '<empty-dir>/*').split(',') | sort(reverse=True) }}"
Which resulted in:
fatal: [web-02-sv-gprd.c.gitlab-production.internal]: FAILED! =>
msg: '''list object'' has no attribute ''split'''
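The error message suggests that, for a directory with no matching files, the lookup handed back a list rather than the comma-separated string that `.split(',')` expects. One way to avoid depending on the return type is to ask for a list in every case, using `query` (or `lookup(..., wantlist=True)`). The play below is a minimal sketch of that idea and is not necessarily the change made in the linked merge request; `patch_dir` and its path are hypothetical stand-ins for <empty-dir>.

```yaml
- name: Sketch of an empty-directory-safe patch lookup
  hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical path, standing in for <empty-dir>
    patch_dir: /tmp/hotpatch-demo
  tasks:
    # query() always returns a list, so no split() is needed.
    # With no matching files the loop has zero items and the task is skipped.
    - name: Show patch files in reverse-sorted order
      debug:
        msg: "{{ item }}"
      loop: "{{ query('fileglob', patch_dir ~ '/*') | sort(reverse=True) }}"
```

With this form, an empty directory produces an empty loop instead of a fatal error.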
Operations across the fleet are done in batches of 10%. If all servers in a batch had failed, the deploy would have stopped and we could have recovered manually. However, because we also group canary stage nodes with main stage nodes, we started out by operating on the following batches:
- batch1: web-01 / web-cny-01
- batch2: web-02 / web-cny-02
- batch3: web-03 / web-cny-03
- etc ...
As we ran through these batches, web nodes were drained gradually until the number of drained nodes reached a critical level (8 VMs), which failed the deploy.
By default, Ansible continues executing actions as long as there are hosts in the batch that have not yet failed.
So this appears to be the expected behavior, though we can force an early failure by setting max_fail_percentage.
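As a rough illustration of that setting, the play header below combines batching with `max_fail_percentage: 0`. Ansible aborts the play when the share of failed hosts in a batch exceeds the configured value, so a value of 0 stops the play on the first failed host. The group name `web` and the task contents are placeholders, not the real deploy playbook.

```yaml
- name: Batched deploy that aborts on the first failed host (sketch)
  hosts: web                # placeholder group name
  gather_facts: false
  serial: 10%               # operate on 10% of the hosts per batch
  max_fail_percentage: 0    # any failure in a batch exceeds 0% and aborts the play
  pre_tasks:
    - debug:
        msg: Remove from LB
  tasks:
    - debug:
        msg: Deploy step goes here
  post_tasks:
    - debug:
        msg: Add to LB
```

`any_errors_fatal: true` is a related play-level option with similar effect. The sketch uses `max_fail_percentage` because that is the setting discussed here.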
This problem can be reproduced with the following test play:
- name: test
  gather_facts: false
  hosts:
    - web-01-sv-gstg.c.gitlab-staging-1.internal
    - api-01-sv-gstg.c.gitlab-staging-1.internal
    - web-02-sv-gstg.c.gitlab-staging-1.internal
    - api-02-sv-gstg.c.gitlab-staging-1.internal
    - web-03-sv-gstg.c.gitlab-staging-1.internal
    - api-03-sv-gstg.c.gitlab-staging-1.internal
  vars:
    ansible_python_interpreter: "/usr/bin/env python"
  serial: 50%
  # max_fail_percentage: 0
  pre_tasks:
    - debug:
        msg: Remove from LB
  tasks:
    - debug:
        msg: hi
    - fail:
        msg: fail!
      when:
        - not 'web' in inventory_hostname_short
  post_tasks:
    - debug:
        msg: Add to LB
Resolution
- Set max_fail_percentage: 0: https://ops.gitlab.net/gitlab-com/gl-infra/deploy-tooling/-/merge_requests/282
- Ensure proper empty-directory handling: https://ops.gitlab.net/gitlab-com/gl-infra/deploy-tooling/-/merge_requests/283