Deploy failure due to patch rollback caused excessive node draining

Corrective action for production#2330 (closed)

Summary

A fatal error in patch handling triggered a bug in the deployer logic, and the deploy left more servers in the DRAIN state than it should have.

The problem started with an empty directory that was prepped for a demo of the hotpatch tool for the staging environment. The directory didn't contain any patches, but it caused an unexpected error in the production deployment.

The fatal error was triggered by the following task, which looks for patches in <empty-dir>:

  with_items: "{{ lookup('fileglob', '<empty-dir>/*').split(',') | sort(reverse=True) }}"

Which resulted in:

fatal: [web-02-sv-gprd.c.gitlab-production.internal]: FAILED! => 
  msg: '''list object'' has no attribute ''split'''
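
One way to make the lookup tolerate an empty directory is to ask for a list explicitly. The sketch below is an assumption, not necessarily what the fix in deploy-tooling does: query() (equivalent to lookup(..., wantlist=True)) always returns a list, so there is nothing to split and an empty directory yields zero items. The task name and the patch_dir variable are hypothetical.

```yaml
# Hypothetical task; patch_dir stands in for the real patch directory.
- name: List patches in reverse sort order
  debug:
    msg: "{{ item }}"
  # query() returns a list even when nothing matches, so no .split(',') is needed.
  loop: "{{ query('fileglob', patch_dir ~ '/*') | sort(reverse=True) }}"
```

With no matching files the loop is empty and the task is skipped instead of failing.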

Operations across the fleet are done in batches of 10%. If all servers in a batch had failed, the deploy would have stopped and we could have recovered manually. But because we also group canary stage nodes with main stage nodes, we started out by operating on the following batches:

batch1: web-01 / web-cny-01
batch2: web-02 / web-cny-02
batch3: web-03 / web-cny-03
etc ...

As we ran through these batches, web nodes were drained a few at a time until the number of drained nodes reached a critical threshold (8 VMs), which failed the deploy.

By default, Ansible will continue executing actions as long as there are hosts in the batch that have not yet failed.

https://docs.ansible.com/ansible/latest/user_guide/playbooks_delegation.html#maximum-failure-percentage

So it looks like this is the expected behavior, though we can force the play to fail early by setting max_fail_percentage.
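
For illustration, a minimal sketch of where the keyword goes. The play name, host group, and task are hypothetical; only serial and max_fail_percentage come from this issue. Per the Ansible documentation the percentage must be exceeded rather than equaled, so a value of 0 means a single failed host in a batch should abort the play.

```yaml
# Hypothetical play; only serial and max_fail_percentage reflect this issue.
- name: deploy
  hosts: web                 # hypothetical inventory group
  serial: 10%                # operate on the fleet in batches of 10%
  max_fail_percentage: 0     # abort the play as soon as any host in a batch fails
  tasks:
    - name: Placeholder for the real deploy tasks
      debug:
        msg: deploying
```

Setting any_errors_fatal: true on the play is a related option with a similar effect.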

This problem can be reproduced with the following test play:

- name: test
  gather_facts: false
  hosts:
    - web-01-sv-gstg.c.gitlab-staging-1.internal
    - api-01-sv-gstg.c.gitlab-staging-1.internal
    - web-02-sv-gstg.c.gitlab-staging-1.internal
    - api-02-sv-gstg.c.gitlab-staging-1.internal
    - web-03-sv-gstg.c.gitlab-staging-1.internal
    - api-03-sv-gstg.c.gitlab-staging-1.internal
  vars:
    ansible_python_interpreter: "/usr/bin/env python"
  serial: 50%
  # max_fail_percentage: 0
  pre_tasks:
    - debug:
        msg: Remove from LB
  tasks:
    - debug:
        msg: hi
    # Fail only the non-web (api) hosts, so every batch keeps at least one surviving host.
    - fail:
        msg: fail!
      when:
        - "'web' not in inventory_hostname_short"
  post_tasks:
    - debug:
        msg: Add to LB

Resolution

Set max_fail_percentage: 0 https://ops.gitlab.net/gitlab-com/gl-infra/deploy-tooling/-/merge_requests/282
Ensure proper empty-directory handling https://ops.gitlab.net/gitlab-com/gl-infra/deploy-tooling/-/merge_requests/283

Edited Jun 30, 2020 by John Jarvis