Prevent residual HAProxy processes by setting `hard-stop-after`

Production Change - Criticality 3 C3

Change Objective	Prevent residual HAProxy processes from lingering indefinitely after a graceful reload or restart. Do this by configuring a 30-minute deadline, after which any residual connections to the old HAProxy process will be closed.
Change Type	ConfigurationChange
Services Impacted	HAProxy. Adjacent upstream services should reconnect after HAProxy restarts
Change Team Members	@msmiley @jarv
Change Severity	C3
Buddy check or tested in staging	Both staging check and buddy check (need volunteer)
Schedule of the change	TBD: Date and time (with timezone)
Duration of the change	1 hour

Pre-production steps:

Merge the change to the gitlab-haproxy cookbook: gitlab-cookbooks/gitlab-haproxy!173 (merged). This implicitly creates a merge request in chef-repo to bump the version of the gitlab-haproxy cookbook to 0.1.149 in all environments, which we will merge later (after validation). UPDATE: Use the newer version 0.1.151. It includes newer, unrelated but safe changes (a security improvement for camoproxy and the removal of obsolete role attributes that never reached prod).
Publish to Chef server the version bump for the gstg environment: knife environment from file environments/gstg.json
Set the feature flag to true in the chef-repo's gstg-base-lb role file, and apply it with chef-repo's bin/apply_chef_changes.
Choose an example host for testing in the gstg environment that is assigned a role that uses this cookbook: knife search -i 'roles:gstg-base-lb' | sort (e.g. fe-01-lb-gstg.c.gitlab-staging-1.internal)

Validate the change on the example host in the staging environment:

Run Chef client: sudo chef-client
Verify the config file has the expected change: sudo grep 'hard-stop-after' /etc/haproxy/haproxy.cfg
Verify a new haproxy process has started (and probably at least one old process still exists): pgrep -u haproxy -f "/usr/sbin/haproxy" | xargs -r ps -o pid,lstart,etime,args --sort start_time
Verify the new haproxy process has established TCP connections (with HAPROXY_PID set to the PID of the new process found in the previous step): sudo netstat -atnp | grep -w "$HAPROXY_PID" | grep -c 'ESTABLISHED'
Review the haproxy logs: sudo tail -f /var/log/haproxy.log
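
The config and connection checks above can be wrapped in small helpers. The following is a minimal sketch, not part of the original change plan: the function names hard_stop_after_value and count_established are hypothetical, and the second assumes the usual Linux `netstat -atnp` column layout (state in column 6, PID/program in column 7). Matching on the whole PID/haproxy field avoids counting lines where the PID merely coincides with a port number.

```shell
#!/usr/bin/env bash

# Print the value of the hard-stop-after directive in the given haproxy
# config file; return non-zero if the directive is absent.
hard_stop_after_value() {
    local cfg="$1"
    awk '$1 == "hard-stop-after" { print $2; found = 1 } END { exit !found }' "$cfg"
}

# Read `netstat -atnp` output on stdin and print how many ESTABLISHED
# connections belong to the given haproxy PID.
# Assumes column 6 is the state and column 7 is "PID/program".
count_established() {
    local pid="$1"
    awk -v owner="${pid}/haproxy" \
        '$6 == "ESTABLISHED" && $7 == owner { n++ } END { print n + 0 }'
}
```

Possible usage on the example host, run as a user who can read the config file: `hard_stop_after_value /etc/haproxy/haproxy.cfg` and `sudo netstat -atnp 2> /dev/null | count_established "$HAPROXY_PID"`.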

Apply the config change to all non-production environments (in this case, only gstg and pre):

Merge the chef-repo merge request. Its CI Pipeline will automatically publish to Chef server the new pinned version numbers for the updated cookbook. Separate pipeline jobs handle non-production (gstg, pre) vs. production (gprd, dr) environments, and the job for the production environments will wait for manual confirmation. Wait until the scheduled change window to let the pipeline apply the version bump to the production environments.
Set the feature flag to true in the other non-prod environments' chef-repo roles: {gstg,pre}-base-lb. Merge this as a merge-request, and use the standard chef-repo pipeline to apply.
Review grafana dashboards (although they may have no data for non-production environments): HAProxy and HAProxy Process Overview

Manually kill all residual haproxy processes in non-production environments:

Note: The new setting in haproxy.cfg will handle this for future maintenance, but the existing residual processes must be manually killed since they are not using that config setting.

Find all Chef nodes running the gitlab-haproxy cookbook. They all use role <env>-base-lb: for GENV in "gstg" "pre" ; do knife search "roles:${GENV}-base-lb" -a fqdn 2> /dev/null | awk '/fqdn:/ { print $2 }' | sort > ./host_list.$GENV ; done
For each host in gstg, run the clean up script: mussh -H ./host_list.gstg -C ./kill_residual_haproxy_processes.sh
For each host in pre, run the clean up script: mussh -H ./host_list.pre -C ./kill_residual_haproxy_processes.sh
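
Because the knife search above discards stderr, a failed search would silently produce an empty host list. A minimal sketch of a guard that could be run before mussh follows; the helper name require_hosts is hypothetical and not part of the original plan.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only if the host list file exists and
# contains at least one non-blank line; print the host count on success.
require_hosts() {
    local list="$1" count
    count=$(grep -c '[^[:space:]]' "$list" 2> /dev/null) || count=0
    if (( count == 0 )); then
        echo "No hosts in ${list}; check the knife search before running mussh." >&2
        return 1
    fi
    echo "${count} hosts in ${list}"
}
```

For example: `require_hosts ./host_list.gstg && mussh -H ./host_list.gstg -C ./kill_residual_haproxy_processes.sh`.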

Production apply steps:

~~Run chef-client in the production environments (gprd and dr) by clicking the manual "apply_to_prod" step in the pipeline for the chef-repo merge request that ran after merging that MR.~~
~~Set the feature flag to true in the other prod environments: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1875.~~ - Reverted due to https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7535
https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1877

Manually kill all residual haproxy processes in production environments:

Note: The new setting in haproxy.cfg will handle this for future maintenance, but the existing residual processes must be manually killed since they are not using that config setting.

Find all Chef nodes running the gitlab-haproxy cookbook. They all use role <env>-base-lb: for GENV in "gprd" "dr" ; do knife search "roles:${GENV}-base-lb" -a fqdn 2> /dev/null | awk '/fqdn:/ { print $2 }' | sort > ./host_list.$GENV ; done
For each host in gprd, run the clean up script: mussh -H ./host_list.gprd -C ./kill_residual_haproxy_processes.sh
For each host in dr, run the clean up script: mussh -H ./host_list.dr -C ./kill_residual_haproxy_processes.sh

Rollback steps:

Revert the chef-repo merge request to roll back the cookbook version number pinned in each environment, as documented in our chef runbook.
Expedite the rollback by manually running Chef client on all affected hosts: knife ssh 'roles:gprd-base-lb' 'sudo chef-client'

Clean-up script:

This script is meant to be run locally on each haproxy host to find and kill any residual haproxy processes. It should leave exactly 2 processes running /usr/sbin/haproxy: one owned by user root and the other owned by user haproxy.
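
The expected end state (exactly one root-owned and one haproxy-owned process) can be checked after the clean-up. This is a minimal sketch under stated assumptions: the helper name check_haproxy_process_count is hypothetical, and it expects "USER COMMAND..." lines on stdin, such as those produced by `ps -eo user=,args=`, with haproxy started by its full path.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read "USER COMMAND..." lines on stdin, print a summary,
# and succeed only if exactly one root-owned and one haproxy-owned
# /usr/sbin/haproxy process are present.
check_haproxy_process_count() {
    awk '
        $2 == "/usr/sbin/haproxy" { count[$1]++; total++ }
        END {
            printf "root=%d haproxy=%d total=%d\n", count["root"], count["haproxy"], total
            exit !(total == 2 && count["root"] == 1 && count["haproxy"] == 1)
        }
    '
}
```

For example: `ps -eo user=,args= | check_haproxy_process_count`.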

Clean-up script kill_residual_haproxy_processes.sh:

Tested via DRYRUN mode on all flavors of haproxy recipe.

#!/usr/bin/env bash
# Find PID of the youngest "haproxy" process.  It should be the active listener.
ACTIVE_LISTENER_PID=$( pgrep -u haproxy -f "/usr/sbin/haproxy" | xargs -r ps -o pid= --sort start_time | tail -n1 )
# Confirm it is the active listener bound to at least one TCP port.
# Use a brace group, not a subshell, so that "exit 1" stops the whole script.
sudo netstat -ltpn 2> /dev/null | grep -q -w "${ACTIVE_LISTENER_PID}/haproxy" || { echo "Aborting! Youngest process is not listening." >&2 ; exit 1 ; }
# Kill any residual processes.
for RESIDUAL_PID in $( pgrep -u haproxy -f "/usr/sbin/haproxy" | xargs -r ps -o pid= --sort start_time | head -n-1 )
do
    NUM_CONNECTIONS=$( sudo netstat -atpn | grep -c -w "${RESIDUAL_PID}/haproxy" )
    if [[ -n "$DRYRUN" ]] ; then
        echo "DRY RUN: Would kill residual haproxy PID $RESIDUAL_PID ($NUM_CONNECTIONS connections)"
    else
        echo "Killing residual haproxy PID $RESIDUAL_PID ($NUM_CONNECTIONS connections)"
        sudo kill "$RESIDUAL_PID"
    fi
done
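
The script treats the youngest haproxy-owned process as the active listener and every older one as residual. A minimal sketch of that selection, isolated so it can be exercised without a running haproxy, follows; the function name residual_pids is hypothetical and the input is assumed to be one PID per line, oldest first, as produced by the `ps --sort start_time` pipeline in the script.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read PIDs on stdin (one per line, oldest first) and
# print every PID except the last one, which is the youngest process.
# Leading whitespace from `ps -o pid=` is dropped by awk field splitting.
residual_pids() {
    awk 'NR > 1 { print prev } { prev = $1 }'
}
```

With a single PID on stdin the function prints nothing, which matches the case where no residual processes exist and nothing should be killed.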

Edited Sep 23, 2019 by John Skarbek