Prevent residual HAProxy processes by setting `hard-stop-after`
Production Change - Criticality 3

Change Objective | Prevent residual HAProxy processes from lingering indefinitely after a graceful reload or restart by configuring a 30-minute deadline, after which residual connections to the old HAProxy process will be closed. |
---|---|
Change Type | ConfigurationChange |
Services Impacted | HAProxy. Adjacent upstream services should reconnect after HAProxy restarts |
Change Team Members | @msmiley @jarv |
Change Severity | C3 |
Buddy check or tested in staging | Both staging check and buddy check (need volunteer) |
Schedule of the change | TBD: Date and time (with timezone) |
Duration of the change | 1 hour |
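For orientation, the setting in question is HAProxy's `hard-stop-after` directive, which belongs in the `global` section of the config. The sketch below writes an illustrative fragment to a temporary file and reads the configured deadline back. The temporary file and the `hard_stop_value` helper are assumptions for illustration; the real `haproxy.cfg` is rendered by the cookbook and contains much more.

```shell
#!/usr/bin/env bash
# Illustrative only: a minimal config fragment carrying the 30-minute deadline.
# This is NOT the cookbook's rendered template.
SAMPLE_CFG=$(mktemp)
cat > "$SAMPLE_CFG" <<'EOF'
global
    hard-stop-after 30m
EOF

# Hypothetical helper: print the value of the hard-stop-after directive, if any.
hard_stop_value() {
  awk '$1 == "hard-stop-after" { print $2 }' "$1"
}

hard_stop_value "$SAMPLE_CFG"
```

With this setting, an old process that is still draining connections after a reload is stopped once the deadline passes instead of lingering indefinitely.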
Pre-production steps:
- Merge the change to the `gitlab-haproxy` cookbook: gitlab-cookbooks/gitlab-haproxy!173 (merged). This implicitly creates a merge request in `chef-repo` to bump the version of the `gitlab-haproxy` cookbook to `0.1.149` in all environments, which we will merge later (after validation). UPDATE: Use the newer version `0.1.151`. It includes newer unrelated but safe changes (security improvement for camoproxy and removal of obsolete role attributes that never reached prod).
- Publish to Chef server the version bump for the `gstg` environment: `knife environment from file environments/gstg.json`
- Set the feature flag to true in the chef-repo's `gstg-base-lb` role file, and apply it with chef-repo's `bin/apply_chef_changes`.
- Choose an example host for testing in the `gstg` environment that is assigned a role that uses this cookbook: `knife search -i 'roles:gstg-base-lb' | sort` (e.g. `fe-01-lb-gstg.c.gitlab-staging-1.internal`)
Validate the change on the example host in the staging environment:
- Run Chef client: `sudo chef-client`
- Verify the config file has the expected change: `sudo cat /etc/haproxy/haproxy.cfg | grep 'hard-stop-after'`
- Verify a new haproxy process has started (and probably at least one old process still exists): `pgrep -u haproxy -f "/usr/sbin/haproxy" | xargs -r ps -o pid,lstart,etime,args --sort start_time`
- Verify the new haproxy process has established TCP connections, where `$HAPROXY_PID` is the PID of the new process found in the previous step: `sudo netstat -atnp | grep -w "$HAPROXY_PID" | grep 'ESTABLISHED' | wc -l`
- Review the haproxy logs: `sudo tail -f /var/log/haproxy.log`
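Two of the manual checks above could be wrapped in small helpers so they give a pass/fail answer rather than output to eyeball. This is only a sketch: the function names `has_hard_stop_after` and `count_established` are hypothetical, and the connection counter assumes the usual `netstat -atnp` column layout (state in column 6, `PID/Program name` in column 7).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the cookbook or chef-repo.

# Succeeds if the given config file contains a hard-stop-after directive
# with a value.
has_hard_stop_after() {
  grep -Eq '^[[:space:]]*hard-stop-after[[:space:]]+[^[:space:]]+' "$1"
}

# Reads `netstat -atnp` style lines on stdin and prints the number of
# ESTABLISHED connections owned by the given PID (matched as "PID/haproxy").
count_established() {
  local pid="$1"
  awk -v owner="${pid}/haproxy" \
    '$6 == "ESTABLISHED" && $7 == owner { n++ } END { print n + 0 }'
}
```

On a host, usage might look like `has_hard_stop_after /etc/haproxy/haproxy.cfg && echo ok` (run with enough privilege to read the file) and `sudo netstat -atnp | count_established "$HAPROXY_PID"`. Matching on the exact `PID/haproxy` column avoids counting lines where the PID digits merely appear in a port number.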
Apply the config change to all non-production environments (in this case, only `gstg` and `pre`):
- Merge the `chef-repo` merge request. Its CI pipeline will automatically publish to Chef server the new pinned version numbers for the updated cookbook. Separate pipeline jobs handle non-production (`gstg`, `pre`) vs. production (`gprd`, `dr`) environments, and the job for the production environments will wait for manual confirmation. Wait until the scheduled change window to let the pipeline apply the version bump to the production environments.
- Set the feature flag to true in the other non-prod environments' chef-repo roles: `{gstg,pre}-base-lb`. Submit this as a merge request, and use the standard chef-repo pipeline to apply it.
- Review Grafana dashboards (although they may have no data for non-production environments): HAProxy and HAProxy Process Overview
Manually kill all residual haproxy processes in non-production environments:
Note: The new setting in haproxy.cfg will handle this for future maintenance, but the existing residual processes must be manually killed since they are not using that config setting.
- Find all Chef nodes running the `gitlab-haproxy` cookbook. They all use role `<env>-base-lb`. `for GENV in "gstg" "pre" ; do knife search "roles:${GENV}-base-lb" -a fqdn 2> /dev/null | awk '/fqdn:/ { print $2 }' | sort > ./host_list.$GENV ; done`
- For each host in `gstg`, run the clean-up script: `mussh -H ./host_list.gstg -C ./kill_residual_haproxy_processes.sh`
- For each host in `pre`, run the clean-up script: `mussh -H ./host_list.pre -C ./kill_residual_haproxy_processes.sh`
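The host-list step above relies on `awk` picking the `fqdn:` lines out of knife's output. A minimal sketch of that parsing, run against made-up sample output so it can be tried without a Chef server, might look like this. The hostnames and the `extract_fqdns` and `sample_knife_output` names are hypothetical.

```shell
#!/usr/bin/env bash
# Extract FQDNs from `knife search ... -a fqdn` output read on stdin.
# sort -u also drops duplicate entries, which plain sort would keep.
extract_fqdns() {
  awk '/fqdn:/ { print $2 }' | sort -u
}

# Illustrative sample of the output shape (hostnames are made up).
sample_knife_output() {
  cat <<'EOF'
2 items found

fe-02-lb-gstg.example.internal:
  fqdn: fe-02-lb-gstg.example.internal

fe-01-lb-gstg.example.internal:
  fqdn: fe-01-lb-gstg.example.internal
EOF
}

sample_knife_output | extract_fqdns
```

Because the real command discards knife's stderr, a failed search produces an empty host list rather than an error. Checking that the file is non-empty (for example `[ -s ./host_list.gstg ]`) before running the clean-up script is a cheap safeguard.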
Production apply steps:
- Run chef-client in the production environments (`gprd` and `dr`) by clicking the manual "apply_to_prod" step in the pipeline for the `chef-repo` merge request that ran after merging that MR.
- Set the feature flag to true in the other prod environments: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1875
  - Reverted due to: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7535
  - https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1877
Manually kill all residual haproxy processes in production environments:
Note: The new setting in haproxy.cfg will handle this for future maintenance, but the existing residual processes must be manually killed since they are not using that config setting.
- Find all Chef nodes running the `gitlab-haproxy` cookbook. They all use role `<env>-base-lb`. `for GENV in "gprd" "dr" ; do knife search "roles:${GENV}-base-lb" -a fqdn 2> /dev/null | awk '/fqdn:/ { print $2 }' | sort > ./host_list.$GENV ; done`
- For each host in `gprd`, run the clean-up script: `mussh -H ./host_list.gprd -C ./kill_residual_haproxy_processes.sh`
- For each host in `dr`, run the clean-up script: `mussh -H ./host_list.dr -C ./kill_residual_haproxy_processes.sh`
Rollback steps:
- Revert the `chef-repo` merge request to roll back the cookbook version number pinned in each environment, as documented in our chef runbook.
- Expedite the rollback by manually running Chef client on all affected hosts: `knife ssh 'roles:gprd-base-lb' 'sudo chef-client'`
Clean-up script:
This script is meant to be run locally on each haproxy host to find and kill any residual haproxy processes. It should leave exactly 2 processes running `/usr/sbin/haproxy`: one owned by user `root` and the other owned by user `haproxy`.
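The expected end state (exactly one `root`-owned and one `haproxy`-owned process) can be checked mechanically after the clean-up. The sketch below is an assumption-laden helper, not part of the clean-up script: `expect_one_root_one_haproxy` is a hypothetical name, and it expects one process owner per line on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical post-clean-up check.
# Reads one process owner per line on stdin and succeeds only if there is
# exactly one "root" entry and exactly one "haproxy" entry, and nothing else.
expect_one_root_one_haproxy() {
  awk '
    NF { count[$1]++; total++ }
    END { exit !(total == 2 && count["root"] == 1 && count["haproxy"] == 1) }
  '
}
```

On a host, the owners could be fed in with something like `pgrep -f "/usr/sbin/haproxy" | xargs -r ps -o user= | expect_one_root_one_haproxy && echo "clean"`. Note that `pgrep -f` matches on the full command line, so any unrelated process whose arguments contain that path would make the check fail.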
Clean-up script `kill_residual_haproxy_processes.sh`:
Tested via DRYRUN mode on all flavors of haproxy recipe.
#!/usr/bin/env bash
# Find PID of the youngest "haproxy" process. It should be the active listener.
ACTIVE_LISTENER_PID=$( pgrep -u haproxy -f "/usr/sbin/haproxy" | xargs -r ps -o pid= --sort start_time | tail -n1 )
# Confirm it is the active listener bound to at least one TCP port.
# Use a brace group rather than a subshell so that "exit 1" stops the whole script.
sudo netstat -ltpn 2> /dev/null | grep -q -w "${ACTIVE_LISTENER_PID}/haproxy" || { echo "Aborting! Youngest process is not listening." ; exit 1 ; }
# Kill any residual processes.
for RESIDUAL_PID in $( pgrep -u haproxy -f "/usr/sbin/haproxy" | xargs -r ps -o pid= --sort start_time | head -n-1 )
do
    NUM_CONNECTIONS=$( sudo netstat -atpn | grep -c -w "${RESIDUAL_PID}/haproxy" )
    if [[ -n "$DRYRUN" ]] ; then
        echo "DRY RUN: Would kill residual haproxy PID $RESIDUAL_PID ($NUM_CONNECTIONS connections)"
    else
        echo "Killing residual haproxy PID $RESIDUAL_PID ($NUM_CONNECTIONS connections)"
        sudo kill "$RESIDUAL_PID"
    fi
done
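The script's selection logic depends on the PID list being sorted oldest to youngest: `tail -n1` picks the youngest process to keep, and `head -n-1` (a GNU coreutils form) yields everything older. A small sketch with made-up PIDs shows the split in isolation; the PID values and variable names are illustrative only.

```shell
#!/usr/bin/env bash
# PIDs ordered oldest to youngest, as `ps --sort start_time` would list them.
# These values are made up for illustration.
SORTED_PIDS=$(printf '%s\n' 1201 1457 1893)

# Youngest process: treated as the active listener and kept.
ACTIVE=$(printf '%s\n' "$SORTED_PIDS" | tail -n1)

# Everything except the last line: the residual processes to kill.
RESIDUAL=$(printf '%s\n' "$SORTED_PIDS" | head -n-1)

echo "active: $ACTIVE"
# Unquoted on purpose so the PIDs are joined onto one line.
echo "residual:" $RESIDUAL
```

To preview the real script's actions on a host without sending any signals, run it with `DRYRUN` set to any non-empty value, for example `DRYRUN=1 ./kill_residual_haproxy_processes.sh`.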