Add server state persistence to preserve maint status across reloads
Problem
When HAProxy reloads, it loses all runtime state including server maintenance mode. This causes canary servers that were drained (put into maintenance) to become undrained, sending unexpected traffic to canary.
Related incidents:
- https://app.incident.io/gitlab/incidents/4457
- https://app.incident.io/gitlab/incidents/4485
- gitlab-com/gl-infra/production-engineering#12421 (comment 2405314246)
Solution
Uses HAProxy's built-in server-state-file and load-server-state-from-file to save and restore server state on reload. A systemd override dumps state via show servers state before each reload.
This preserves:
- Drain/maintenance mode (MAINT/FDRAIN)
- Health check status (UP/DOWN)
- Server weights
Previous issue with stale IPs
This feature was previously removed due to gitlab-com/gl-infra/production-engineering#12421 (closed) - state file IPs were restored even when config IPs changed, causing 503s.
Fix: Each server now has init-addr <config-ip>, forcing HAProxy to always use the config IP rather than the state file IP.
Configuration
Disabled by default. To enable:
node['gitlab-haproxy']['server_state_persistence']['enable'] = true