Add server state persistence to preserve maint status across reloads

Problem

When HAProxy reloads, it loses all runtime state including server maintenance mode. This causes canary servers that were drained (put into maintenance) to become undrained, sending unexpected traffic to canary.

Related incidents:

Solution

Uses HAProxy's built-in server-state-file and load-server-state-from-file to save and restore server state on reload. A systemd override dumps state via show servers state before each reload.

This preserves:

  • Drain/maintenance mode (MAINT/FDRAIN)
  • Health check status (UP/DOWN)
  • Server weights

Previous issue with stale IPs

This feature was previously removed due to gitlab-com/gl-infra/production-engineering#12421 (closed) - state file IPs were restored even when config IPs changed, causing 503s.

Fix: Each server now has init-addr <config-ip>, forcing HAProxy to always use the config IP rather than the state file IP.

Configuration

Disabled by default. To enable:

node['gitlab-haproxy']['server_state_persistence']['enable'] = true

References

Edited by Igor

Merge request reports

Loading