2022-04-06: HAProxy zone sysctl network test us-east1-b

Production Change

Change Summary

Test increasing TCP connection limits in HAProxy nodes to handle bursts of traffic (e.g. zone failover).

All HAProxy nodes in us-east1-b.

Change Details

Services Impacted - Service::HAProxy
Change Technician - @f_santos
Change Reviewer - @pguinoiseau
Time tracking - unknown
Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) -

Set label change::in-progress on this issue

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 120

Disable chef-client: knife ssh 'roles:gprd-base-lb AND gce_instance_zone:projects/805818759045/zones/us-east1-b' 'chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6768'
Set sysctl values:

knife ssh 'roles:gprd-base-lb AND gce_instance_zone:projects/805818759045/zones/us-east1-b' 'sysctl net.core.default_qdisc=fq_codel; sysctl net.core.rmem_max=16777216; sysctl net.core.rmem_default=524288; sysctl net.core.wmem_max=16777216; sysctl net.core.wmem_default=524288; sysctl net.core.optmem_max=16777216; sysctl net.core.somaxconn=32768; sysctl net.core.netdev_max_backlog=131072; sysctl net.ipv4.tcp_congestion_control=bbr; sysctl net.ipv4.tcp_max_syn_backlog=32768; sysctl net.ipv4.tcp_notsent_lowat=16384; sysctl net.ipv4.tcp_rmem="4096 131072 16777216"; sysctl net.ipv4.tcp_wmem="4096 65536 16777216"; sysctl net.ipv4.tcp_fastopen=3; sysctl net.ipv4.tcp_slow_start_after_idle=0; sysctl net.ipv4.ip_local_port_range="8193 60999";'

Observe metrics on the dashboard; values to look for:
- Latency
- Error increase
- Number of connections
- Incoming/outgoing bytes
Run for a few hours
Enable chef-client: chef-client-enable
Set label change::complete on this issue
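After the sysctl step, it may help to confirm on a node that the kernel actually accepted the new values before the observation period starts. The following is a minimal sketch, not part of the original plan: `check_sysctl`, `normalize`, and the `SYSCTL_BIN` override are hypothetical names introduced here for illustration, and the expected values are the ones listed in the change steps above.

```shell
#!/usr/bin/env bash
# Sketch: compare an expected sysctl value with what the node reports.
# SYSCTL_BIN is a hypothetical override so the check can be dry-run with a stub.
SYSCTL_BIN="${SYSCTL_BIN:-sysctl}"

# Collapse whitespace so tab-separated kernel output such as
# "4096<TAB>131072<TAB>16777216" compares equal to "4096 131072 16777216".
normalize() {
  tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Usage: check_sysctl KEY EXPECTED
# Prints OK or DIFF for the key and returns non-zero on a mismatch.
check_sysctl() {
  local key="$1" expected="$2" actual
  actual="$("$SYSCTL_BIN" -n "$key" 2>/dev/null | normalize)"
  if [ "$actual" = "$expected" ]; then
    echo "OK   $key = $actual"
  else
    echo "DIFF $key = $actual (expected $expected)"
    return 1
  fi
}

# Example calls, using values from the change steps:
#   check_sysctl net.core.somaxconn 32768
#   check_sysctl net.ipv4.tcp_rmem "4096 131072 16777216"
```

A DIFF line for `net.ipv4.tcp_congestion_control` would usually mean the requested algorithm is not available on that kernel, which is worth checking before reading anything into the metrics.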

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) -

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 1

Set previous values

sysctl net.core.default_qdisc=fq_codel;
sysctl net.core.rmem_max=524287;
sysctl net.core.rmem_default=524287;
sysctl net.core.wmem_max=524287;
sysctl net.core.wmem_default=524287;
sysctl net.core.optmem_max=524287;
sysctl net.core.somaxconn=1024;
sysctl net.core.netdev_max_backlog=300000;
sysctl net.ipv4.tcp_congestion_control=cubic;
sysctl net.ipv4.tcp_max_syn_backlog=2048;
sysctl net.ipv4.tcp_notsent_lowat=4294967295;
sysctl net.ipv4.tcp_rmem="4096 131072 6291456";
sysctl net.ipv4.tcp_wmem="4096 16384 4194304";
sysctl net.ipv4.tcp_fastopen=1;
sysctl net.ipv4.tcp_slow_start_after_idle=1;
sysctl net.ipv4.ip_local_port_range="32768 60999";
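The rollback values above are written by hand, so one possible safeguard is to snapshot what each node actually reports before the change and restore from that file. This is a sketch under assumptions: `snapshot_sysctl`, `SYSCTL_BIN`, and the `/tmp/sysctl-before.conf` path are hypothetical and not taken from the original plan. The output uses the `key = value` format that `sysctl -p FILE` can load.

```shell
#!/usr/bin/env bash
# Sketch: record the current values of the keys this change touches.
# SYSCTL_BIN is a hypothetical override so the function can be dry-run with a stub.
SYSCTL_BIN="${SYSCTL_BIN:-sysctl}"

# The 16 keys modified in the change steps.
KEYS="net.core.default_qdisc net.core.rmem_max net.core.rmem_default
net.core.wmem_max net.core.wmem_default net.core.optmem_max
net.core.somaxconn net.core.netdev_max_backlog
net.ipv4.tcp_congestion_control net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_notsent_lowat net.ipv4.tcp_rmem net.ipv4.tcp_wmem
net.ipv4.tcp_fastopen net.ipv4.tcp_slow_start_after_idle
net.ipv4.ip_local_port_range"

# Usage: snapshot_sysctl OUTPUT_FILE
# Writes one "key = value" line per key; stops if a key cannot be read.
snapshot_sysctl() {
  local out="$1" key value
  : > "$out"
  for key in $KEYS; do
    value="$("$SYSCTL_BIN" -n "$key")" || return 1
    printf '%s = %s\n' "$key" "$value" >> "$out"
  done
}

# Before the change (hypothetical path):
#   snapshot_sysctl /tmp/sysctl-before.conf
# To roll back, as root:
#   sysctl -p /tmp/sysctl-before.conf
```

Comparing the snapshot with the list above before starting would also show whether the hand-written previous values match what the nodes really had.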

Monitoring

Key metrics to observe

Metric: latency and 5xx errors
- Location: dashboard
- What changes to this metric should prompt a rollback: Increase in errors/latency
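Because the change raises `net.core.somaxconn` and `net.ipv4.tcp_max_syn_backlog`, a node-level spot check of the kernel's listen-queue counters may complement the dashboard. The sketch below is an assumption-labelled addition, not part of the original plan: `tcpext_counter` is a hypothetical helper that reads the `TcpExt` section of `/proc/net/netstat`, where Linux exposes `ListenOverflows` and `ListenDrops`.

```shell
#!/usr/bin/env bash
# Sketch: print one TcpExt counter from /proc/net/netstat.
# The file holds pairs of lines per section: a header line with counter
# names, then a line with the values in the same column order.
# Usage: tcpext_counter NAME [FILE]
tcpext_counter() {
  local name="$1" file="${2:-/proc/net/netstat}"
  awk -v name="$name" '
    # First TcpExt line: find the column holding the requested counter.
    $1 == "TcpExt:" && !seen {
      for (i = 2; i <= NF; i++) if ($i == name) col = i
      seen = 1
      next
    }
    # Second TcpExt line: print the value in that column, if it was found.
    $1 == "TcpExt:" && seen {
      if (col) print $col
      exit
    }
  ' "$file"
}

# Example calls on a node:
#   tcpext_counter ListenOverflows
#   tcpext_counter ListenDrops
```

These counters are cumulative since boot, so the useful signal is whether they keep growing during a traffic burst, compared between a tuned node and an untuned one.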

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Change Reviewer checklist

C4 C3 C2 C1:

The scheduled day and time of execution of the change is appropriate.
The change plan is technically accurate.
The change plan includes estimated timing values based on previous testing.
The change plan includes a viable rollback plan.
The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
The change plan includes success measures for all steps/milestones during the execution.
The change adequately minimizes risk within the environment/service.
The performance implications of executing the change are well-understood and documented.
The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
The change has a primary and secondary SRE with knowledge of the details available during the change window.

Change Technician checklist

Edited Apr 06, 2022 by Filipe Santos