2022-04-06: HAProxy zone sysctl network test us-east1-b
Production Change
Change Summary
Test increasing TCP connection limits in HAProxy nodes to handle bursts of traffic (e.g. zone failover).
The change applies to all HAProxy nodes in us-east1-b.
Change Details
- Services Impacted - Service::HAProxy
- Change Technician - @f_santos
- Change Reviewer - @pguinoiseau
- Time tracking - unknown
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) -
- Set label change::in-progress on this issue
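Before any values are changed, it may help to record the current settings on one node so the rollback list can be checked against what is actually live. The following is a minimal sketch, assuming a POSIX shell with `sysctl` available; `read_sysctl` and `snapshot_sysctls` are hypothetical helper names, not existing tooling.

```shell
# Keys touched by this change (copied from the change steps).
SYSCTL_KEYS="net.core.default_qdisc net.core.rmem_max net.core.rmem_default
net.core.wmem_max net.core.wmem_default net.core.optmem_max net.core.somaxconn
net.core.netdev_max_backlog net.ipv4.tcp_congestion_control
net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_notsent_lowat net.ipv4.tcp_rmem
net.ipv4.tcp_wmem net.ipv4.tcp_fastopen net.ipv4.tcp_slow_start_after_idle
net.ipv4.ip_local_port_range"

# Hypothetical hook: read one live value. Override it for a dry run.
read_sysctl() { sysctl -n "$1" 2>/dev/null; }

# Print key=value lines, collapsing the tabs sysctl uses for multi-value keys.
snapshot_sysctls() {
  for key in $SYSCTL_KEYS; do
    value=$(read_sysctl "$key" | tr -s '[:space:]' ' ' | sed 's/ $//')
    printf '%s=%s\n' "$key" "$value"
  done
}

# Example (run on a node): snapshot_sysctls > sysctl-before.conf
```

The output file name `sysctl-before.conf` is only an illustration; any location that survives the change window would do.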
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 120m
- Disable chef-client: knife ssh 'roles:gprd-base-lb AND gce_instance_zone:projects/805818759045/zones/us-east1-b' 'chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6768'
- Set sysctl values (the whole command list is quoted so that every sysctl runs on the remote nodes, not in the local shell):
  knife ssh 'roles:gprd-base-lb AND gce_instance_zone:projects/805818759045/zones/us-east1-b' 'sysctl net.core.default_qdisc=fq_codel; sysctl net.core.rmem_max=16777216; sysctl net.core.rmem_default=524288; sysctl net.core.wmem_max=16777216; sysctl net.core.wmem_default=524288; sysctl net.core.optmem_max=16777216; sysctl net.core.somaxconn=32768; sysctl net.core.netdev_max_backlog=131072; sysctl net.ipv4.tcp_congestion_control=bbr; sysctl net.ipv4.tcp_max_syn_backlog=32768; sysctl net.ipv4.tcp_notsent_lowat=16384; sysctl net.ipv4.tcp_rmem="4096 131072 16777216"; sysctl net.ipv4.tcp_wmem="4096 65536 16777216"; sysctl net.ipv4.tcp_fastopen=3; sysctl net.ipv4.tcp_slow_start_after_idle=0; sysctl net.ipv4.ip_local_port_range="8193 60999"'
- Observe metrics on the dashboard; values to look for:
  - Latency
  - Error increase
  - Number of connections
  - Incoming/outgoing bytes
- Run for a few hours
- Enable chef-client on the same nodes: chef-client-enable
- Set label change::complete on this issue
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) -
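No verification steps are listed here. One possible check, sketched under assumptions, is to compare the live values on a node against the intended ones. The helper names (`normalize`, `live_value`, `check_sysctls`) and the input file `expected.conf` are hypothetical; the input is assumed to be unquoted `key=value` lines on stdin.

```shell
# Collapse whitespace: sysctl prints multi-value keys tab-separated.
normalize() {
  printf '%s' "$1" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# Hypothetical hook: read one live value. Override it for a dry run.
live_value() { sysctl -n "$1" 2>/dev/null; }

# Read key=value lines from stdin; print a line per mismatch.
# Returns 0 when everything matches, 1 otherwise.
check_sysctls() {
  rc=0
  while IFS='=' read -r key expected; do
    [ -z "$key" ] && continue
    actual=$(normalize "$(live_value "$key")")
    expected=$(normalize "$expected")
    if [ "$actual" != "$expected" ]; then
      echo "MISMATCH $key: expected $expected, got $actual"
      rc=1
    fi
  done
  return "$rc"
}

# Example (run on a node): check_sysctls < expected.conf
```

An empty output with exit status 0 would indicate that every listed key holds the expected value on that node.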
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1m
- Set previous values:
sysctl net.core.default_qdisc=fq_codel;
sysctl net.core.rmem_max=524287;
sysctl net.core.rmem_default=524287;
sysctl net.core.wmem_max=524287;
sysctl net.core.wmem_default=524287;
sysctl net.core.optmem_max=524287;
sysctl net.core.somaxconn=1024;
sysctl net.core.netdev_max_backlog=300000;
sysctl net.ipv4.tcp_congestion_control=cubic;
sysctl net.ipv4.tcp_max_syn_backlog=2048;
sysctl net.ipv4.tcp_notsent_lowat=4294967295;
sysctl net.ipv4.tcp_rmem="4096 131072 6291456";
sysctl net.ipv4.tcp_wmem="4096 16384 4194304";
sysctl net.ipv4.tcp_fastopen=1;
sysctl net.ipv4.tcp_slow_start_after_idle=1;
sysctl net.ipv4.ip_local_port_range="32768 60999";
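With a 1-minute rollback estimate, it may be convenient to send all previous values to the nodes as one quoted remote command rather than sixteen separate ones. The sketch below assumes the values are kept as unquoted `key=value` lines in a file (`rollback.conf` is a hypothetical name) and relies on `sysctl -w` accepting several `key=value` arguments; `build_sysctl_cmd` is an illustrative helper, and whether elevated privileges are needed on the nodes is not covered here.

```shell
# Build a single "sysctl -w 'k=v' 'k2=v2' ..." command from stdin lines.
# Each pair is wrapped in single quotes so multi-value settings such as
# tcp_rmem survive as one argument on the remote side.
build_sysctl_cmd() {
  cmd="sysctl -w"
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    cmd="$cmd '$line'"
  done
  printf '%s\n' "$cmd"
}

# Example, using the node query from the change steps:
# QUERY='roles:gprd-base-lb AND gce_instance_zone:projects/805818759045/zones/us-east1-b'
# knife ssh "$QUERY" "$(build_sysctl_cmd < rollback.conf)"
```

Passing the generated string as a single double-quoted argument keeps the whole command on the remote side.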
Monitoring
Key metrics to observe
- Metric: latency and 5xx errors
- Location: dashboard
- What changes to this metric should prompt a rollback: Increase in errors/latency
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.
Edited by Filipe Santos