Enforce RackAttack rate limiting on gitlab.com
Production Change
Change Summary
As one of the final major steps in &148 (and particularly the sub-epic &379 (closed)), take RackAttack rate-limiting on gitlab.com out of Dry Run mode and into Enforcing mode, with some related final adjustments
Change Details
- Services Impacted - Service::GitLab Rails, Service::Web, Service::API
- Change Technician - @cmiskell
- Change Criticality - C2 - while the mechanics are fairly minor and we have done a lot of prep work to ensure it has no adverse effect, this is still a substantial enough change that we blogged about it, so I'm bumping it to C2.
- Change Type - changescheduled
- Change Reviewer - @craig
- Due Date - 2021-01-18 00:30 UTC
- Time tracking - 1.5 hours
- Downtime Component - None
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 minutes
- Notify #support_gitlab-com in Slack that this change is in progress.
- Adjust the haproxy API rate-limit per scalability#732:
  - Undraft, (approve), and merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4842, and run the apply_to_production job when available
  - Run chef on the front-end haproxy nodes to adjust the limit: `knife ssh -C3 'roles:gprd-base-lb-fe' "sudo chef-client"` (see the spot-check sketch below)
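As a quick confirmation that chef converged the new limit, a spot-check over the same node set can help. This is a minimal sketch, assuming the haproxy configuration lives under /etc/haproxy/ and the limit is expressed via an http_req_rate condition; adjust the grep to whatever directive MR 4842 actually changes.

```bash
# Spot-check sketch: confirm the front-end haproxy fleet picked up the new API rate limit.
# Assumptions: config under /etc/haproxy/ and a limit expressed via http_req_rate;
# match this against the directive actually touched by the MR.
knife ssh -C3 'roles:gprd-base-lb-fe' \
  "sudo grep -Rn 'http_req_rate' /etc/haproxy/ | head -5"
```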
- Remove the dry-run flags so that RackAttack is in enforcing mode. In parallel:
  - K8s:
    - Undraft and merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!636 (merged), and let it deploy (see the rollout check below)
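Once the MR merges, the rollout can be watched from a workstation with production cluster access. A minimal sketch; the namespace and deployment name below are assumptions and should be replaced with the real webservice workload names.

```bash
# Rollout check sketch: watch the webservice pods cycle after gitlab-com!636 deploys.
# The namespace and deployment name are assumptions - substitute the actual workload names.
kubectl -n gitlab get deployments | grep webservice
kubectl -n gitlab rollout status deployment/gitlab-webservice-default --timeout=15m
```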
  - VMs:
    - Stop chef on all front-end nodes: `knife ssh 'roles:gprd-base-fe-web OR roles:gprd-base-fe-api' "sudo systemctl stop chef-client"`
    - Undraft, (approve), and merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4843, and run the apply_to_production job when available
    - `/chatops run deploycmd chefclient role_gprd_base_fe_api --skip-haproxy --no-check`
    - `/chatops run deploycmd chefclient role_gprd_base_fe_web --skip-haproxy --no-check` (see the verification sketch after this list)
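After the chatops chef runs complete, it is worth confirming that the dry-run flag is gone from the rendered Rails environment and checking the chef-client service state. A sketch, assuming the flag is the GITLAB_THROTTLE_DRY_RUN environment variable and that omnibus renders env vars as files under /opt/gitlab/etc/gitlab-rails/env/; adjust to match what MR 4843 actually removes.

```bash
# Verification sketch: check chef-client state and confirm the dry-run env var is no longer rendered.
# Assumptions: the flag is GITLAB_THROTTLE_DRY_RUN and omnibus writes env vars as files under
# /opt/gitlab/etc/gitlab-rails/env/ - adjust to match the attribute removed by the MR.
knife ssh 'roles:gprd-base-fe-web OR roles:gprd-base-fe-api' \
  "sudo systemctl is-active chef-client; sudo ls /opt/gitlab/etc/gitlab-rails/env/ | grep -c GITLAB_THROTTLE_DRY_RUN || true"
```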
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 15 minutes
- Look for Rack Attack logs that are not blocklist (an internal, slightly unusual usage of RackAttack that is already operating), not track (dry-run mode), and not from the Protected Paths matcher: https://log.gprd.gitlab.net/goto/c8f8d8b08e191a8cd735ae0003697485 (an optional client-side spot-check is sketched after this list)
  - We expect to see such logs with json.env=throttle, where they were track before. They are regular; we expect to see thousands per hour even at quiet times.
  - Check that none are left in track mode; any that remain may indicate Rails processes that have not yet restarted without the dry-run environment variable.
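As an optional, lightweight complement to the Kibana check, a single API request from a known test client can be inspected for rate-limit accounting headers. This is only a sketch: TEST_TOKEN is a placeholder, and the presence and exact names of the RateLimit-*/Retry-After headers on non-throttled responses is an assumption; the log query above remains the authoritative check.

```bash
# Optional client-side spot-check: inspect response headers for rate-limit accounting.
# TEST_TOKEN is a placeholder for a test account's token; header names are assumptions.
curl -s -o /dev/null -D - \
  --header "PRIVATE-TOKEN: ${TEST_TOKEN}" \
  "https://gitlab.com/api/v4/user" | grep -iE '^(ratelimit|retry-after)'
```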
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 15 minutes
If RackAttack itself is causing urgent problems, the quickest rollback is to turn it off entirely, using the 3 checkboxes at https://gitlab.com/admin/application_settings/network under 'User and IP Rate Limits' (leave Protected Paths alone). This is roughly equivalent to the Dry Run mode in place before this change, but without the logging or even the checking, so it actually reduces load on Redis in particular (by a noticeable percentage) and also on the Rails nodes. We could then carefully re-enable Dry Run mode before re-enabling the checkboxes. An API-based alternative is sketched below.
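If the admin UI is unavailable or slow, the same three settings can in principle be flipped through the application settings API instead. A sketch, assuming an admin token (ADMIN_TOKEN is a placeholder) and the throttle setting names from the public API documentation:

```bash
# API-based alternative to the admin UI checkboxes: disable the User and IP rate limits
# (Protected Paths untouched). ADMIN_TOKEN is a placeholder; setting names per the public
# application settings API documentation.
curl -s --request PUT \
  --header "PRIVATE-TOKEN: ${ADMIN_TOKEN}" \
  "https://gitlab.com/api/v4/application/settings" \
  --data "throttle_unauthenticated_enabled=false" \
  --data "throttle_authenticated_web_enabled=false" \
  --data "throttle_authenticated_api_enabled=false"
```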
Otherwise, the haproxy and Rack Attack changes can be treated as loosely independent for rollback purposes, i.e. it might be reasonable to roll back one but not the other, depending on the exact circumstances. It is somewhat unlikely that we would keep the haproxy change while turning off RackAttack, but the other way round is plausible, and rolling back both is definitely fine.
Rollback follows the same procedure, but with reverting MRs (the changes are all one-liners). In an urgent situation, editing chef roles by hand (knife role edit) may be acceptable, followed by MRs to formalize the change.
It is also possible to add IP addresses or users to bypass the limits (gitlab-haproxy.frontend.whitelist.api for IPs and omnibus-gitlab.gitlab_rb.gitlab-rails.env.GITLAB_THROTTLE_USER_ALLOWLIST for users); we would normally want careful justification for doing so, but in the first flush of enabling enforcement it might be acceptable to add one or two entries if that lets us keep enforcing for everyone else. Know that this is possible, but hold it in reserve for emergencies (a minimal inspection sketch follows).
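For reference only, inspecting the current bypass lists before touching them might look roughly like this; the role names are the ones used elsewhere in this plan, but the exact attribute layout should be confirmed against the real role JSON before any edit.

```bash
# Emergency-only sketch: inspect the existing bypass lists before considering an addition.
# Role names follow the roles used elsewhere in this plan; confirm the attribute paths
# against the actual role JSON before editing anything.
knife role show gprd-base-lb-fe -F json | grep -A5 'whitelist'
knife role show gprd-base-fe-api -F json | grep -A2 'GITLAB_THROTTLE_USER_ALLOWLIST'

# If an addition is truly unavoidable, edit by hand and follow up with an MR to formalize it.
knife role edit gprd-base-fe-api
```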
Monitoring
Key metrics to observe
- Metric: Rails web HTTP 429 responses
  - Location: Rack::Attack Rollout Dashboard
  - What changes to this metric should prompt a rollback: More than 5% of HTTP responses resulting in 429, with no obvious justification (a hedged ad-hoc query is sketched after this list)
- Metric: RackAttack enforcement logs
  - Location: https://log.gprd.gitlab.net/goto/c8f8d8b08e191a8cd735ae0003697485
  - What changes to this metric should prompt a rollback: Massive numbers of throttling events, in obvious excess of expectations. NB: a lack of any throttle logs suggests the change has not applied; that is worth investigating, but not a reason to roll back immediately.
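For an ad-hoc look at the 429 ratio outside the dashboard, something like the following could be run against Prometheus. The endpoint, metric name, and labels are all assumptions; copy the exact expression from the dashboard panel rather than trusting this sketch.

```bash
# Ad-hoc 429 ratio sketch. The Prometheus endpoint, metric name, and labels are assumptions;
# the dashboard panel's own expression is authoritative.
PROM="https://prometheus.gprd.example.internal"   # placeholder endpoint
QUERY='sum(rate(http_requests_total{env="gprd",status="429"}[5m]))
       / sum(rate(http_requests_total{env="gprd"}[5m]))'
curl -sG "${PROM}/api/v1/query" --data-urlencode "query=${QUERY}"
```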
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? None
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.