Enable RackAttack rate-limiting in dry-run mode
Production Change
Change Summary
Per &341 (closed), and specifically scalability#656, we want to enable RackAttack on gitlab.com. The first step is to enable it in dry-run mode for the new throttles, so we can evaluate their accuracy and potential effectiveness before we affect users for real; this change issue does exactly that.
Initial numbers are derived in scalability#656 (comment 445997862) but are subject to change during and after this issue.
Change Details
- Services Impacted - GitLab Rails
- Change Technician - @cmiskell
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @msmiley
- Due Date - 2020-11-20 01:10 UTC
- Time tracking - 3.25 hrs
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 0
- scalability#629 has been completed and the relevant code deployed
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Staging: 30 minutes, Production: 1.5 hours
Staging
- Stop chef on all front-end nodes: `knife ssh 'roles:gstg-base-fe-web OR roles:gstg-base-fe-api' "sudo systemctl stop chef-client"`
- Undraft and merge the Kubernetes MR (gitlab-com/gl-infra/k8s-workloads/gitlab-com!522 (merged)) and let it deploy.
- Undraft and merge the Chef MR (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4594) and wait for the apply_to_staging job to run.
- Run chef on the frontend web + API VMs using chatops (NB: cannot use gstg_base_fe because it contains `git` nodes; chef is intentionally disabled there):
  - `/chatops run deploycmd chefclient role_gstg_base_fe_api --no-check`
  - `/chatops run deploycmd chefclient role_gstg_base_fe_web --no-check`
- At https://staging.gitlab.com/admin/application_settings/network, under "User and IP Rate Limits" (an equivalent API sketch follows this list):
  - Enable unauthenticated request rate limit @ 600 requests per period, with a period of 60 seconds. This is lower than production, but also lower than the rate of automated unauthenticated `git` and `web` traffic staging gets, which will trigger logs from both.
  - Enable authenticated API request rate limit @ 1000 requests per period, with a period of 60 seconds.
  - Enable authenticated web request rate limit @ 2000 requests per period, with a period of 60 seconds. These two are much higher than we normally see in staging, so we can verify that this does not cause logging in the happy case.
- Monitor per post-change steps.
- Adjust the threshold for unauthenticated requests to 1500 requests per period, to eliminate unnecessary logging from normal traffic.
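For reference only, the same staging values can be applied without the admin UI via the application settings REST API. This is a sketch, not a step in this change: it assumes an admin personal access token exported as `GITLAB_ADMIN_TOKEN` (a name chosen here for illustration) and uses the documented `throttle_*` attribute names from the Application Settings API.

```shell
# Sketch only: set the staging throttle values via PUT /api/v4/application/settings.
# Assumes an admin PAT in $GITLAB_ADMIN_TOKEN; the admin UI remains the path used
# in the change steps above.
curl --request PUT \
  --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
  "https://staging.gitlab.com/api/v4/application/settings" \
  --data "throttle_unauthenticated_enabled=true" \
  --data "throttle_unauthenticated_requests_per_period=600" \
  --data "throttle_unauthenticated_period_in_seconds=60" \
  --data "throttle_authenticated_api_enabled=true" \
  --data "throttle_authenticated_api_requests_per_period=1000" \
  --data "throttle_authenticated_api_period_in_seconds=60" \
  --data "throttle_authenticated_web_enabled=true" \
  --data "throttle_authenticated_web_requests_per_period=2000" \
  --data "throttle_authenticated_web_period_in_seconds=60"
```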
Production
- Stop chef on all front-end nodes: `knife ssh 'roles:gprd-base-fe-web OR roles:gprd-base-fe-api' "sudo systemctl stop chef-client"`
- Undraft and merge the Kubernetes MR (gitlab-com/gl-infra/k8s-workloads/gitlab-com!523 (merged)) and let it deploy.
- Undraft and merge the Chef MR (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4595) and run the apply_to_production job when ready.
- Run chef on the frontend web + API VMs using chatops (in parallel for slight efficiencies and better log retention on the CI job, given the volume of output):
  - `/chatops run deploycmd chefclient role_gprd_base_fe_api --skip-haproxy --no-check`
  - `/chatops run deploycmd chefclient role_gprd_base_fe_web --skip-haproxy --no-check`
  - `/chatops run deploycmd chefclient role_gprd_base_fe_git --no-check`
- At https://gitlab.com/admin/application_settings/network, under "User and IP Rate Limits" (before enabling these, consider the dry-run verification sketch after this list):
  - Enable unauthenticated request rate limit @ 1500 requests per period, with a period of 60 seconds.
  - Enable authenticated API request rate limit @ 1000 requests per period, with a period of 60 seconds.
  - Enable authenticated web request rate limit @ 2000 requests per period, with a period of 60 seconds.
- Monitor per post-change steps.
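Before ticking the production checkboxes, it is worth a quick check that the dry-run configuration deployed by the k8s/Chef MRs is actually visible to the Rails processes (the same check works on staging). A minimal sketch, run on any front-end VM, assuming the dry-run behaviour is controlled by the `GITLAB_THROTTLE_DRY_RUN` environment variable (the variable name is an assumption of this sketch; the MRs are authoritative):

```shell
# Minimal sanity check on a front-end VM: print the dry-run variable as the
# Rails environment sees it. A nil/empty result would suggest the new throttles
# will block rather than merely log, which must be fixed before enabling them.
sudo gitlab-rails runner 'puts ENV["GITLAB_THROTTLE_DRY_RUN"].inspect'
```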
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Staging: 10 minutes, Production: 1 hour
Staging
- https://nonprod-log.gitlab.net/goto/11de98b1b6840a1792e24875a03daf06 - we expect to see a few hundred logs a minute from two IP addresses at a time (one for `web`, one for `git`). If this traffic ever needs to be reproduced by hand, see the sketch after this list.
  - The `web` logs should occur from about 45 seconds in, until the top of the minute (roughly 800/minute total traffic).
  - The `git` logs should occur from about 30 seconds in, until the top of the minute (roughly 1200/minute total traffic).
  - They must have the 'dry run' indicator in the logs (`env` field set to `track`, not `throttle`). If requests are being blocked instead, this must be fixed before proceeding (possibly a problem with the name, value, or interpretation of the environment variable).
- Significant variations from the expected behavior must be investigated. The time numbers above are rough estimates, and may vary slightly in start time (up to 5 seconds, perhaps), but the logs should stop very close to the top of the minute (within 1 or 2 seconds).
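The automated staging traffic described above normally crosses the 600/minute unauthenticated threshold on its own. If it ever needs to be reproduced by hand, a throwaway loop along these lines (the endpoint, request count, and parallelism are arbitrary choices for this sketch) should push a single IP over the limit and produce `track` entries for the unauthenticated throttle:

```shell
# Throwaway traffic generator: ~700 unauthenticated requests from one IP, run
# with some parallelism so they land inside a single 60-second window and
# exceed the 600 requests / 60 seconds staging threshold.
seq 1 700 | xargs -P 20 -I{} \
  curl --silent --output /dev/null "https://staging.gitlab.com/explore"
```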
Production
- https://log.gprd.gitlab.net/goto/0d88d3ac4438c029c0830e82c4fef23d
  - If any logs here are not noted as dry-run (`env` field set to `track` means dry-run; `throttle` would be wrong), abort immediately and disable the checkboxes on https://gitlab.com/admin/application_settings/network, under "User and IP Rate Limits", before evaluating what has occurred. Given the time it takes to apply, strongly prefer fixing and rolling forward.
  - This view may require some adjustment to add useful fields. It's possible that nothing will log here, because our thresholds are high. For anything that does log, verify against total traffic for the user or IP address in the same minute (and surrounding minutes, for context) whether the rate-limit/block would have been reasonable or not.
- https://log.gprd.gitlab.net/goto/19ef2e04fc7b705391166c3eda37fc6d - monitor the rate of 429s.
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Immediate: 1 minute, Full: 1.5 hours
- Disable the three checkboxes on https://gitlab.com/admin/application_settings/network, under "User and IP Rate Limits".
  - This should be sufficient to avoid the most likely problems.
- Revert and apply the relevant MRs (k8s and Chef) to get rid of the environment variables. If urgency is required, running chef with `knife ssh -C 6` should be preferred (a sketch follows below); it will be fine on web + API, but will cause low-grade errors on the git fleet when Puma is restarted (because git-over-SSH traffic won't drain from HAProxy, and will try to use Puma to authenticate). This should be weighed against the impact of the problem that required the rollback.
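For the urgent path, a sketch of the parallel chef run is below. The role query simply combines the roles already referenced in this issue and is illustrative rather than copied from a runbook; since chef-client was stopped via systemd earlier, the service is started again once the reverted config has converged.

```shell
# Urgent rollback sketch: converge the reverted Chef config on the web, API,
# and git front-ends, 6 nodes at a time, then restart the chef-client service.
# Expect the low-grade git-over-SSH errors described above while Puma restarts
# on the git fleet.
knife ssh -C 6 \
  'roles:gprd-base-fe-web OR roles:gprd-base-fe-api OR roles:gprd-base-fe-git' \
  'sudo chef-client && sudo systemctl start chef-client'
```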
Monitoring
Key metrics to observe
The log views linked in Post-Change steps are important and should be monitored.
- Metric: Rate of 429s
- Location: https://log.gprd.gitlab.net/goto/19ef2e04fc7b705391166c3eda37fc6d
- What changes to this metric should prompt a rollback: Any substantial change. Note that there is variation over time, and Protected Paths is already enabled, so some Rails requests are already being 429'd; changes in that traffic will show up in this graph, but its general shape/structure shouldn't change radically. If it does, we're limiting traffic unexpectedly (dry-run is not having the expected effect).
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
None.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.