Enable RackAttack rate-limiting in dry-run mode
Production Change
Change Summary
Per &341 (closed), and specifically scalability#656, we want to enable RackAttack on gitlab.com. The first step is to enable it in dry-run mode for the new throttles, so we can evaluate their accuracy and potential effectiveness before we affect users for real; this change issue does exactly that.
Initial numbers are derived in scalability#656 (comment 445997862) but are subject to change during and after this issue.
Change Details
- Services Impacted - GitLab Rails
- Change Technician - @cmiskell
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @msmiley
- Due Date - 2020-11-20 01:10 UTC
- Time tracking - 3.25 hrs
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 0
- scalability#629 has been completed and the relevant code deployed
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Staging: 30 minutes, Production: 1.5 hours
Staging
- Stop chef on all front-end nodes: `knife ssh 'roles:gstg-base-fe-web OR roles:gstg-base-fe-api' "sudo systemctl stop chef-client"`
- Undraft and merge the Kubernetes MR (gitlab-com/gl-infra/k8s-workloads/gitlab-com!522 (merged)) and let it deploy.
- Undraft and merge the Chef MR (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4594) and wait for the apply_to_staging job to run.
- Run chef on the frontend web + API VMs using chatops (NB: cannot use gstg_base_fe because it contains `git` nodes; chef is intentionally disabled there):
  - `/chatops run deploycmd chefclient role_gstg_base_fe_api --no-check`
  - `/chatops run deploycmd chefclient role_gstg_base_fe_web --no-check`
- At https://staging.gitlab.com/admin/application_settings/network, under "User and IP Rate Limits" (an equivalent API sketch follows this list):
  - Enable unauthenticated request rate limit @ 600 requests per period, with a period of 60 seconds. This is lower than production, but also lower than the rate of automated unauthenticated `git` and `web` traffic staging gets, which will trigger logs from both.
  - Enable authenticated API request rate limit @ 1000 requests per period, with a period of 60 seconds.
  - Enable authenticated web request rate limit @ 2000 requests per period, with a period of 60 seconds. These two are much higher than we normally see in staging, so we can verify that this does not cause logging in the happy case.
- Monitor per post-change steps.
- Adjust the threshold for unauthenticated requests to 1500 requests per period, to eliminate unnecessary logging from normal traffic.
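For reference only, the same staging values can be applied without the admin UI via the application settings REST API. This is a sketch, not a step in this change: it assumes an admin personal access token exported as `GITLAB_ADMIN_TOKEN` (a name chosen here for illustration) and uses the documented `throttle_*` attribute names from the Application Settings API.

```shell
# Sketch only: set the staging throttle values via PUT /api/v4/application/settings.
# Assumes an admin PAT in $GITLAB_ADMIN_TOKEN; the admin UI remains the path used
# in the change steps above.
curl --request PUT \
  --header "PRIVATE-TOKEN: ${GITLAB_ADMIN_TOKEN}" \
  "https://staging.gitlab.com/api/v4/application/settings" \
  --data "throttle_unauthenticated_enabled=true" \
  --data "throttle_unauthenticated_requests_per_period=600" \
  --data "throttle_unauthenticated_period_in_seconds=60" \
  --data "throttle_authenticated_api_enabled=true" \
  --data "throttle_authenticated_api_requests_per_period=1000" \
  --data "throttle_authenticated_api_period_in_seconds=60" \
  --data "throttle_authenticated_web_enabled=true" \
  --data "throttle_authenticated_web_requests_per_period=2000" \
  --data "throttle_authenticated_web_period_in_seconds=60"
```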
Production
- Stop chef on all front-end nodes: `knife ssh 'roles:gprd-base-fe-web OR roles:gprd-base-fe-api' "sudo systemctl stop chef-client"`
- Undraft and merge the Kubernetes MR (gitlab-com/gl-infra/k8s-workloads/gitlab-com!523 (merged)) and let it deploy.
- Undraft and merge the Chef MR (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4595) and run the apply_to_production job when ready.
- Run chef on the frontend web + API VMs using chatops (in parallel for slight efficiencies and better log retention on the CI job, given the volume of output):
  - `/chatops run deploycmd chefclient role_gprd_base_fe_api --skip-haproxy --no-check`
  - `/chatops run deploycmd chefclient role_gprd_base_fe_web --skip-haproxy --no-check`
  - `/chatops run deploycmd chefclient role_gprd_base_fe_git --no-check`
- At https://gitlab.com/admin/application_settings/network, under "User and IP Rate Limits" (before enabling these, consider the dry-run verification sketch after this list):
  - Enable unauthenticated request rate limit @ 1500 requests per period, with a period of 60 seconds.
  - Enable authenticated API request rate limit @ 1000 requests per period, with a period of 60 seconds.
  - Enable authenticated web request rate limit @ 2000 requests per period, with a period of 60 seconds.
- Monitor per post-change steps.
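Before ticking the production checkboxes, it is worth a quick check that the dry-run configuration deployed by the k8s/Chef MRs is actually visible to the Rails processes (the same check works on staging). A minimal sketch, run on any front-end VM, assuming the dry-run behaviour is controlled by the `GITLAB_THROTTLE_DRY_RUN` environment variable (the variable name is an assumption of this sketch; the MRs are authoritative):

```shell
# Minimal sanity check on a front-end VM: print the dry-run variable as the
# Rails environment sees it. A nil/empty result would suggest the new throttles
# will block rather than merely log, which must be fixed before enabling them.
sudo gitlab-rails runner 'puts ENV["GITLAB_THROTTLE_DRY_RUN"].inspect'
```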
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Staging: 10 minutes, Production: 1 hour
Staging
- https://nonprod-log.gitlab.net/goto/11de98b1b6840a1792e24875a03daf06 - we expect to see a few hundred logs a minute from two IP addresses at a time (one for `web`, one for `git`). If this traffic ever needs to be reproduced by hand, see the sketch after this list.
  - The `web` logs should occur from about 45 seconds in, until the top of the minute (roughly 800/minute total traffic).
  - The `git` logs should occur from about 30 seconds in, until the top of the minute (roughly 1200/minute total traffic).
  - They must have the 'dry run' indicator in the logs (`env` field set to `track`, not `throttle`). If requests are being blocked instead, this must be fixed before proceeding (possibly a problem with the name, value, or interpretation of the environment variable).
- Significant variations from the expected behavior must be investigated. The time numbers above are rough estimates, and may vary slightly in start time (up to 5 seconds, perhaps), but the logs should stop very close to the top of the minute (within 1 or 2 seconds).
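The automated staging traffic described above normally crosses the 600/minute unauthenticated threshold on its own. If it ever needs to be reproduced by hand, a throwaway loop along these lines (the endpoint, request count, and parallelism are arbitrary choices for this sketch) should push a single IP over the limit and produce `track` entries for the unauthenticated throttle:

```shell
# Throwaway traffic generator: ~700 unauthenticated requests from one IP, run
# with some parallelism so they land inside a single 60-second window and
# exceed the 600 requests / 60 seconds staging threshold.
seq 1 700 | xargs -P 20 -I{} \
  curl --silent --output /dev/null "https://staging.gitlab.com/explore"
```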
Production
- https://log.gprd.gitlab.net/goto/0d88d3ac4438c029c0830e82c4fef23d
  - If any logs here are not noted as dry-run (`env` field set to `track` means dry-run; `throttle` would be wrong), abort immediately and disable the checkboxes on https://gitlab.com/admin/application_settings/network, under "User and IP Rate Limits", before evaluating what has occurred. Given the time it takes to apply, strongly prefer fixing and rolling forward.
  - This view may require some adjustment to add useful fields. It's possible that nothing will log here, because our thresholds are high. For anything that does log, verify against total traffic for the user or IP address in the same minute (and surrounding minutes, for context) whether the rate-limit/block would have been reasonable or not.
- https://log.gprd.gitlab.net/goto/19ef2e04fc7b705391166c3eda37fc6d - monitor the rate of 429s.
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Immediate: 1 minute, Full: 1.5 hours
- Disable the three checkboxes on https://gitlab.com/admin/application_settings/network, under "User and IP Rate Limits".
  - This should be sufficient to avoid the most likely problems.
- Revert and apply the relevant MRs (k8s and Chef) to get rid of the environment variables. If urgency is required, running chef with `knife ssh -C 6` should be preferred (a sketch follows below); it will be fine on web + API, but will cause low-grade errors on the git fleet when Puma is restarted (because git-over-SSH traffic won't drain from HAProxy, and will try to use Puma to authenticate). This should be weighed against the impact of the problem that required the rollback.
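For the urgent path, a sketch of the parallel chef run is below. The role query simply combines the roles already referenced in this issue and is illustrative rather than copied from a runbook; since chef-client was stopped via systemd earlier, the service is started again once the reverted config has converged.

```shell
# Urgent rollback sketch: converge the reverted Chef config on the web, API,
# and git front-ends, 6 nodes at a time, then restart the chef-client service.
# Expect the low-grade git-over-SSH errors described above while Puma restarts
# on the git fleet.
knife ssh -C 6 \
  'roles:gprd-base-fe-web OR roles:gprd-base-fe-api OR roles:gprd-base-fe-git' \
  'sudo chef-client && sudo systemctl start chef-client'
```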
Monitoring
Key metrics to observe
The log views linked in Post-Change steps are important and should be monitored.
- Metric: Rate of 429s
- Location: https://log.gprd.gitlab.net/goto/19ef2e04fc7b705391166c3eda37fc6d
- What changes to this metric should prompt a rollback: Any substantial change. Note that there is variation over time, and Protected Paths is already enabled, so some Rails requests are already being 429'd; changes in that traffic will show up in this graph, but its general shape/structure shouldn't change radically. If it does, we're limiting traffic unexpectedly (dry-run is not having the expected effect).
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
None.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.