Enforce RackAttack rate limiting on gitlab.com

Production Change

Change Summary

As one of the final major steps in &148 (and particularly the sub-epic &379 (closed)), take RackAttack rate limiting on gitlab.com out of Dry Run mode and into Enforcing mode, with some related final adjustments.

Change Details

  1. Services Impacted - ServiceGitLab Rails, ServiceWeb, ServiceAPI
  2. Change Technician - @cmiskell
  3. Change Criticality - C2 - while fairly minor in mechanics, and we've done a lot of prep work to try to ensure it will have no effect, this is still a substantial enough change that we blogged about it, so I'm bumping it to C2.
  4. Change Type - changescheduled
  5. Change Reviewer - @craig
  6. Due Date - 2021-01-18 00:30 UTC
  7. Time tracking - 1.5 hours
  8. Downtime Component - None

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60 minutes

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 15 minutes

  • Look for Rack Attack logs that are not blocklist (a slightly unusual internal use of RackAttack that is already in operation), not track (dry-run mode), and not from the Protected Paths matcher: https://log.gprd.gitlab.net/goto/c8f8d8b08e191a8cd735ae0003697485
    • We expect to see such logs with json.env=throttle, where previously they were 'track'. These events occur regularly; we expect thousands per hour even at quiet times.
    • Check that none are left in track mode; any that remain may indicate Rails processes that have not yet restarted and still have the dry-run environment variable set (a command-line spot-check sketch follows this list).
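
If Kibana is slow or unavailable, a quick spot-check can also be done directly on a Rails node. This is a hedged sketch: it assumes the Rack::Attack events are written to auth_json.log (the usual Omnibus GitLab location) and that the structured field is named env, matching the Kibana query above.

```shell
# Count recent enforcement events -- we expect plenty of these:
sudo tail -n 20000 /var/log/gitlab/gitlab-rails/auth_json.log | grep -c '"env":"throttle"'

# Count recent dry-run events -- this should trend to zero once every Rails
# process has been restarted without the dry-run environment variable:
sudo tail -n 20000 /var/log/gitlab/gitlab-rails/auth_json.log | grep -c '"env":"track"'
```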

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 15 minutes

If RackAttack itself is causing urgent problems, the quickest rollback is to turn it off entirely, using the 3 checkboxes at https://gitlab.com/admin/application_settings/network under 'User and IP Rate Limits' (leave Protected Paths alone). This is roughly equivalent to the Dry Run mode in place before this change, but without logging or even checking, so it actually reduces load, particularly on Redis (by a noticeable percentage) but also on the Rails nodes. We could then carefully re-enable Dry Run mode before re-enabling the checkboxes.
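
If the admin UI itself is unavailable, the same three toggles can be flipped through the application settings API. A hedged sketch follows; the parameter names are assumed to correspond to the 'User and IP Rate Limits' checkboxes, so verify them against the API documentation before relying on this path.

```shell
# Requires an admin personal access token; parameter names assumed to match the UI toggles.
curl --request PUT --header "PRIVATE-TOKEN: <admin-token>" \
  "https://gitlab.com/api/v4/application/settings" \
  --data "throttle_unauthenticated_enabled=false" \
  --data "throttle_authenticated_api_enabled=false" \
  --data "throttle_authenticated_web_enabled=false"
```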

Otherwise, the haproxy and Rack Attack changes can be considered loosely independent in the event of rollback, i.e. it might be reasonable to roll back one but not the other, depending on the exact circumstances. It is somewhat unlikely that we would keep the haproxy change while turning off RackAttack, but the other way around is plausible, and rolling back both is definitely fine.

Rollback follows the same procedure, but with reverting MRs (the changes are all one-liners). In an emergency, editing the Chef roles by hand (knife role edit) may be acceptable, followed by MRs to formalize the change.
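
A rough sketch of the emergency by-hand path, under the assumption that the one-liners live in the usual gprd Chef roles; the role names below are placeholders, not the real ones.

```shell
# Edit the relevant roles in place (placeholders, not actual gprd role names):
knife role edit <gprd-rails-role>     # revert the RackAttack / dry-run env setting
knife role edit <gprd-haproxy-role>   # revert the haproxy change, if rolling that back too

# Then either wait for the next periodic chef-client run on the affected nodes,
# or trigger one by hand, and follow up with MRs that formalize the edits.
```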

It is also possible to add IP addresses or users to bypass the limits (gitlab-haproxy.frontend.whitelist.api for IPs and omnibus-gitlab.gitlab_rb.gitlab-rails.env.GITLAB_THROTTLE_USER_ALLOWLIST for users); we would normally want careful justification for doing so, but in the first blush of enabling enforcement it may be acceptable to add one or two entries if that allows us to keep enforcing for everyone else. Know that this is possible, but hold it in reserve for emergencies.
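
If the user allowlist route is taken, a quick way to confirm the value actually reached a Rails node after Chef convergence is sketched below. The path is an assumption based on the standard Omnibus GitLab layout, where each gitlab-rails env var is written to its own file; a process restart is still needed for the running Rails workers to pick it up.

```shell
# Confirm the chef-managed env var landed on the node (assumed Omnibus layout):
sudo cat /opt/gitlab/etc/gitlab-rails/env/GITLAB_THROTTLE_USER_ALLOWLIST
```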

Monitoring

Key metrics to observe

  • Metric: Rails web HTTP 429 responses
    • Location: Rack::Attack Rollout Dashboard
    • What changes to this metric should prompt a rollback: More than 5% of HTTP responses result in 429, with no obvious justification (a rough query sketch follows this list)
  • Metric: RackAttack enforcement logs
    • Location: Rack Attack logs in Kibana (the link in Post-Change Steps above)
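
A rough sketch of the 5% threshold as a standalone query, for when the dashboard is unavailable. The metric and label names here are assumptions, not the dashboard's actual series; substitute whatever the Rack::Attack Rollout Dashboard itself queries.

```shell
# Ratio of 429 responses to all responses over the last 5 minutes (hypothetical
# metric/label names); rollback consideration kicks in if this stays above 0.05
# without an obvious justification.
curl -G "https://<prometheus-or-thanos>/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{status="429"}[5m])) / sum(rate(http_requests_total[5m]))'
```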

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.