Enforce RackAttack rate limiting on gitlab.com

Production Change

Change Summary

As one of the final major steps in &148 (and particularly the sub-epic &379 (closed)), take RackAttack rate limiting on gitlab.com out of Dry Run mode and into Enforcing mode, with some related final adjustments.

Change Details

  1. Services Impacted - ServiceGitLab Rails, ServiceWeb, ServiceAPI
  2. Change Technician - @cmiskell
  3. Change Criticality - C2 - while fairly minor in mechanics, and we've done a lot of prep work to try to ensure it will have no effect, this is still a substantial enough change that we blogged about it, so I'm bumping it to C2.
  4. Change Type - changescheduled
  5. Change Reviewer - @craig
  6. Due Date - 2021-01-18 00:30 UTC
  7. Time tracking - 1.5 hours
  8. Downtime Component - None

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60 minutes

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 15 minutes

  • Look for Rack Attack logs that are not blocklist (a slightly unusual internal use of RackAttack that is already in operation), not track (dry-run mode), and not from the Protected Paths matcher: https://log.gprd.gitlab.net/goto/c8f8d8b08e191a8cd735ae0003697485
    • We expect to see such logs with json.env=throttle, where previously they were 'track'. These events occur regularly; we expect thousands per hour even at quiet times.
    • Check that none are left in track mode; any that remain may indicate Rails processes that have not yet restarted and still have the dry-run environment variable set (a command-line spot-check sketch follows this list).
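
If Kibana is slow or unavailable, a quick spot-check can also be done directly on a Rails node. This is a hedged sketch: it assumes the Rack::Attack events are written to auth_json.log (the usual Omnibus GitLab location) and that the structured field is named env, matching the Kibana query above.

```shell
# Count recent enforcement events -- we expect plenty of these:
sudo tail -n 20000 /var/log/gitlab/gitlab-rails/auth_json.log | grep -c '"env":"throttle"'

# Count recent dry-run events -- this should trend to zero once every Rails
# process has been restarted without the dry-run environment variable:
sudo tail -n 20000 /var/log/gitlab/gitlab-rails/auth_json.log | grep -c '"env":"track"'
```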

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 15 minutes

If RackAttack itself is causing urgent problems, the quickest rollback is to turn it off entirely, using the 3 checkboxes at https://gitlab.com/admin/application_settings/network under 'User and IP Rate Limits' (leave Protected Paths alone). This is roughly equivalent to the Dry Run mode in place before this change, but without logging or even checking, so it actually reduces load, particularly on Redis (by a noticeable percentage) but also on the Rails nodes. We could then carefully re-enable Dry Run mode before re-enabling the checkboxes.
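
If the admin UI itself is unavailable, the same three toggles can be flipped through the application settings API. A hedged sketch follows; the parameter names are assumed to correspond to the 'User and IP Rate Limits' checkboxes, so verify them against the API documentation before relying on this path.

```shell
# Requires an admin personal access token; parameter names assumed to match the UI toggles.
curl --request PUT --header "PRIVATE-TOKEN: <admin-token>" \
  "https://gitlab.com/api/v4/application/settings" \
  --data "throttle_unauthenticated_enabled=false" \
  --data "throttle_authenticated_api_enabled=false" \
  --data "throttle_authenticated_web_enabled=false"
```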

Otherwise, the haproxy and Rack Attack changes can be considered loosely independent in the event of rollback, i.e. it might be reasonable to roll back one but not the other, depending on the exact circumstances. It is somewhat unlikely that we would keep the haproxy change while turning off RackAttack, but the other way around is plausible, and rolling back both is definitely fine.

Rollback follows the same procedure, but with reverting MRs (the changes are all one-liners). In an emergency, editing the Chef roles by hand (knife role edit) may be acceptable, followed by MRs to formalize the change.
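
A rough sketch of the emergency by-hand path, under the assumption that the one-liners live in the usual gprd Chef roles; the role names below are placeholders, not the real ones.

```shell
# Edit the relevant roles in place (placeholders, not actual gprd role names):
knife role edit <gprd-rails-role>     # revert the RackAttack / dry-run env setting
knife role edit <gprd-haproxy-role>   # revert the haproxy change, if rolling that back too

# Then either wait for the next periodic chef-client run on the affected nodes,
# or trigger one by hand, and follow up with MRs that formalize the edits.
```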

It is also possible to add IP addresses or users to bypass the limits (gitlab-haproxy.frontend.whitelist.api for IPs and omnibus-gitlab.gitlab_rb.gitlab-rails.env.GITLAB_THROTTLE_USER_ALLOWLIST for users); we would normally want careful justification for doing so, but in the first blush of enabling enforcement it may be acceptable to add one or two entries if that allows us to keep enforcing for everyone else. Know that this is possible, but hold it in reserve for emergencies.
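
If the user allowlist route is taken, a quick way to confirm the value actually reached a Rails node after Chef convergence is sketched below. The path is an assumption based on the standard Omnibus GitLab layout, where each gitlab-rails env var is written to its own file; a process restart is still needed for the running Rails workers to pick it up.

```shell
# Confirm the chef-managed env var landed on the node (assumed Omnibus layout):
sudo cat /opt/gitlab/etc/gitlab-rails/env/GITLAB_THROTTLE_USER_ALLOWLIST
```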

Monitoring

Key metrics to observe

  • Metric: Rails web HTTP 429 responses
    • Location: Rack::Attack Rollout Dashboard
    • What changes to this metric should prompt a rollback: More than 5% of HTTP responses result in 429, with no obvious justification (a rough query sketch follows this list)
  • Metric: RackAttack enforcement logs
    • Location: Rack Attack logs in Kibana (the link in Post-Change Steps above)
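
A rough sketch of the 5% threshold as a standalone query, for when the dashboard is unavailable. The metric and label names here are assumptions, not the dashboard's actual series; substitute whatever the Rack::Attack Rollout Dashboard itself queries.

```shell
# Ratio of 429 responses to all responses over the last 5 minutes (hypothetical
# metric/label names); rollback consideration kicks in if this stays above 0.05
# without an obvious justification.
curl -G "https://<prometheus-or-thanos>/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{status="429"}[5m])) / sum(rate(http_requests_total[5m]))'
```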

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.