
Enable RackAttack rate-limiting in dry-run mode

Production Change

Change Summary

Per &341 (closed), and specifically scalability#656, we want to enable RackAttack on gitlab.com. The first step, which this Change Issue covers, is to enable the new throttles in dry-run mode so that we can evaluate their accuracy and potential effectiveness before they affect real users.

Initial numbers are derived in scalability#656 (comment 445997862) but are subject to change during and after this issue.

Change Details

  1. Services Impacted - ServiceGitLab Rails
  2. Change Technician - @cmiskell
  3. Change Criticality - C3
  4. Change Type - changescheduled
  5. Change Reviewer - @msmiley
  6. Due Date - 2020-11-20 01:10 UTC
  7. Time tracking - 3.25hrs
  8. Downtime Component - None

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 0

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Staging: 30 minutes, Production 1.5 hr

Staging

  • Stop chef on all front-end nodes: knife ssh 'roles:gstg-base-fe-web OR roles:gstg-base-fe-api' "sudo systemctl stop chef-client"
  • Undraft and merge kubernetes MR (gitlab-com/gl-infra/k8s-workloads/gitlab-com!522 (merged)) and let it deploy
  • Undraft and merge chef MR (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4594) and wait for the apply_to_staging job to run
  • Run chef on the frontend web + API VMs using chatops (NB: cannot use gstg_base_fe because it contains git nodes; chef is intentionally disabled there):
  • /chatops run deploycmd chefclient role_gstg_base_fe_api --no-check
  • /chatops run deploycmd chefclient role_gstg_base_fe_web --no-check
  • At https://staging.gitlab.com/admin/application_settings/network, under "User and IP Rate Limits":
    • Enable unauthenticated request rate limit @ 600 requests per period, with period of 60 seconds
      • Lower than production, and also lower than the rate of automated unauthenticated git and web traffic that staging receives, so both will trigger logs.
    • Enable authenticated API request rate limit @ 1000 requests per period, with period of 60 seconds
    • Enable authenticated web request rate limit @ 2000 requests per period, with period of 60 seconds
      • These two are much higher than we see in staging normally, so we can verify that this does not cause logging in the happy case.
  • Monitor per post-change steps
  • Adjust the threshold for unauthenticated requests to 1500 requests per period, to eliminate unnecessary logging from normal traffic.
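The effect of the thresholds above in dry-run mode can be sketched with a toy fixed-window counter. This is a simplified model for illustration only, not RackAttack's actual implementation; the class and function names are hypothetical, and the 800 requests/minute figure is the rough staging web traffic estimate from the post-change steps.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Toy fixed-window rate limiter (hypothetical, for illustration).

    In dry-run mode, requests over the limit are only marked 'track'
    (logged); otherwise they are marked 'throttle' (blocked).
    """

    def __init__(self, limit, period_s=60, dry_run=True):
        self.limit = limit
        self.period_s = period_s
        self.dry_run = dry_run
        self.counts = defaultdict(int)  # (client, window index) -> request count

    def check(self, client, now_s):
        """Return 'allow', 'track', or 'throttle' for one request."""
        window = int(now_s // self.period_s)  # windows align to multiples of the period
        self.counts[(client, window)] += 1
        if self.counts[(client, window)] <= self.limit:
            return "allow"
        return "track" if self.dry_run else "throttle"


def simulate_minute(limiter, client, requests_per_minute):
    """Send evenly spaced requests across one 60 s window and tally outcomes."""
    tally = defaultdict(int)
    for i in range(requests_per_minute):
        now_s = i * 60.0 / requests_per_minute
        tally[limiter.check(client, now_s)] += 1
    return dict(tally)


# Initial staging threshold: over-limit requests are logged, none blocked.
print(simulate_minute(FixedWindowLimiter(600), "198.51.100.7", 800))
# Adjusted staging threshold: the same traffic stays under the limit.
print(simulate_minute(FixedWindowLimiter(1500), "198.51.100.7", 800))
```

Under this model, the initial 600 limit produces log entries for the traffic staging already receives, while the adjusted 1500 limit leaves that traffic untouched and unlogged.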

Production

  • Stop chef on all front-end nodes: knife ssh 'roles:gprd-base-fe-web OR roles:gprd-base-fe-api' "sudo systemctl stop chef-client"
  • Undraft and merge kubernetes MR (gitlab-com/gl-infra/k8s-workloads/gitlab-com!523 (merged)) and let it deploy
  • Undraft and merge chef MR (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4595) and run the apply_to_production job when ready.
  • Run chef on the frontend web + API VMs using chatops (in parallel for slight efficiencies and better log retention on the CI job due to volume of output):
  • /chatops run deploycmd chefclient role_gprd_base_fe_api --skip-haproxy --no-check
  • /chatops run deploycmd chefclient role_gprd_base_fe_web --skip-haproxy --no-check
  • /chatops run deploycmd chefclient role_gprd_base_fe_git --no-check
  • At https://gitlab.com/admin/application_settings/network, under "User and IP Rate Limits":
    • Enable unauthenticated request rate limit @ 1500 requests per period, with period of 60 seconds
    • Enable authenticated API request rate limit @ 1000 requests per period, with period of 60 seconds
    • Enable authenticated web request rate limit @ 2000 requests per period, with period of 60 seconds
  • Monitor per post-change steps

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - Staging 10 minutes, Production 1 hour

Staging

  • https://nonprod-log.gitlab.net/goto/11de98b1b6840a1792e24875a03daf06 - we expect to see a few hundred logs a minute from two IP addresses at a time (one to web, one to git)
    • The web logs should occur from about 45 seconds in, until the top of the minute (roughly 800/minute total traffic)
    • The git logs should occur from about 30 seconds in, until the top of the minute (roughly 1200/minute total traffic)
    • They must have the 'dry run' indicator in the logs (env field set to track, not throttle). If requests are being blocked instead, this must be fixed before proceeding (possibly a problem with the name, value, or interpretation of the environment variable)
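The dry-run check above could be automated roughly as follows. This is a minimal sketch that assumes the rate-limit log entries have already been exported as a list of dictionaries with an `env` key; the function name and the `remote_ip` field in the sample data are hypothetical.

```python
def find_non_dry_run(records):
    """Return the records whose 'env' field is anything other than 'track'."""
    return [r for r in records if r.get("env") != "track"]


# Hypothetical sample data: the second entry indicates real throttling.
sample = [
    {"env": "track", "remote_ip": "198.51.100.7"},
    {"env": "throttle", "remote_ip": "198.51.100.8"},
]

offenders = find_non_dry_run(sample)
if offenders:
    print(f"{len(offenders)} entries are not dry-run; fix before proceeding")
```

An empty result means every matching entry was logged in dry-run mode.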

Significant variations from the expected behavior must be investigated. The times above are rough estimates: the start time may vary slightly (perhaps by up to 5 seconds), but logs should stop very close to the top of the minute (within 1 or 2 seconds at most).
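The expected timings follow from simple arithmetic: with a limit of 600 per 60 seconds and roughly evenly spaced traffic, the limit is reached at 600/800 of the minute for web and 600/1200 of the minute for git. A small sketch of that calculation, assuming evenly spaced requests and windows that reset at the top of the minute (the function names are illustrative):

```python
def expected_log_start(limit, requests_per_period, period_s=60):
    """Seconds into the window at which evenly spaced traffic exceeds the limit.

    Returns None when the traffic never exceeds the limit.
    """
    if requests_per_period <= limit:
        return None
    return period_s * limit / requests_per_period


def expected_logs_per_period(limit, requests_per_period):
    """Number of over-limit requests (log entries) expected per window."""
    return max(0, requests_per_period - limit)


def start_within_tolerance(observed_start_s, expected_start_s, tolerance_s=5.0):
    """Check an observed start time against the rough 5 second allowance."""
    return abs(observed_start_s - expected_start_s) <= tolerance_s


print(expected_log_start(600, 800))    # web: 45.0
print(expected_log_start(600, 1200))   # git: 30.0
print(expected_logs_per_period(600, 800), expected_logs_per_period(600, 1200))  # 200 600
```

The 200 and 600 over-limit requests per minute are consistent with the "few hundred logs a minute" expected above.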

Production

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Immediate: 1 minute, Full: 1.5hr

  • Disable the three checkboxes on https://gitlab.com/admin/application_settings/network, under "User and IP Rate Limits".
    • This should be sufficient to avoid the most likely problems
  • Revert and apply the relevant MRs (k8s and chef) to get rid of the environment variables. If urgency is required, running chef with "knife ssh -C 6" should be preferred; it will be fine on web + api, but will cause low-grade errors on the git fleet when puma is restarted (because git-over-ssh traffic won't drain from haproxy, and will try to use puma to authenticate). This should be weighed against the impact of the problem that required rollback.

Monitoring

Key metrics to observe

The log views linked in Post-Change steps are important and should be monitored.

  • Metric: Rate of 429's
    • Location: https://log.gprd.gitlab.net/goto/19ef2e04fc7b705391166c3eda37fc6d
    • What changes to this metric should prompt a rollback: Any substantial change. Note that there is variation over time, and Protected Paths is already enabled, so some Rails requests are already receiving 429s and changes in that traffic will show up in this graph. The general shape/structure of the graph should not change radically; if it does, we are limiting traffic unexpectedly (dry-run is not having the expected effect)
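One way to make "substantial change" concrete while watching the graph is to compare the 429 rate against a pre-change baseline. The sketch below is illustrative only: the 50% tolerance is a hypothetical placeholder, not a threshold defined by this issue, and the judgment about the graph's overall shape still rests with the change technician.

```python
def relative_change(baseline_per_min, current_per_min):
    """Fractional change in the 429 rate relative to the pre-change baseline."""
    if baseline_per_min == 0:
        return float("inf") if current_per_min > 0 else 0.0
    return (current_per_min - baseline_per_min) / baseline_per_min


def should_investigate(baseline_per_min, current_per_min, tolerance=0.5):
    """Flag a change larger than the (hypothetical) tolerance in either direction."""
    return abs(relative_change(baseline_per_min, current_per_min)) > tolerance


print(should_investigate(100, 105))  # small drift, within normal variation
print(should_investigate(100, 400))  # large jump, dry-run may not be in effect
```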

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None.

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.
Edited by Craig Miskell