[gstg] Use new redis-ratelimiting instance for Application Rate Limiting
Production Change
## Change Summary

As part of &526 (closed), specifically the first active step of scalability#1249 (closed), we want to start using the newly created Redis for Rate-limiting (R4R) cluster for Application Rate Limiting.

This change is for staging, and exists to write and test the process for the equivalent production change. It is a full change issue, rather than a simple feature-flag flip, because it includes deploying and testing a new connection to Redis.
## Change Details

- Services Impacted - Service::Redis
- Change Technician - @cmiskell
- Change Reviewer - @smcgivern
- Time tracking - 35min
- Downtime Component - None
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1 min

- [ ] Ensure that the gitlab_chart_version bump MR has been merged and applied
- [ ] Ensure that gitlab-org/gitlab!71196 (merged) has been merged and deployed to at least staging (i.e. scalability#1247 (closed) is functionally complete)
- [ ] Obtain review/approval on:
- [ ] Set label change::in-progress on this issue
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/664
- [ ] From a local copy of chef-repo, run `./bin/gkms-vault-edit gitlab-omnibus-secrets gstg` and add an entry for `redis_rate_limiting_instance` alongside the existing redis configs; use the same password, just adjust the identifier at the end.
- [ ] Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!1268 (merged). NB: this depends on the chef MR being merged first to have the expected effect; do not re-arrange the order.
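For reference, the secrets edit looks roughly like the following. This is a hypothetical sketch only: the real key layout, URL scheme, and endpoint must be copied from the neighbouring redis entries already present in gitlab-omnibus-secrets.

```shell
# Opens the decrypted secrets for editing in $EDITOR:
./bin/gkms-vault-edit gitlab-omnibus-secrets gstg
#
# Then, next to the existing redis entries, add something shaped like the
# line below (illustrative only -- mirror whatever structure the existing
# entries use, reusing the same password and changing only the identifier):
#
#   "redis_rate_limiting_instance": "redis://:<existing password>@<rate-limiting endpoint>"
```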
- [ ] Monitor/wait for the k8s pipeline to complete the staging deploy; it is ok to continue with the rest of the change while the QA-on-staging and production jobs are running.
- [ ] Verify the connection:
  - Check for the connection in the webservice configmap in k8s. In an SSH shell on the console server: `kubectl --cluster gke_gitlab-staging-1_us-east1-d_gstg-us-east1-d -n gitlab describe configmap gitlab-webservice | grep redis-rate`
  - Start a shell on the current redis primary and run: `sudo gitlab-redis-cli monitor | grep --line-buffered ping`
  - In a Rails console, execute `Gitlab::Redis::RateLimiting.with { |r| r.ping }`. We expect a "PONG" result, and to see the ping in the output of the monitor. If the monitor output is not seen, the connection may still be to the SharedState instance; try executing `Gitlab::Redis::RateLimiting.config_file_name` in the console to see what config file is in use.
  - Repeat this in a console started (`/srv/gitlab/bin/rails console`) from within a Rails docker container (ssh to a node + docker exec into the container).
  - Failure to see the expected traffic to the correct Redis instance is grounds for aborting this change issue.
- [ ] Set the feature flag use_rate_limiting_store_for_application_rate_limiter to "true" on staging: `/chatops run feature set use_rate_limiting_store_for_application_rate_limiter true --staging`
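The monitor-plus-ping verification above can be sanity-checked offline. The sketch below uses fabricated `MONITOR`-style lines (timestamps and addresses are made up, not output from a live instance) purely to show what the `grep --line-buffered ping` filter in the step keeps:

```shell
# Fabricated sample of `redis-cli monitor` output (not from a live instance).
sample='1633500000.000000 [0 10.0.0.1:12345] "ping"
1633500000.100000 [0 10.0.0.2:23456] "get" "cache:foo"
1633500000.200000 [0 10.0.0.1:12345] "ping"'

# Same filter as the change step; --line-buffered only matters on a live
# stream (it flushes each match immediately) and is a no-op here.
matched=$(printf '%s\n' "$sample" | grep --line-buffered ping)
printf '%s\n' "$matched"
```

On the real primary, each `r.ping` from the Rails console should surface as one such `"ping"` line; seeing nothing while the console returns "PONG" is the signal that the connection went elsewhere.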
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 2 minutes

- [ ] Start a shell on the current redis primary and run: `sudo gitlab-redis-cli monitor | grep application_rate_limiter`
- [ ] Add a note to any issue on any project on staging. Expect to see an `incr` for `application_rate_limiter:notes_create:user:<your user id>`, and possibly an `expire` as well, in the output of the monitoring session. The presence of these confirms that the application is now using the new instance to store this rate-limiting info.
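As a sketch of what success looks like, here are fabricated monitor lines (the user id 1234 and the TTL are illustrative, not real output) run through the same grep as the post-change step; only the rate-limiter traffic survives the filter:

```shell
# Fabricated `redis-cli monitor` lines; user id and TTL are illustrative.
sample='1633500001.0 [0 10.0.0.3:1111] "incr" "application_rate_limiter:notes_create:user:1234"
1633500001.1 [0 10.0.0.3:1111] "expire" "application_rate_limiter:notes_create:user:1234" "600"
1633500001.2 [0 10.0.0.4:2222] "get" "session:abc"'

# Same filter as the post-change step.
hits=$(printf '%s\n' "$sample" | grep application_rate_limiter)
printf '%s\n' "$hits"
```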
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 min

- [ ] Disable the feature flag, if we got that far: `/chatops run feature set use_rate_limiting_store_for_application_rate_limiter false --staging`
- [ ] If necessary (i.e. the presence of the configuration is the problem), revert the k8s and chef MRs and apply.
## Monitoring

### Key metrics to observe

- Metric: Redis operation rates
  - Location: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?viewPanel=77&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-1h&to=now
  - What changes to this metric should prompt a rollback: Not seeing the expected operations (incr/expire) reflected in the graph. Baseline operations are for replication; incr and expire are not otherwise used, and should show up (briefly/sporadically and at low rates) when this functionality is used.
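The dashboard's operation rates ultimately reflect Redis command counters, which can also be checked directly on the primary via `INFO commandstats`. The sketch below uses fabricated commandstats lines (the call counts are made up) to show the filter; the final comment shows the command one would actually run on the host:

```shell
# Fabricated `INFO commandstats` output; real counters live on the primary.
stats='cmdstat_get:calls=1000,usec=5000,usec_per_call=5.00
cmdstat_incr:calls=42,usec=210,usec_per_call=5.00
cmdstat_expire:calls=42,usec=120,usec_per_call=2.86'

# Keep only the incr/expire counters that this change is expected to move.
filtered=$(printf '%s\n' "$stats" | grep -E '^cmdstat_(incr|expire):')
printf '%s\n' "$filtered"

# On the primary itself:
#   sudo gitlab-redis-cli info commandstats | grep -E 'incr|expire'
```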
## Summary of infrastructure changes

- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.