First sidekiq routing rules using exclude_from_gitlab_com on staging
Staging Change
Change Summary
For scalability#1072 (closed), as part of changing the catchall sidekiq shard to use a single queue (`default`), we want to exercise the routing rules in a safe manner. The `exclude_from_gitlab_com` tagged workers are convenient because:
- They only run on staging currently (for historical performance reasons; this restriction is soon to be eliminated)
- They include Geo jobs, which run very frequently, so we'll get good data quickly
- They include Chaos::SleepWorker, which does nothing but sleep and can be used as a manually dispatched job for testing
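For context, each routing rule maps a worker-matching query to a destination queue, and workers matching no rule keep their per-worker queue. The console check below is an illustrative sketch only: the query `tags=exclude_from_gitlab_com` routing to `default` is an assumption about the rule shape, and the authoritative definition is in the chef-repo and k8s-workloads MRs linked in the steps below.

```ruby
# Staging Rails console -- illustrative sketch, not the authoritative config.
# Each routing rule is a [query, queue] pair; the expected shape shown in the
# comment below is an assumption, so compare against the linked MRs.
Gitlab.config.sidekiq.routing_rules
# => [["tags=exclude_from_gitlab_com", "default"], ["*", nil]]  (expected shape)
```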
Change Details
- Services Impacted - Service::Redis
- Change Technician - @cmiskell
- Change Criticality - C3
- Change Type - change::scheduled
- Change Reviewer - Self reviewed: staging only, will develop/enhance the process and get a review for production.
- Due Date - 2021-05-26 23:00 UTC
- Time tracking - 1hr
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 minute
- Obtain reviews and approvals on https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/21 and gitlab-com/gl-infra/k8s-workloads/gitlab-com!879 (merged)
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 40 minutes
- In parallel:
  - Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/21 and wait 35 minutes for chef to run across the staging VM fleet.
  - Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!879 (merged) and wait for it to apply to the staging kubernetes fleet.
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30 minutes
- Check https://sentry.gitlab.net/gitlab/staginggitlabcom/?query=is%3Aunresolved+InvalidRoutingRuleError to ensure no routing rule errors are being reported; if they are, roll back immediately.
- In the sidekiq logs, verify that Geo jobs are running (`Geo::RepositoryVerification::Primary::SingleWorker` being the most common, at a rate of several per second) and that the queue they're running in is `default`, not the per-worker queue derived from their class name. If they are not running at all (look for entries with a `job_status` of `done`; none appearing means jobs are not completing), then roll back. If they are running but still using their per-worker queues, debug the situation and only roll back if there is no obvious resolution after a reasonable initial investigation.
  - NB: This will not be possible to test in production, as we're not running Geo there.
- Run a Chaos::SleepWorker job manually as well (see the console sketch after this list):
  - In a Rails console, run `Chaos::SleepWorker.perform_async(0)`
  - In the logs, verify that it ran, with queue `default`, not `chaos:chaos_sleep`
  - NB: this will be possible to replicate in the production version of this change, when we also stop excluding `exclude_from_gitlab_com` jobs
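A minimal Rails console sketch of the manual dispatch and a quick queue-level check, using only standard Sidekiq APIs; because a zero-second sleep job is normally picked up almost immediately, the structured sidekiq logs remain the authoritative place to confirm the `queue` field is `default`:

```ruby
# Staging Rails console -- illustrative sketch only.
require 'sidekiq/api'

jid = Chaos::SleepWorker.perform_async(0)

# If the job is still waiting, it should appear on 'default' rather than on a
# per-worker 'chaos:chaos_sleep' queue. An empty/false result here just means it
# was already consumed; fall back to the sidekiq logs and check the 'queue'
# field for this jid.
Sidekiq::Queue.new('default').any? { |job| job.jid == jid }
Sidekiq::Queue.new('chaos:chaos_sleep').size
```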
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - 30 minutes
Monitoring
Key metrics to observe
- Metric: Queueing
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
    - In the 'Sidekiq Queues' section, the 'Sidekiq Aggregated Queue length' and 'Sidekiq Queue Lengths per Queue' panels.
  - What changes to this metric should prompt a rollback: Any higher queuing than normal; queuing is usually transient.
- Metric: RPS
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=975161318&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: Any inexplicable drop; the normal pattern is very consistent, and any drop will be reasonably obvious.
    - This would indicate that jobs are not completing processing, which will probably also show up in the queue lengths.
- Metric: Shard details
  - Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?viewPanel=17&orgId=1&from=now-2h&to=now&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&var-shard=All
  - What changes to this metric should prompt a rollback: The distribution of work across shards changing in an inexplicable way. This link is also helpful, as it excludes catchall, making the data for the lower-rate shards more discernible.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.