First sidekiq routing rules using exclude_from_gitlab_com on staging
Staging Change
Change Summary
For scalability#1072 (closed), as part of changing the catchall sidekiq shard to use a single queue (`default`), we want to exercise the routing rules in a safe manner. The `exclude_from_gitlab_com` tagged workers are convenient because:
- They only run on staging currently (for historical performance reasons; this restriction is soon to be eliminated)
- They include Geo jobs, which run very frequently, so we'll get good data quickly
- They include Chaos::SleepWorker, which does nothing but sleep and can be used as a manually dispatched job for testing
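For context, each routing rule maps a worker-matching query to a destination queue, and workers matching no rule keep their per-worker queue. The console check below is an illustrative sketch only: the query `tags=exclude_from_gitlab_com` routing to `default` is an assumption about the rule shape, and the authoritative definition is in the chef-repo and k8s-workloads MRs linked in the steps below.

```ruby
# Staging Rails console -- illustrative sketch, not the authoritative config.
# Each routing rule is a [query, queue] pair; the expected shape shown in the
# comment below is an assumption, so compare against the linked MRs.
Gitlab.config.sidekiq.routing_rules
# => [["tags=exclude_from_gitlab_com", "default"], ["*", nil]]  (expected shape)
```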
Change Details
- Services Impacted - Service::Redis
- Change Technician - @cmiskell
- Change Criticality - C3
- Change Type - change::scheduled
- Change Reviewer - Self reviewed: staging only, will develop/enhance the process and get a review for production.
- Due Date - 2021-05-26 23:00 UTC
- Time tracking - 1hr
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 minute
- Obtain reviews and approvals on https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/21 and gitlab-com/gl-infra/k8s-workloads/gitlab-com!879 (merged)
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 40 minutes
- In parallel:
  - Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/21 and wait 35 minutes for chef to run across the staging VM fleet.
  - Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!879 (merged) and wait for it to apply to the staging kubernetes fleet.
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30 minutes
- Check https://sentry.gitlab.net/gitlab/staginggitlabcom/?query=is%3Aunresolved+InvalidRoutingRuleError to ensure no routing rule errors are being reported; if they are, roll back immediately.
- In the sidekiq logs, verify that Geo jobs are running (`Geo::RepositoryVerification::Primary::SingleWorker` being the most common, at a rate of several per second) and that the queue they're running in is `default`, not the per-worker queue derived from their class name. If they are not running at all (look for entries with a `job_status` of `done`; none appearing means jobs are not completing), then roll back. If they are running but still using their per-worker queues, debug the situation and only roll back if there is no obvious resolution after a reasonable initial investigation.
  - NB: This will not be possible to test in production, as we're not running Geo there.
- Run a Chaos::SleepWorker job manually as well (see the console sketch after this list):
  - In a Rails console, run `Chaos::SleepWorker.perform_async(0)`
  - In the logs, verify that it ran, with queue `default`, not `chaos:chaos_sleep`
  - NB: this will be possible to replicate in the production version of this change, when we also stop excluding `exclude_from_gitlab_com` jobs
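A minimal Rails console sketch of the manual dispatch and a quick queue-level check, using only standard Sidekiq APIs; because a zero-second sleep job is normally picked up almost immediately, the structured sidekiq logs remain the authoritative place to confirm the `queue` field is `default`:

```ruby
# Staging Rails console -- illustrative sketch only.
require 'sidekiq/api'

jid = Chaos::SleepWorker.perform_async(0)

# If the job is still waiting, it should appear on 'default' rather than on a
# per-worker 'chaos:chaos_sleep' queue. An empty/false result here just means it
# was already consumed; fall back to the sidekiq logs and check the 'queue'
# field for this jid.
Sidekiq::Queue.new('default').any? { |job| job.jid == jid }
Sidekiq::Queue.new('chaos:chaos_sleep').size
```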
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
Estimated Time to Complete (mins) - 30 minutes
Monitoring
Key metrics to observe
- Metric: Queueing
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
    - In the 'Sidekiq Queues' section, the 'Sidekiq Aggregated Queue length' and 'Sidekiq Queue Lengths per Queue' panels.
  - What changes to this metric should prompt a rollback: Any higher queuing than normal; queuing is usually transient.
- Metric: RPS
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=975161318&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: Any inexplicable drop; the normal pattern is very consistent, and any drop will be reasonably obvious.
    - This would indicate that jobs are not completing processing, which will probably also show up in the queue lengths.
- Metric: Shard details
  - Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?viewPanel=17&orgId=1&from=now-2h&to=now&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&var-shard=All
  - What changes to this metric should prompt a rollback: The distribution of work across shards changing in an inexplicable way. This link is also helpful, as it excludes catchall, making the data for the lower-rate shards more discernible.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.