## Sharding Sidekiq for horizontal scaling
Sidekiq is a service that monitors a queue for work. When a job is placed on a queue, Sidekiq instantiates the corresponding worker class to process it. Work is placed on the queue by GitLab cron jobs and by user activity on GitLab. GitLab uses a single Redis instance as the queue. Queues are divided
amongst a large fleet of servers, each with a specific configuration.
This configuration includes which queues that set of servers listens to, how
many Sidekiq worker processes run, and the concurrency level of each worker.
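
For illustration only, a minimal worker sketch shows how a job class declares the queue it is pushed to and how enqueuing writes the job to Redis. The class name, queue assignment, and arguments below are hypothetical and not part of the GitLab codebase.

```ruby
# Illustrative worker using the standard Sidekiq API; not GitLab code.
class ExampleMailerWorker
  include Sidekiq::Worker
  sidekiq_options queue: :mailers # the Redis list this job is pushed onto

  def perform(user_id)
    # ... deliver the mail for user_id ...
  end
end

# Enqueuing serializes the job and pushes it onto queue:mailers in Redis;
# a Sidekiq process configured to listen on that queue picks it up.
ExampleMailerWorker.perform_async(42)
```
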
Our goal is to horizontally scale Sidekiq through an application-layer router that routes jobs to a newly provisioned Redis instance, `redis-sidekiq-catchall-a`. The `catchall` Kubernetes deployment
will poll the new Redis instance for jobs in the `queue:default` and `queue:mailers` queues.
This `redis-sidekiq-catchall-a` is deployed with the labels `type=redis-sidekiq` and `shard=catchall_a` as part of the
existing [RedisSidekiq service](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/redis-sidekiq.jsonnet).
The work can be tracked in [Scalability epic 1218](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1218).
![sidekiq-sharding-architecture](img/sidekiq-sharding.png)
The diagram above represents the target state after the workload migration for catchall to the new `redis-sidekiq-catchall-a`.
In a sharded state, the `catchall` K8s deployment polls from a separate Redis compared to the rest of the Sidekiq K8s deployments.
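
The sketch below illustrates the general shape of queue-based routing at enqueue time, using Sidekiq's `Sidekiq::Client.via` block API for pushing to an alternate Redis pool. It is a simplified assumption rather than GitLab's actual router: the connection-pool construction, environment variable name, queue-to-shard map, and helper method are all hypothetical, and the real router is additionally gated behind feature flags.

```ruby
require "sidekiq"

# Hypothetical connection pool for the new shard; the env var name is a placeholder.
CATCHALL_A_POOL = Sidekiq::RedisConnection.create(url: ENV["REDIS_SIDEKIQ_CATCHALL_A_URL"])

# Hypothetical queue-to-shard map: only default and mailers move to the new shard.
QUEUE_TO_SHARD_POOL = {
  "default" => CATCHALL_A_POOL,
  "mailers" => CATCHALL_A_POOL
}.freeze

# Push a job, redirecting it to the Redis shard that owns its queue.
def push_routed(job_class_name, args, queue)
  payload = { "class" => job_class_name, "args" => args, "queue" => queue }
  pool = QUEUE_TO_SHARD_POOL[queue]

  if pool
    # Sidekiq::Client.via makes every push inside the block use the given pool.
    Sidekiq::Client.via(pool) { Sidekiq::Client.push(payload) }
  else
    Sidekiq::Client.push(payload) # falls through to the existing redis-sidekiq
  end
end
```

The key design point is that the routing decision is made entirely in the application layer at push time; the Sidekiq server processes simply poll whichever Redis instance they are configured against.
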
### Monitoring and Alerting
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to the troubleshooting runbooks.
The [Sidekiq survival guide for SREs](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/sidekiq/sidekiq-survival-guide-for-sres.md) is the most useful troubleshooting guide for most Sidekiq issues.
For shard-specific troubleshooting, refer to the [sharding guide](TBD).
- [ ] Link to an example of an alert and a corresponding runbook.
There are no new alerts. As the new Redis is monitored as a shard of the `redis-sidekiq` service, the related alerts will have the `shard=catchall_a` label.
- [ ] Confirm that on-call SREs have access to this service and will be on-call. If this is not the case, please add an explanation here.
Yes, the new Redis and the Sidekiq K8s deployments are accessible via SSH for all SREs.
### Operational Risk
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to notes or testing results for assessing the outcome of failures of individual components.
We used a two-phase rollout: we migrated a [subset of workers](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17779), rolled back, and then [migrated the entire Sidekiq shard](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17841).
The metrics we tracked (detailed in the rollout links) surfaced non-critical bugs, which have since been resolved.
- [ ] What are the potential scalability or performance issues that may result with this change?
This change promotes horizontal scalability of Sidekiq by sharding the workload across multiple Redis instances. There should be no performance issues on Sidekiq as a result of this change.
- [ ] What are a few operational concerns that will not be present at launch, but may be a concern later?
One concern could be feature flag state corruption through a bug or a mis-toggle. This could happen in the future and result in jobs being directed back to `redis-sidekiq`.
- [ ] Are there any single points of failure in the design? If so list them here.
The `redis-sidekiq-catchall-a` instance can be considered a single point of failure: if the majority of its VMs were destroyed or down for any reason, Sidekiq jobs for `default` and `mailers` could not
be enqueued correctly. However, we already carry this risk today, since all Sidekiq workloads are currently served from `redis-sidekiq`.
- [ ] As a thought experiment, think of worst-case failure scenarios for this product feature, how can the blast-radius of the failure be isolated?
The worst-case failure scenario would be incorrect routing logic, causing jobs to be enqueued but not picked up by any Sidekiq workers.
The blast radius of such a failure is already isolated to 2 of the 10 queues; however, these 2 queues account for ~50% of the load.
This can be resolved by using a temporary deployment, as outlined in the [troubleshooting guide](TBD), to process the dangling jobs, or by performing a one-time job migration
across instances once the bug has been resolved (a rough sketch of such a migration follows this list).
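
As a rough sketch only, and not the actual runbook tooling, a one-time migration of dangling jobs between Redis instances could look like the following. The environment variable names are placeholders, and any real migration would also need to handle jobs still being enqueued concurrently.

```ruby
require "redis"

# Placeholder connection settings; not the production configuration.
source = Redis.new(url: ENV["OLD_REDIS_SIDEKIQ_URL"])
dest   = Redis.new(url: ENV["REDIS_SIDEKIQ_CATCHALL_A_URL"])

%w[queue:default queue:mailers].each do |queue|
  moved = 0
  # Sidekiq enqueues with LPUSH and dequeues from the opposite end, so popping
  # with RPOP takes the oldest job first; LPUSH on the destination preserves
  # the relative order of the migrated jobs.
  while (job = source.rpop(queue))
    dest.lpush(queue, job)
    moved += 1
  end
  puts "moved #{moved} jobs from #{queue}"
end
```
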
### Backup, Restore, DR and Retention
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Are there any special requirements for Disaster Recovery for both Regional and Zone failures beyond our current Disaster Recovery processes that are in place?
No special requirements.
- [ ] How does data age? Can data over a certain age be deleted?
The data in `redis-sidekiq-catchall-a` is fairly transient: it consists of newly enqueued jobs, scheduled jobs (which will be removed in the near future), and retry/dead jobs, which can
persist for a longer period of time. However, in the context of gitlab.com, we do not act on dead jobs using the `/admin/sidekiq` page.
The data that remains static is not critical to operations. Such data includes metrics, Sidekiq process metadata, and cron metadata, all of which are re-populated on a Sidekiq deployment restart or fresh deployment (see the inspection sketch after this list).
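
For context, these transient and static data sets can be inspected with Sidekiq's standard admin API from a console whose Sidekiq configuration points at the shard in question. This is an illustrative, read-only sketch rather than a runbook step.

```ruby
require "sidekiq/api"

# Transient job data: queued, scheduled, retry, and dead jobs.
puts "default queue size: #{Sidekiq::Queue.new('default').size}"
puts "mailers queue size: #{Sidekiq::Queue.new('mailers').size}"
puts "scheduled jobs:     #{Sidekiq::ScheduledSet.new.size}"
puts "retry jobs:         #{Sidekiq::RetrySet.new.size}"
puts "dead jobs:          #{Sidekiq::DeadSet.new.size}"

# Metadata that is re-populated when Sidekiq processes restart.
puts "sidekiq processes:  #{Sidekiq::ProcessSet.new.size}"
```
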
### Performance, Scalability and Capacity Planning
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to any performance validation that was done according to [performance guidelines](https://docs.gitlab.com/ee/development/performance.html).
No performance tests were done, as there are no significant changes to the architecture components. A standard Sidekiq job passes through the same
set of components (Rails -> Redis -> Sidekiq). The only additional computation is an extra hash look-up and the use of a different Redis client for the job push to Redis.
- [ ] Link to any load testing plans and results.
No load testing was done for this, as the Sidekiq load is not expected to change (see above).
- [ ] Are there any potential performance impacts on the Postgres database or Redis when this feature is enabled at GitLab.com scale?
We expect the existing `redis-sidekiq` instance to experience a drop in primary CPU utilization as the load is shared with `redis-sidekiq-catchall-a`.
We do not expect any performance impact on the Postgres database, as the volume of Sidekiq jobs remains unchanged. This architectural change does not introduce extra load.
- [ ] Explain how this feature uses our [rate limiting](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/rate-limiting) features.
Sidekiq already uses the existing rate-limiting features; sharding Sidekiq does not change that behaviour.
- [ ] Are there retry and back-off strategies for external dependencies?
There are no changes to the retry and back-off strategies already in place; Sidekiq's built-in retry behaviour continues to apply unchanged (see the sketch after this list).
- [ ] Does the feature account for brief spikes in traffic, at least 2x above the expected rate?
Yes. This feature allows Sidekiq throughput (enqueue and dequeue) to better handle brief spikes in traffic. Sharding also bulkheads the `default` and `mailers` queues from the rest of the workload.
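
To illustrate the unchanged retry behaviour mentioned above, Sidekiq's standard per-worker retry controls look like the following. The worker name and values are hypothetical, not GitLab settings, and sharding does not alter how these options behave.

```ruby
# Hypothetical worker showing Sidekiq's built-in retry/back-off options.
class ExampleResilientWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5 # cap retries for this worker (Sidekiq's default is 25)

  # Optional custom back-off: wait longer after each failed attempt (seconds).
  sidekiq_retry_in { |count| 30 * (count + 1) }

  def perform(payload_id)
    # ... call the external dependency; a raised error triggers a retry ...
  end
end
```
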
### Deployment
_The items below will be reviewed by the Delivery team._
- [ ] Will a [change management issue](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/) be used for rollout? If so, link to it here.
The gstg change issue is https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17779. The gprd change issue is (TBD).
- [ ] Are there healthchecks or SLIs that can be relied on for deployment/rollbacks?
As the feature release will be done through a feature flag, no deployment or rollback is required. For a change rollback (disabling the feature flag), we can rely on
the default Sidekiq SLIs and apdex to determine whether the change needs to be halted and rolled back.
However, an apdex drop is unlikely, since both `catchall` and the temporary Sidekiq deployment will be present to process jobs. We would instead need to track
Sidekiq job completion rates using the [dashboard](https://dashboards.gitlab.net/d/sidekiq-main/sidekiq3a-overview?orgId=1&from=now-1h&to=now&viewPanel=160) to ensure that the migration
performs as expected.
- [ ] Does building artifacts or deployment depend at all on [gitlab.com](https://gitlab.com)?
In general, deployment depends on gitlab.com, as it uses the `k8s-workloads/gitlab-com` project and GitLab CI/CD to apply the Helm charts and deploy newer revisions. The migration rollout itself
is performed entirely through feature flags, which depend on ChatOps and GitLab Rails to perform the relevant updates (an illustrative sketch follows).
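
As an illustration of the feature-flag-driven rollout, toggling a flag goes through ChatOps (`/chatops run feature set <flag> true`) or a GitLab Rails console. The flag name below is a placeholder, not the real flag used for this migration.

```ruby
# In a GitLab Rails console; :example_sidekiq_routing_flag is a placeholder name.
Feature.enable(:example_sidekiq_routing_flag)   # start routing matching jobs to the new shard
Feature.enabled?(:example_sidekiq_routing_flag) # check the current state
Feature.disable(:example_sidekiq_routing_flag)  # roll back: jobs go to redis-sidekiq again
```
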