## Sharding Sidekiq for horizontal scaling
Sidekiq is a service that monitors a queue for work. When a job is placed on a queue, Sidekiq instantiates the corresponding worker class to process it. Work is placed on the queue by GitLab cron jobs and by user activity on GitLab. GitLab uses a single Redis instance as the queue. Queues are divided
amongst a large fleet of servers, each with a specific configuration.
This configuration includes which queues that set of servers listens to, how
many Sidekiq worker processes run, and the concurrency level of each worker.
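
For illustration only, a minimal worker sketch shows how a job class declares the queue it is pushed to and how enqueuing writes the job to Redis. The class name, queue assignment, and arguments below are hypothetical and not part of the GitLab codebase.

```ruby
# Illustrative worker using the standard Sidekiq API; not GitLab code.
class ExampleMailerWorker
  include Sidekiq::Worker
  sidekiq_options queue: :mailers # the Redis list this job is pushed onto

  def perform(user_id)
    # ... deliver the mail for user_id ...
  end
end

# Enqueuing serializes the job and pushes it onto queue:mailers in Redis;
# a Sidekiq process configured to listen on that queue picks it up.
ExampleMailerWorker.perform_async(42)
```
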
Our goal is to horizontally scale Sidekiq through an application-layer router that routes jobs to a newly provisioned Redis instance, `redis-sidekiq-catchall-a`. The `catchall` Kubernetes deployment
will poll the new Redis instance for jobs in the `queue:default` and `queue:mailers` queues.
This `redis-sidekiq-catchall-a` is deployed with the labels `type=redis-sidekiq` and `shard=catchall_a` as part of the
existing [RedisSidekiq service](https://gitlab.com/gitlab-com/runbooks/-/blob/master/metrics-catalog/services/redis-sidekiq.jsonnet).
The work can be tracked in [Scalability epic 1218](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1218).
![sidekiq-sharding-architecture](img/sidekiq-sharding.png)
The diagram above represents the target state after the workload migration for catchall to the new `redis-sidekiq-catchall-a`.
In a sharded state, the `catchall` K8s deployment polls from a separate Redis compared to the rest of the Sidekiq K8s deployments.
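
The sketch below illustrates the general shape of queue-based routing at enqueue time, using Sidekiq's `Sidekiq::Client.via` block API for pushing to an alternate Redis pool. It is a simplified assumption rather than GitLab's actual router: the connection-pool construction, environment variable name, queue-to-shard map, and helper method are all hypothetical, and the real router is additionally gated behind feature flags.

```ruby
require "sidekiq"

# Hypothetical connection pool for the new shard; the env var name is a placeholder.
CATCHALL_A_POOL = Sidekiq::RedisConnection.create(url: ENV["REDIS_SIDEKIQ_CATCHALL_A_URL"])

# Hypothetical queue-to-shard map: only default and mailers move to the new shard.
QUEUE_TO_SHARD_POOL = {
  "default" => CATCHALL_A_POOL,
  "mailers" => CATCHALL_A_POOL
}.freeze

# Push a job, redirecting it to the Redis shard that owns its queue.
def push_routed(job_class_name, args, queue)
  payload = { "class" => job_class_name, "args" => args, "queue" => queue }
  pool = QUEUE_TO_SHARD_POOL[queue]

  if pool
    # Sidekiq::Client.via makes every push inside the block use the given pool.
    Sidekiq::Client.via(pool) { Sidekiq::Client.push(payload) }
  else
    Sidekiq::Client.push(payload) # falls through to the existing redis-sidekiq
  end
end
```

The key design point is that the routing decision is made entirely in the application layer at push time; the Sidekiq server processes simply poll whichever Redis instance they are configured against.
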
### Monitoring and Alerting
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to the troubleshooting runbooks.
The [Sidekiq survival guide for SREs](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/sidekiq/sidekiq-survival-guide-for-sres.md) is the most useful troubleshooting guide for most Sidekiq issues.
For shard-specific troubleshooting, refer to the [sharding guide](TBD).
- [ ] Link to an example of an alert and a corresponding runbook.
There are no new alerts. As the new Redis is monitored as a shard of the `redis-sidekiq` service, the related alerts will have the `shard=catchall_a` label.
- [ ] Confirm that on-call SREs have access to this service and will be on-call. If this is not the case, please add an explanation here.
Yes, the new Redis and the Sidekiq K8s deployments are accessible via SSH for all SREs.
### Operational Risk
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to notes or testing results for assessing the outcome of failures of individual components.
We used a two-phase rollout: we migrated a [subset of workers](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17779), rolled back, and then [migrated the entire Sidekiq shard](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17841).
The metrics we tracked (detailed in the rollout links) surfaced non-critical bugs, which have since been resolved.
- [ ] What are the potential scalability or performance issues that may result with this change?
This change promotes horizontal scalability of Sidekiq by sharding the workload across multiple Redis instances. There should be no performance issues on Sidekiq as a result of this change.
- [ ] What are a few operational concerns that will not be present at launch, but may be a concern later?
One concern could be feature flag state corruption through a bug or a mis-toggle. This could happen in the future and result in jobs being directed back to `redis-sidekiq`.
- [ ] Are there any single points of failure in the design? If so list them here.
The `redis-sidekiq-catchall-a` instance can be considered a single point of failure: if the majority of its VMs were destroyed or down for any reason, Sidekiq jobs for `default` and `mailers` could not
be enqueued correctly. However, we already carry this risk today, since all Sidekiq workloads are currently served from `redis-sidekiq`.
- [ ] As a thought experiment, think of worst-case failure scenarios for this product feature, how can the blast-radius of the failure be isolated?
The worst-case failure scenario would be incorrect routing logic, causing jobs to be enqueued but not picked up by any Sidekiq workers.
The blast radius of such a failure is already isolated to 2 of the 10 queues; however, these 2 queues account for ~50% of the load.
This can be resolved by using a temporary deployment, as outlined in the [troubleshooting guide](TBD), to process the dangling jobs, or by performing a one-time job migration
across instances once the bug has been resolved (a rough sketch of such a migration follows this list).
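
As a rough sketch only, and not the actual runbook tooling, a one-time migration of dangling jobs between Redis instances could look like the following. The environment variable names are placeholders, and any real migration would also need to handle jobs still being enqueued concurrently.

```ruby
require "redis"

# Placeholder connection settings; not the production configuration.
source = Redis.new(url: ENV["OLD_REDIS_SIDEKIQ_URL"])
dest   = Redis.new(url: ENV["REDIS_SIDEKIQ_CATCHALL_A_URL"])

%w[queue:default queue:mailers].each do |queue|
  moved = 0
  # Sidekiq enqueues with LPUSH and dequeues from the opposite end, so popping
  # with RPOP takes the oldest job first; LPUSH on the destination preserves
  # the relative order of the migrated jobs.
  while (job = source.rpop(queue))
    dest.lpush(queue, job)
    moved += 1
  end
  puts "moved #{moved} jobs from #{queue}"
end
```
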
### Backup, Restore, DR and Retention
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Are there any special requirements for Disaster Recovery for both Regional and Zone failures beyond our current Disaster Recovery processes that are in place?
No special requirements.
- [ ] How does data age? Can data over a certain age be deleted?
The data in `redis-sidekiq-catchall-a` is fairly transient: it consists of newly enqueued jobs, scheduled jobs (which will be removed in the near future), and retry/dead jobs, which can
persist for a longer period of time. However, in the context of gitlab.com, we do not act on dead jobs using the `/admin/sidekiq` page.
The data that remains static is not critical to operations. Such data includes metrics, Sidekiq process metadata, and cron metadata, all of which are re-populated on a Sidekiq deployment restart or fresh deployment (see the inspection sketch after this list).
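
For context, these transient and static data sets can be inspected with Sidekiq's standard admin API from a console whose Sidekiq configuration points at the shard in question. This is an illustrative, read-only sketch rather than a runbook step.

```ruby
require "sidekiq/api"

# Transient job data: queued, scheduled, retry, and dead jobs.
puts "default queue size: #{Sidekiq::Queue.new('default').size}"
puts "mailers queue size: #{Sidekiq::Queue.new('mailers').size}"
puts "scheduled jobs:     #{Sidekiq::ScheduledSet.new.size}"
puts "retry jobs:         #{Sidekiq::RetrySet.new.size}"
puts "dead jobs:          #{Sidekiq::DeadSet.new.size}"

# Metadata that is re-populated when Sidekiq processes restart.
puts "sidekiq processes:  #{Sidekiq::ProcessSet.new.size}"
```
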
### Performance, Scalability and Capacity Planning
_The items below will be reviewed by the Scalability:Practices team._
- [ ] Link to any performance validation that was done according to [performance guidelines](https://docs.gitlab.com/ee/development/performance.html).
No performance tests were done, as there are no significant changes to the architecture components. A standard Sidekiq job passes through the same
set of components (Rails -> Redis -> Sidekiq). The only additional computation is an extra hash look-up and the use of a different Redis client for the job push to Redis.
- [ ] Link to any load testing plans and results.
No load testing was done for this, as the Sidekiq load is not expected to change (see above).
- [ ] Are there any potential performance impacts on the Postgres database or Redis when this feature is enabled at GitLab.com scale?
We expect the existing `redis-sidekiq` instance to experience a drop in primary CPU utilization as the load is shared with `redis-sidekiq-catchall-a`.
We do not expect any performance impact on the Postgres database, as the volume of Sidekiq jobs remains unchanged. This architectural change does not introduce extra load.
- [ ] Explain how this feature uses our [rate limiting](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/rate-limiting) features.
Sidekiq already uses the existing rate-limiting features; sharding Sidekiq does not change that behaviour.
- [ ] Are there retry and back-off strategies for external dependencies?
There are no changes to the retry and back-off strategies already in place; Sidekiq's built-in retry behaviour continues to apply unchanged (see the sketch after this list).
- [ ] Does the feature account for brief spikes in traffic, at least 2x above the expected rate?
Yes. This feature allows Sidekiq throughput (enqueue and dequeue) to better handle brief spikes in traffic. Sharding also bulkheads the `default` and `mailers` queues from the rest of the workload.
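
To illustrate the unchanged retry behaviour mentioned above, Sidekiq's standard per-worker retry controls look like the following. The worker name and values are hypothetical, not GitLab settings, and sharding does not alter how these options behave.

```ruby
# Hypothetical worker showing Sidekiq's built-in retry/back-off options.
class ExampleResilientWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5 # cap retries for this worker (Sidekiq's default is 25)

  # Optional custom back-off: wait longer after each failed attempt (seconds).
  sidekiq_retry_in { |count| 30 * (count + 1) }

  def perform(payload_id)
    # ... call the external dependency; a raised error triggers a retry ...
  end
end
```
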
### Deployment
_The items below will be reviewed by the Delivery team._
- [ ] Will a [change management issue](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/) be used for rollout? If so, link to it here.
The gstg change issue is https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17779. The gprd change issue is (TBD).
- [ ] Are there healthchecks or SLIs that can be relied on for deployment/rollbacks?
As the feature release will be done through a feature flag, no deployment or rollback is required. For a change rollback (disabling the feature flag), we can rely on
the default Sidekiq SLIs and apdex to determine whether the change needs to be halted and rolled back.
However, an apdex drop is unlikely, since both `catchall` and the temporary Sidekiq deployment will be present to process jobs. We would instead need to track
Sidekiq job completion rates using the [dashboard](https://dashboards.gitlab.net/d/sidekiq-main/sidekiq3a-overview?orgId=1&from=now-1h&to=now&viewPanel=160) to ensure that the migration
performs as expected.
- [ ] Does building artifacts or deployment depend at all on [gitlab.com](https://gitlab.com)?
In general, deployment depends on gitlab.com, as it uses the `k8s-workloads/gitlab-com` project and GitLab CI/CD to apply the Helm charts and deploy newer revisions. The migration rollout itself
is performed entirely through feature flags, which depend on ChatOps and GitLab Rails to perform the relevant updates (an illustrative sketch follows).
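
As an illustration of the feature-flag-driven rollout, toggling a flag goes through ChatOps (`/chatops run feature set <flag> true`) or a GitLab Rails console. The flag name below is a placeholder, not the real flag used for this migration.

```ruby
# In a GitLab Rails console; :example_sidekiq_routing_flag is a placeholder name.
Feature.enable(:example_sidekiq_routing_flag)   # start routing matching jobs to the new shard
Feature.enabled?(:example_sidekiq_routing_flag) # check the current state
Feature.disable(:example_sidekiq_routing_flag)  # roll back: jobs go to redis-sidekiq again
```
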