[Gstg] Roll out queue-per-shard to workers of shard `elasticsearch`
## Production Change

### Change Summary

Please read scalability#1136 for more information. This is a change issue to route all jobs of workers in the `elasticsearch` shard to the `elasticsearch` queue on Staging.

### Change Details

- Services Impacted - ~"Service::Sidekiq" ~"Service::API" ~"Service::Web" ~"Service::Git"
- Change Technician - @cmiskell / @qmnguyen0711
- Change Reviewer - @cmiskell / @qmnguyen0711
- Time tracking - 2hr
- Downtime Component - No downtime
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

*Estimated Time to Complete (10 mins)*

- [ ] Set label ~"change::in-progress" on this issue
- [ ] Get review and approval on:
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1144 (merged) - reconfigure queues being listened to by elasticsearch workloads in k8s
  - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/178 - reconfigure routing of jobs from VMs
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1145 (merged) - reconfigure routing of jobs from k8s
-
### Change Steps - steps to take to execute the change

*Estimated Time to Complete (80 mins)*

- [ ] Merge and apply the changes in gitlab-com/gl-infra/k8s-workloads/gitlab-com!1144 (merged)
- [ ] In parallel, merge and wait for the apply of:
  - [ ] https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/178. For time efficiency (but only in gstg): `knife ssh -C1 'roles:gstg-base-fe-web' "sudo chef-client"`
  - [ ] gitlab-com/gl-infra/k8s-workloads/gitlab-com!1145 (merged)
- [ ] Migrate all scheduled-in-the-future jobs from the old queues to the new one. On the console node, run: `sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry`
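Conceptually, the migration rake tasks walk the jobs waiting in the scheduled and retry sets and rewrite their queue to match the new routing. A minimal sketch of that routing rule in plain Python (the worker names and job dicts here are illustrative only, not the actual GitLab implementation):

```python
# Illustrative sketch of the per-shard routing applied by the migration:
# jobs of workers in the elasticsearch shard move from their per-worker
# queue to the shared shard queue; everything else is left alone.
# The worker set below is an assumption for illustration.
ELASTICSEARCH_SHARD_WORKERS = {
    "ElasticCommitIndexerWorker",
    "ElasticIndexBulkCronWorker",
}

def route_queue(job):
    """Return the queue a job should live in after the rollout."""
    if job["class"] in ELASTICSEARCH_SHARD_WORKERS:
        return "elasticsearch"
    return job["queue"]  # non-elasticsearch workers keep their queue

def migrate(jobs):
    """Rewrite the queue field in place, as the rake tasks do for the
    scheduled and retry sets."""
    for job in jobs:
        job["queue"] = route_queue(job)
    return jobs

jobs = [
    {"class": "ElasticCommitIndexerWorker", "queue": "elastic_commit_indexer"},
    {"class": "PostReceive", "queue": "post_receive"},
]
migrated = migrate(jobs)
```

The real tasks operate on the Redis sorted sets backing Sidekiq's scheduled and retry queues, so no job is lost while the routing changes over.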
### Post-Change Steps - steps to take to verify the change

*Estimated Time to Complete (30 mins)*

- [ ] As elasticsearch is a busy shard, we just need to wait for a while and observe the shard, queue, and worker logs in Kibana: https://nonprod-log.gitlab.net/goto/2721b6c52bee324d3dfeeca5fd62f247. If there are no logs, try editing something, such as an issue title, and wait for the reindex event. The expected result is that `json.queue` changes from a per-worker name to `elasticsearch`.
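The same check can be spot-checked on a raw structured log line outside Kibana. A minimal sketch (the sample record is made up; in Kibana these fields surface with a `json.` prefix, e.g. `json.queue`):

```python
import json

# Illustrative Sidekiq structured log line, fields abbreviated.
log_line = (
    '{"class": "ElasticCommitIndexerWorker",'
    ' "queue": "elasticsearch", "job_status": "done"}'
)

record = json.loads(log_line)

# After the rollout, jobs of elasticsearch-shard workers should report
# the shared shard queue rather than a per-worker queue name.
assert record["queue"] == "elasticsearch", record["queue"]
print("queue ok:", record["queue"])
```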
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

*Estimated Time to Complete (90 mins)*

- [ ] Rollback gitlab-com/gl-infra/k8s-workloads/gitlab-com!1145 (merged)
- [ ] Rollback https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/178
- [ ] Rollback gitlab-com/gl-infra/k8s-workloads/gitlab-com!1144 (merged)
- [ ] On the console node, run `sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry`
## Monitoring

### Key metrics to observe

After the changes are applied, we should verify that the `queue` field in the Kibana logs shows `elasticsearch` for all aforementioned workers, and that there is no abnormal behavior in the jobs afterward.

- Metric: Job completion rate per queue. After the change, we should observe the job completion rate in per-worker queues drain to 0, replaced by the `elasticsearch` queue completion rate.
  - Location: https://thanos.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=sum%20by%20(queue)%20(increase(sidekiq_jobs_completion_seconds_count%7Benvironment%3D%22gstg%22%2C%20shard%3D%22elasticsearch%22%7D%5B5m%5D))&g0.tab=0
  - Prompt to roll back: the metrics stay the same, or all queues drain to 0; either indicates something is wrong with the Sidekiq client-side routing configuration.
- Metric: Job completion log per worker. This is the same as the metric above, but from the log perspective.
- Metric: Queue length.
  - Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?viewPanel=3&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&var-shard=elasticsearch&from=now-1h&to=now
  - Prompt to roll back: the queue length should eventually return to zero. If the length of the per-worker queues goes up, the Sidekiq servers do not cover enough queues and backward compatibility is broken. It is expected that the length of the per-shard queue goes up for a while and then returns to normal; if it does not, the Sidekiq servers are not listening to the per-shard queue. For elasticsearch in particular, the usage pattern can result in short bursts of queuing, but they are typically <5 minutes from initial growth to return to 0. The problem to look out for is unconstrained growth.
- Metric: Sidekiq error rate.
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=389908901&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - Prompt to roll back: the migration should not affect the error rate. If this metric goes up abnormally, the Sidekiq servers are failing to handle jobs and it is time to roll back.
- Metric: Redis-sidekiq CPU saturation.
  - Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?viewPanel=1217942947&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
  - Prompt to roll back: this migration should not affect CPU saturation. If CPU saturation goes up and the job completion rate is impaired, that is a bad sign.
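For reference, decoded from the Thanos Location link above, the job-completion-rate expression is:

```promql
sum by (queue) (
  increase(sidekiq_jobs_completion_seconds_count{environment="gstg", shard="elasticsearch"}[5m])
)
```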
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
## Changes checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.