[Gstg] Roll out queue-per-shard to workers of shard `elasticsearch`
## Production Change

### Change Summary

Please read scalability#1136 for more information. This is a change issue to route all jobs of workers in the `elasticsearch` shard to the `elasticsearch` queue on Staging.

### Change Details

- Services Impacted - ~"Service::Sidekiq" ~"Service::API" ~"Service::Web" ~"Service::Git"
- Change Technician - @cmiskell / @qmnguyen0711
- Change Reviewer - @cmiskell / @qmnguyen0711
- Time tracking - 2hr
- Downtime Component - No downtime
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

*Estimated Time to Complete (10 mins)*

- [ ] Set label ~"change::in-progress" on this issue
- [ ] Get review and approval on:
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1144 (merged) - reconfigure queues being listened to by elasticsearch workloads in k8s
  - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/178 - reconfigure routing of jobs from VMs
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1145 (merged) - reconfigure routing of jobs from k8s
-
### Change Steps - steps to take to execute the change

*Estimated Time to Complete (80 mins)*

- [ ] Merge and apply the changes in gitlab-com/gl-infra/k8s-workloads/gitlab-com!1144 (merged)
- [ ] In parallel, merge and wait for the apply of:
  - [ ] https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/178. For time efficiency (but only in gstg): `knife ssh -C1 'roles:gstg-base-fe-web' "sudo chef-client"`
  - [ ] gitlab-com/gl-infra/k8s-workloads/gitlab-com!1145 (merged)
- [ ] Migrate all scheduled-in-the-future jobs from the old queues to the new one. On the console node, run: `sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry`
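Conceptually, the migration rake tasks walk the jobs waiting in the scheduled and retry sets and rewrite their queue to match the new routing. A minimal sketch of that routing rule in plain Python (the worker names and job dicts here are illustrative only, not the actual GitLab implementation):

```python
# Illustrative sketch of the per-shard routing applied by the migration:
# jobs of workers in the elasticsearch shard move from their per-worker
# queue to the shared shard queue; everything else is left alone.
# The worker set below is an assumption for illustration.
ELASTICSEARCH_SHARD_WORKERS = {
    "ElasticCommitIndexerWorker",
    "ElasticIndexBulkCronWorker",
}

def route_queue(job):
    """Return the queue a job should live in after the rollout."""
    if job["class"] in ELASTICSEARCH_SHARD_WORKERS:
        return "elasticsearch"
    return job["queue"]  # non-elasticsearch workers keep their queue

def migrate(jobs):
    """Rewrite the queue field in place, as the rake tasks do for the
    scheduled and retry sets."""
    for job in jobs:
        job["queue"] = route_queue(job)
    return jobs

jobs = [
    {"class": "ElasticCommitIndexerWorker", "queue": "elastic_commit_indexer"},
    {"class": "PostReceive", "queue": "post_receive"},
]
migrated = migrate(jobs)
```

The real tasks operate on the Redis sorted sets backing Sidekiq's scheduled and retry queues, so no job is lost while the routing changes over.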
### Post-Change Steps - steps to take to verify the change

*Estimated Time to Complete (30 mins)*

- [ ] As elasticsearch is a busy shard, we just need to wait for a while and observe the shard, queue, and worker logs in Kibana: https://nonprod-log.gitlab.net/goto/2721b6c52bee324d3dfeeca5fd62f247. If there are no logs, try editing something, such as an issue title, and wait for the reindex event. The expected result is that `json.queue` changes from a per-worker name to `elasticsearch`.
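The same check can be spot-checked on a raw structured log line outside Kibana. A minimal sketch (the sample record is made up; in Kibana these fields surface with a `json.` prefix, e.g. `json.queue`):

```python
import json

# Illustrative Sidekiq structured log line, fields abbreviated.
log_line = (
    '{"class": "ElasticCommitIndexerWorker",'
    ' "queue": "elasticsearch", "job_status": "done"}'
)

record = json.loads(log_line)

# After the rollout, jobs of elasticsearch-shard workers should report
# the shared shard queue rather than a per-worker queue name.
assert record["queue"] == "elasticsearch", record["queue"]
print("queue ok:", record["queue"])
```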
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

*Estimated Time to Complete (90 mins)*

- [ ] Rollback gitlab-com/gl-infra/k8s-workloads/gitlab-com!1145 (merged)
- [ ] Rollback https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/178
- [ ] Rollback gitlab-com/gl-infra/k8s-workloads/gitlab-com!1144 (merged)
- [ ] On the console node, run `sudo gitlab-rake gitlab:sidekiq:migrate_jobs:schedule gitlab:sidekiq:migrate_jobs:retry`
## Monitoring

### Key metrics to observe

After the changes are applied, we should verify that the `queue` field in the Kibana logs shows `elasticsearch` for all aforementioned workers, and that there is no abnormal behavior in the jobs afterward.

- Metric: Job completion rate per queue. After the change, we should observe the job completion rate in per-worker queues drain to 0, replaced by the `elasticsearch` queue completion rate.
  - Location: https://thanos.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=sum%20by%20(queue)%20(increase(sidekiq_jobs_completion_seconds_count%7Benvironment%3D%22gstg%22%2C%20shard%3D%22elasticsearch%22%7D%5B5m%5D))&g0.tab=0
  - Prompt to roll back: the metrics stay the same, or all queues drain to 0; either indicates something is wrong with the Sidekiq client-side routing configuration.
- Metric: Job completion log per worker. This is the same as the metric above, but from the log perspective.
- Metric: Queue length.
  - Location: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?viewPanel=3&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main&var-shard=elasticsearch&from=now-1h&to=now
  - Prompt to roll back: the queue length should eventually return to zero. If the length of the per-worker queues goes up, the Sidekiq servers do not cover enough queues and backward compatibility is broken. It is expected that the length of the per-shard queue goes up for a while and then returns to normal; if it does not, the Sidekiq servers are not listening to the per-shard queue. For elasticsearch in particular, the usage pattern can result in short bursts of queuing, but they are typically <5 minutes from initial growth to return to 0. The problem to look out for is unconstrained growth.
- Metric: Sidekiq error rate.
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?viewPanel=389908901&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - Prompt to roll back: the migration should not affect the error rate. If this metric goes up abnormally, the Sidekiq servers are failing to handle jobs and it is time to roll back.
- Metric: Redis-sidekiq CPU saturation.
  - Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?viewPanel=1217942947&orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
  - Prompt to roll back: this migration should not affect CPU saturation. If CPU saturation goes up and the job completion rate is impaired, that is a bad sign.
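For reference, decoded from the Thanos Location link above, the job-completion-rate expression is:

```promql
sum by (queue) (
  increase(sidekiq_jobs_completion_seconds_count{environment="gstg", shard="elasticsearch"}[5m])
)
```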
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
## Changes checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.