Upgrade gitlab_exporter on Redis Sidekiq in production
Production Change
Change Summary
To support observability while running one queue per shard in Sidekiq (&447 (closed)), we need to upgrade gitlab-exporter to the latest version and add an additional probe.
Change Details
- Services Impacted - Service::Redis, Service::Sidekiq
- Change Technician - @msmiley / @cmiskell
- Change Reviewer - @msmiley / @cmiskell
- Time tracking - 2h
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 2m
- Confirm the backward-compatible dashboard and alert updates have been merged and deployed: gitlab-com/runbooks!3653 (merged)
- Obtain approval on: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/109
- Set label change::in-progress on this issue
- Add a silence for SidekiqSchedulingLatencyTooHigh
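The silence in the last pre-change step can be added through the Alertmanager UI, or scripted against the Alertmanager v2 silences API. A hedged sketch follows; the 2-hour window, `createdBy` value, and Alertmanager URL are assumptions, not values from this change:

```shell
# Build a silence payload for SidekiqSchedulingLatencyTooHigh covering the
# change window (2h here to leave slack around the ~45m change; adjust as needed).
START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)
PAYLOAD=$(cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "SidekiqSchedulingLatencyTooHigh", "isRegex": false}
  ],
  "startsAt": "$START",
  "endsAt": "$END",
  "createdBy": "change technician",
  "comment": "gitlab-exporter upgrade on Redis Sidekiq (#4935)"
}
EOF
)
echo "$PAYLOAD"
# POST it to Alertmanager (URL below is a placeholder assumption):
# curl -s -XPOST -H 'Content-Type: application/json' \
#   -d "$PAYLOAD" https://alertmanager.example.com/api/v2/silences
```

Remember to expire the silence once post-change monitoring is complete.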
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 45 minutes
- Disable chef on the Redis Sidekiq nodes: knife ssh roles:gprd-base-db-redis-server-sidekiq 'sudo chef-client-disable "Production upgrade of gitlab-exporter: #4935"'
- Merge the dashboards/metrics change: gitlab-com/runbooks!3653 (merged)
- Delete the older gitlab-monitor.gemspec file on the relevant Redis nodes: knife ssh roles:gprd-base-db-redis-server-sidekiq "sudo rm /opt/gitlab-monitor/gitlab-monitor.gemspec"
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/109
- Run chef on the Redis nodes in a controlled manner. To validate the performance impact, do it on a replica first, before the primary:
  - Identify the current primary.
  - Re-enable and run chef on one of the other 3 nodes in the cluster (a replica): sudo chef-client-enable && sudo chef-client
  - Monitor the performance of the replica over 10 minutes. It will be querying/reporting the same data as the primary, so it should cause the same amount of load and have the same timings. Check:
    - https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1, in particular the "Redis CPU per Node - Replicas" panel, but eyeball the others for anomalies.
    - Slow logs: https://log.gprd.gitlab.net/goto/c528546500306bb25558120090484be2. The EVALSHA invocation of the Lua script by probe_jobs may show up here. Acceptable times are debatable: up to 50ms (0.05s) is big but OK, up to 100ms is tolerable, but above that is dangerous territory. The script is only called once per scrape, but any long runtimes here impact all job scheduling, and pauses can have odd reverberating effects on the rest of the system. If the data is unclear or variable, the scrape can be invoked manually a few times with curl -v http://localhost:4567/sidekiq on the node itself to get more data.
  - If the impact on the replica is acceptable, enable and run chef on the other replica, and then the primary.
  - After the primary is upgraded, continue to monitor the metrics and logs as for the replica, for about 15 minutes, paying close attention to things like Redis primary CPU saturation.
- If there is any unsustainable impact, disabling probe_jobs is the first step, then downgrading if absolutely necessary. See rollback steps.
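When the slow-log data is ambiguous, the manual curl invocation in the steps above can be wrapped in a small timing loop. A sketch, using the same endpoint as above; the sample count and summary format are my additions:

```shell
# Sample the gitlab-exporter /sidekiq probe 5 times from the Redis node itself
# and summarise request latency (curl's %{time_total} is reported in seconds).
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w '%{time_total}\n' http://localhost:4567/sidekiq
done | awk '
  { if (NR == 1 || $1 < min) min = $1
    if ($1 > max) max = $1
    sum += $1 }
  END { printf "min=%.3fs max=%.3fs avg=%.3fs\n", min, max, sum/NR }'
```

Against the thresholds above: an average comfortably under 0.05s is fine, and sustained values approaching 0.1s are a signal to disable probe_jobs.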
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10 minutes
Disabling probe_jobs
If only the new Lua script is problematic, we can simply disable probe_jobs:
- If time is critical (impact is high and unsustainable), remove the probe_jobs line from /opt/gitlab-monitor/config/redis-config.yml and restart gitlab-monitor (sudo sv restart gitlab-monitor). Focus on the Redis primary first.
- Remove probe_jobs from roles/gprd-base-db-redis-server-sidekiq.json in chef (via an MR).
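For orientation, disabling the probe amounts to dropping probe_jobs from the sidekiq probe's method list in redis-config.yml. The fragment below is an illustrative sketch of gitlab-exporter's probe-config shape, not a verbatim copy of the production file; the other method names and the redis_url value are assumptions:

```yaml
# /opt/gitlab-monitor/config/redis-config.yml (illustrative sketch only)
probes:
  sidekiq:
    methods:
      - probe_stats
      - probe_queues
      # - probe_jobs   # removed to disable the new Lua-based probe
    opts:
      redis_url: "redis://localhost:6379"
```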
Rollback to gitlab-monitor
If disabling probe_jobs is not enough and a full rollback is required:
- Revert and apply the changes in https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/109. It may also be necessary to manually delete /opt/gitlab-monitor/gitlab-exporter.gemspec on the relevant nodes before chef applies.
Monitoring
Key metrics to observe
- Metric: Redis Primary CPU component saturation (yellow line, Saturation graph, top right)
  - Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1
  - What changes to this metric should prompt a rollback: more than a 2% increase (basically: any visible consistent increase).
- Metric: Slow log reports for EVALSHA
  - Location: https://log.gprd.gitlab.net/goto/c528546500306bb25558120090484be2
  - What changes to this metric should prompt a rollback: entries showing the Lua evaluation regularly taking > 0.1s (100ms).
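If cross-checking the slow log directly with redis-cli SLOWLOG GET on a node (rather than via the Kibana link above), note that Redis reports slow-log durations in microseconds, so the 0.1s rollback threshold translates as:

```shell
# 0.1s (100ms) expressed in microseconds, the unit used by Redis SLOWLOG.
awk 'BEGIN { printf "%d\n", 0.1 * 1000000 }'
```

That is, SLOWLOG entries for the EVALSHA call at or above roughly 100000 microseconds meet the rollback criterion.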
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Matt Smiley