Upgrade gitlab_exporter on Redis Sidekiq in production
Production Change
Change Summary
To support observability while running one queue per shard in Sidekiq (&447 (closed)), we need to upgrade gitlab-exporter to the latest version and add an additional probe.
Change Details
- Services Impacted - Service::Redis, Service::Sidekiq
- Change Technician - @msmiley / @cmiskell
- Change Reviewer - @msmiley / @cmiskell
- Time tracking - 2h
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 2m
- Confirm the backward-compatible dashboard and alert updates have been merged and deployed: gitlab-com/runbooks!3653 (merged)
- Obtain approval on: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/109
- Set label change::in-progress on this issue
- Add a silence for SidekiqSchedulingLatencyTooHigh
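The silence in the last pre-change step can be added through the Alertmanager UI, or scripted against the Alertmanager v2 silences API. A hedged sketch follows; the 2-hour window, `createdBy` value, and Alertmanager URL are assumptions, not values from this change:

```shell
# Build a silence payload for SidekiqSchedulingLatencyTooHigh covering the
# change window (2h here to leave slack around the ~45m change; adjust as needed).
START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)
PAYLOAD=$(cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "SidekiqSchedulingLatencyTooHigh", "isRegex": false}
  ],
  "startsAt": "$START",
  "endsAt": "$END",
  "createdBy": "change technician",
  "comment": "gitlab-exporter upgrade on Redis Sidekiq (#4935)"
}
EOF
)
echo "$PAYLOAD"
# POST it to Alertmanager (URL below is a placeholder assumption):
# curl -s -XPOST -H 'Content-Type: application/json' \
#   -d "$PAYLOAD" https://alertmanager.example.com/api/v2/silences
```

Remember to expire the silence once post-change monitoring is complete.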
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 45 minutes
- Disable chef on the Redis Sidekiq nodes: knife ssh roles:gprd-base-db-redis-server-sidekiq 'sudo chef-client-disable "Production upgrade of gitlab-exporter: #4935"'
- Merge the dashboards/metrics change: gitlab-com/runbooks!3653 (merged)
- Delete the older gitlab-monitor.gemspec file on the relevant Redis nodes: knife ssh roles:gprd-base-db-redis-server-sidekiq "sudo rm /opt/gitlab-monitor/gitlab-monitor.gemspec"
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/109
- Run chef on the Redis nodes in a controlled manner. To validate the performance impact, do it on a replica first, before the primary:
  - Identify the current primary.
  - Re-enable and run chef on one of the other 3 nodes in the cluster (a replica): sudo chef-client-enable && sudo chef-client
  - Monitor the performance of the replica over 10 minutes. It will be querying/reporting the same data as the primary, so it should cause the same amount of load and have the same timings. Check:
    - https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1, in particular the "Redis CPU per Node - Replicas" panel, but eyeball the others for anomalies.
    - Slow logs: https://log.gprd.gitlab.net/goto/c528546500306bb25558120090484be2. The EVALSHA invocation of the Lua script by probe_jobs may show up here. Acceptable times are debatable: up to 50ms (0.05s) is big but OK, up to 100ms is tolerable, but above that is dangerous territory. The script is only called once per scrape, but any long runtimes here impact all job scheduling, and pauses can have odd reverberating effects on the rest of the system. If the data is unclear or variable, the scrape can be invoked manually a few times with curl -v http://localhost:4567/sidekiq on the node itself to get more data.
  - If the impact on the replica is acceptable, enable and run chef on the other replica, and then the primary.
  - After the primary is upgraded, continue to monitor the metrics and logs as for the replica, for about 15 minutes, paying close attention to things like Redis primary CPU saturation.
- If there is any unsustainable impact, disabling probe_jobs is the first step, then downgrading if absolutely necessary. See rollback steps.
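When the slow-log data is ambiguous, the manual curl invocation in the steps above can be wrapped in a small timing loop. A sketch, using the same endpoint as above; the sample count and summary format are my additions:

```shell
# Sample the gitlab-exporter /sidekiq probe 5 times from the Redis node itself
# and summarise request latency (curl's %{time_total} is reported in seconds).
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w '%{time_total}\n' http://localhost:4567/sidekiq
done | awk '
  { if (NR == 1 || $1 < min) min = $1
    if ($1 > max) max = $1
    sum += $1 }
  END { printf "min=%.3fs max=%.3fs avg=%.3fs\n", min, max, sum/NR }'
```

Against the thresholds above: an average comfortably under 0.05s is fine, and sustained values approaching 0.1s are a signal to disable probe_jobs.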
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 10 minutes
Disabling probe_jobs
If only the new Lua script is problematic, we can simply disable probe_jobs:
- If time is critical (impact is high and unsustainable), remove the probe_jobs line from /opt/gitlab-monitor/config/redis-config.yml and restart gitlab-monitor (sudo sv restart gitlab-monitor). Focus on the Redis primary first.
- Remove probe_jobs from roles/gprd-base-db-redis-server-sidekiq.json in chef (via an MR).
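For orientation, disabling the probe amounts to dropping probe_jobs from the sidekiq probe's method list in redis-config.yml. The fragment below is an illustrative sketch of gitlab-exporter's probe-config shape, not a verbatim copy of the production file; the other method names and the redis_url value are assumptions:

```yaml
# /opt/gitlab-monitor/config/redis-config.yml (illustrative sketch only)
probes:
  sidekiq:
    methods:
      - probe_stats
      - probe_queues
      # - probe_jobs   # removed to disable the new Lua-based probe
    opts:
      redis_url: "redis://localhost:6379"
```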
Rollback to gitlab-monitor
If disabling probe_jobs is not enough and a full rollback is required:
- Revert and apply the changes in https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/109. It may also be necessary to manually delete /opt/gitlab-monitor/gitlab-exporter.gemspec on the relevant nodes before chef applies.
Monitoring
Key metrics to observe
- Metric: Redis Primary CPU component saturation (yellow line, Saturation graph, top right)
  - Location: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1
  - What changes to this metric should prompt a rollback: more than a 2% increase (basically: any visible consistent increase).
- Metric: Slow log reports for EVALSHA
  - Location: https://log.gprd.gitlab.net/goto/c528546500306bb25558120090484be2
  - What changes to this metric should prompt a rollback: entries showing the Lua evaluation regularly taking > 0.1s (100ms).
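If cross-checking the slow log directly with redis-cli SLOWLOG GET on a node (rather than via the Kibana link above), note that Redis reports slow-log durations in microseconds, so the 0.1s rollback threshold translates as:

```shell
# 0.1s (100ms) expressed in microseconds, the unit used by Redis SLOWLOG.
awk 'BEGIN { printf "%d\n", 0.1 * 1000000 }'
```

That is, SLOWLOG entries for the EVALSHA call at or above roughly 100000 microseconds meet the rollback criterion.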
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Matt Smiley