Add redis-tracechunks connection
Production Change
Change Summary
For &462 (closed) we are splitting the CI job trace chunks storage to a dedicated Redis. The VMs are built, and the application can now accept configuration of that connection, with active use still being gated behind a feature flag that is to be enabled later in a further change issue.
This change provides that configuration to both gstg and gprd.
Because of the small risk of configuration leakage (if there are any bugs in the connection handling) that might cause this new connection to be actively used before we expect it, this is considered a C2: it still carries some risk of impact if something unexpected happens.
Change Details
- Services Impacted - Service::Redis
- Change Technician - @cmiskell
- Change Reviewer - @cmcfarland
- Time tracking - 1.25hr
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 minute
- Obtain review/approvals on:
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 minutes
- Merge and ensure https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/112 has been applied
- From a local copy of chef-repo, run `./bin/gkms-vault-edit gitlab-omnibus-secrets gstg` and add an entry for `redis_trace_chunks_instance` alongside the existing `redis_queues_instance` and `redis_cache_instance` URLs; same password, just adjust the identifier at the end
- In parallel:
  - Run chef manually with: `knife ssh -C1 'roles:gstg-base-fe-web OR roles:gstg-base-be-sidekiq' "sudo chef-client"`
    - There are only 4 nodes that may actively need this configuration, and a manual run takes about 6 minutes; otherwise we would have to wait 35+ minutes for chef to run naturally.
  - Merge and monitor the application of gitlab-com/gl-infra/k8s-workloads/gitlab-com!936 (merged): should take about 15 minutes to apply. It requires the chef change to be in place first to pick up the new external values.
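As an illustration of the "same password, just adjust the identifier at the end" edit in the gkms-vault step above, the new URL can be derived from an existing entry like this (the URL shape and names here are invented for the sketch, not the real secret values):

```shell
# Invented example values; the real URLs live in the GKMS vault and the
# hostname shape here is an assumption for illustration only.
existing_url="redis://:EXAMPLE_PASSWORD@gstg-redis-cache"
# Keep everything (including the password) and swap only the trailing
# identifier for the new instance name.
trace_chunks_url="${existing_url%-*}-tracechunks"
echo "$trace_chunks_url"
```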
- Verify the results:
  - Check for the string "redis-tracechunks" in /etc/gitlab/gitlab.rb on a web node: `knife ssh web-01-sv-gstg.c.gitlab-staging-1.internal "sudo grep tracechunks /etc/gitlab/gitlab.rb"`
  - Check for the presence of the redis.trace_chunks.yml file: `knife ssh web-01-sv-gstg.c.gitlab-staging-1.internal "sudo ls -l /var/opt/gitlab/gitlab-rails/etc/redis.trace_chunks.yml"`
  - Check for the connection in the webservice configmap in k8s. In a simple SSH shell on the console server: `kubectl --cluster gke_gitlab-staging-1_us-east1-d_gstg-us-east1-d -n gitlab describe configmap gitlab-webservice | grep redis-trace`
  - Start a shell on the current redis primary and run: `REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf | cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli MONITOR`
  - In a Rails console, execute `Gitlab::Redis::TraceChunks.with { |r| r.ping }`
    - We expect a "PONG" result, and to see the ping in the output of the monitor (distinct from the PING and PUBLISH traffic from the replicas). If the ping is not seen in the monitor output, the connection may still be going to SharedState; try executing `Gitlab::Redis::TraceChunks.config_file_name` in the console to see which config file is in use.
  - Repeat this in a console started (`/srv/gitlab/bin/rails console`) from within a Rails docker container (ssh to a node + docker exec into the container).
  - Failure to see the expected traffic to the correct Redis instance is grounds for aborting this change issue.
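The single application ping is easy to miss among the replicas' keepalive traffic in the MONITOR stream. A sketch of post-filtering a captured stream (the sample lines, addresses, and the case distinction between replica and console commands are assumptions made for this illustration):

```shell
# Illustration only: these sample lines stand in for a real capture made
# with e.g. `redis-cli MONITOR | tee /tmp/monitor.log`. Addresses and
# timestamps are invented.
cat > /tmp/monitor-sample.log <<'EOF'
1620000000.000001 [0 10.0.0.2:6379] "PING"
1620000000.000002 [0 10.0.0.2:6379] "PUBLISH" "__sentinel__:hello" "..."
1620000000.000003 [0 10.0.0.9:41000] "ping"
EOF
# Drop the replication PING/PUBLISH keepalives; what remains should
# include the "ping" issued from the Rails console (this sketch assumes
# the console client sends the command in lowercase).
grep -vE '"(PING|PUBLISH)"' /tmp/monitor-sample.log
```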
- Repeat on production
- In parallel, merge and monitor the deployment of:
  - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/113
  - From a local copy of chef-repo, run `./bin/gkms-vault-edit gitlab-omnibus-secrets gprd` and add an entry for `redis_trace_chunks_instance` alongside the existing `redis_queues_instance` and `redis_cache_instance` URLs; same password, just adjust the identifier at the end
  - Allow chef to run naturally over 35 minutes (manual action is inefficient)
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!937 (merged): should take about 40 minutes to apply
- Verify the results:
  - Check for the string "redis-tracechunks" in /etc/gitlab/gitlab.rb on a web node: `knife ssh web-01-sv-gprd.c.gitlab-production.internal "sudo grep tracechunks /etc/gitlab/gitlab.rb"`
  - Check for the presence of the redis.trace_chunks.yml file: `knife ssh web-01-sv-gprd.c.gitlab-production.internal "sudo ls -l /var/opt/gitlab/gitlab-rails/etc/redis.trace_chunks.yml"`
  - Check for the connection in the webservice configmap in k8s. In a simple SSH shell on the console server: `kubectl --cluster gke_gitlab-production_us-east1-d_gprd-us-east1-d -n gitlab describe configmap gitlab-webservice | grep redis-trace`
  - Start a shell on the current redis primary and run: `REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf | cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli MONITOR`
  - In a Rails console, execute `Gitlab::Redis::TraceChunks.with { |r| r.ping }`
    - We expect a "PONG" result, and to see the ping in the output of the monitor (distinct from the PING and PUBLISH traffic from the replicas). If the ping is not seen in the monitor output, the connection may still be going to SharedState; try executing `Gitlab::Redis::TraceChunks.config_file_name` in the console to see which config file is in use.
  - Repeat this in a console started (`/srv/gitlab/bin/rails console`) from within a Rails docker container (ssh to a node + docker exec into the container).
Post-Change Steps - steps to take to verify the change
See the in-line verification steps during execution.
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1hr
- Create new MRs reverting the MRs/commits already applied, and apply them.
Monitoring
Key metrics to observe
- Metric: All metrics on the Tracechunks dashboard
  - Location: https://dashboards.gitlab.net/d/redis-tracechunks-main/redis-tracechunks-overview?orgId=1
  - What changes to this metric should prompt a rollback:
    - Not seeing the expected handful of new connections and the ping operations from the verification steps
    - Seeing a large amount of network traffic, RPS, and operations (beyond base replication traffic), implying the instance is being used unexpectedly
- Metric: All metrics on the shared state dashboard
  - Location: https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1
  - What changes to this metric should prompt a rollback:
    - A substantial drop, particularly in RPS, operation rates, and especially network traffic (trace chunk traffic accounts for roughly 50% of current throughput on the shared state Redis); a drop would imply traffic has unexpectedly moved to the tracechunks cluster
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None of the above.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.