Add redis-tracechunks connection
Production Change
Change Summary
For &462 (closed) we are splitting the CI job trace chunks storage to a dedicated Redis. The VMs are built, and the application can now accept configuration of that connection, with active use still being gated behind a feature flag that is to be enabled later in a further change issue.
This change provides that configuration to both gstg and gprd.
Because of the small risk of configuration leakage (if there are any bugs in the connection handling) that might cause this new connection to be actively used before we expect it, this is considered a C2: it still carries some risk of impact if something unexpected happens.
Change Details
- Services Impacted - Service::Redis
- Change Technician - @cmiskell
- Change Reviewer - @cmcfarland
- Time tracking - 1.25hr
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 minute
- Obtain review/approvals on:
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 minutes
- Merge and ensure https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/112 has been applied
- From a local copy of chef-repo, run `./bin/gkms-vault-edit gitlab-omnibus-secrets gstg` and add an entry for `redis_trace_chunks_instance` alongside the existing `redis_queues_instance` and `redis_cache_instance` URLs; same password, just adjust the identifier at the end
- In parallel:
  - Run chef manually with: `knife ssh -C1 'roles:gstg-base-fe-web OR roles:gstg-base-be-sidekiq' "sudo chef-client"`
    - There are only 4 nodes that may actively need this configuration, and a manual run takes about 6 minutes; otherwise we would have to wait 35+ minutes for chef to run naturally.
  - Merge and monitor the application of gitlab-com/gl-infra/k8s-workloads/gitlab-com!936 (merged): should take about 15 minutes to apply. It requires the chef change to be in place first to pick up the new external values.
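As an illustration of the "same password, just adjust the identifier at the end" edit in the gkms-vault step above, the new URL can be derived from an existing entry like this (the URL shape and names here are invented for the sketch, not the real secret values):

```shell
# Invented example values; the real URLs live in the GKMS vault and the
# hostname shape here is an assumption for illustration only.
existing_url="redis://:EXAMPLE_PASSWORD@gstg-redis-cache"
# Keep everything (including the password) and swap only the trailing
# identifier for the new instance name.
trace_chunks_url="${existing_url%-*}-tracechunks"
echo "$trace_chunks_url"
```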
- Verify the results:
  - Check for the string "redis-tracechunks" in /etc/gitlab/gitlab.rb on a web node: `knife ssh web-01-sv-gstg.c.gitlab-staging-1.internal "sudo grep tracechunks /etc/gitlab/gitlab.rb"`
  - Check for the presence of the redis.trace_chunks.yml file: `knife ssh web-01-sv-gstg.c.gitlab-staging-1.internal "sudo ls -l /var/opt/gitlab/gitlab-rails/etc/redis.trace_chunks.yml"`
  - Check for the connection in the webservice configmap in k8s. In a simple SSH shell on the console server: `kubectl --cluster gke_gitlab-staging-1_us-east1-d_gstg-us-east1-d -n gitlab describe configmap gitlab-webservice | grep redis-trace`
  - Start a shell on the current redis primary and run: `REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf | cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli MONITOR`
  - In a Rails console, execute `Gitlab::Redis::TraceChunks.with { |r| r.ping }`
    - We expect a "PONG" result, and to see the ping in the output of the monitor (distinct from the PING and PUBLISH traffic from the replicas). If the ping is not seen in the monitor output, the connection may still be going to SharedState; try executing `Gitlab::Redis::TraceChunks.config_file_name` in the console to see which config file is in use.
  - Repeat this in a console started (`/srv/gitlab/bin/rails console`) from within a Rails docker container (ssh to a node + docker exec into the container).
  - Failure to see the expected traffic to the correct Redis instance is grounds for aborting this change issue.
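The single application ping is easy to miss among the replicas' keepalive traffic in the MONITOR stream. A sketch of post-filtering a captured stream (the sample lines, addresses, and the case distinction between replica and console commands are assumptions made for this illustration):

```shell
# Illustration only: these sample lines stand in for a real capture made
# with e.g. `redis-cli MONITOR | tee /tmp/monitor.log`. Addresses and
# timestamps are invented.
cat > /tmp/monitor-sample.log <<'EOF'
1620000000.000001 [0 10.0.0.2:6379] "PING"
1620000000.000002 [0 10.0.0.2:6379] "PUBLISH" "__sentinel__:hello" "..."
1620000000.000003 [0 10.0.0.9:41000] "ping"
EOF
# Drop the replication PING/PUBLISH keepalives; what remains should
# include the "ping" issued from the Rails console (this sketch assumes
# the console client sends the command in lowercase).
grep -vE '"(PING|PUBLISH)"' /tmp/monitor-sample.log
```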
- Repeat on production
- In parallel, merge and monitor the deployment of:
  - https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/113
  - From a local copy of chef-repo, run `./bin/gkms-vault-edit gitlab-omnibus-secrets gprd` and add an entry for `redis_trace_chunks_instance` alongside the existing `redis_queues_instance` and `redis_cache_instance` URLs; same password, just adjust the identifier at the end
  - Allow chef to run naturally over 35 minutes (manual action is inefficient)
  - gitlab-com/gl-infra/k8s-workloads/gitlab-com!937 (merged): should take about 40 minutes to apply
- Verify the results:
  - Check for the string "redis-tracechunks" in /etc/gitlab/gitlab.rb on a web node: `knife ssh web-01-sv-gprd.c.gitlab-production.internal "sudo grep tracechunks /etc/gitlab/gitlab.rb"`
  - Check for the presence of the redis.trace_chunks.yml file: `knife ssh web-01-sv-gprd.c.gitlab-production.internal "sudo ls -l /var/opt/gitlab/gitlab-rails/etc/redis.trace_chunks.yml"`
  - Check for the connection in the webservice configmap in k8s. In a simple SSH shell on the console server: `kubectl --cluster gke_gitlab-production_us-east1-d_gprd-us-east1-d -n gitlab describe configmap gitlab-webservice | grep redis-trace`
  - Start a shell on the current redis primary and run: `REDISCLI_AUTH=$(sudo grep ^masterauth /var/opt/gitlab/redis/redis.conf | cut -d\" -f2) /opt/gitlab/embedded/bin/redis-cli MONITOR`
  - In a Rails console, execute `Gitlab::Redis::TraceChunks.with { |r| r.ping }`
    - We expect a "PONG" result, and to see the ping in the output of the monitor (distinct from the PING and PUBLISH traffic from the replicas). If the ping is not seen in the monitor output, the connection may still be going to SharedState; try executing `Gitlab::Redis::TraceChunks.config_file_name` in the console to see which config file is in use.
  - Repeat this in a console started (`/srv/gitlab/bin/rails console`) from within a Rails docker container (ssh to a node + docker exec into the container).
Post-Change Steps - steps to take to verify the change
See the in-line verification steps during execution.
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1hr
- Create new MRs reverting the MRs/commits already applied, and apply them.
Monitoring
Key metrics to observe
- Metric: All metrics on the Tracechunks dashboard
  - Location: https://dashboards.gitlab.net/d/redis-tracechunks-main/redis-tracechunks-overview?orgId=1
  - What changes to this metric should prompt a rollback:
    - Not seeing the expected handful of new connections and the ping operations from the verification steps
    - Seeing a large amount of network traffic, RPS, and operations (beyond base replication traffic), implying the instance is being used unexpectedly
- Metric: All metrics on the shared state dashboard
  - Location: https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1
  - What changes to this metric should prompt a rollback:
    - A substantial drop, particularly in RPS, operation rates, and especially network traffic (trace chunk traffic accounts for roughly 50% of current throughput on the shared state Redis); a drop would imply traffic has unexpectedly moved to the tracechunks cluster
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

None of the above.
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.