# 2023-05-16: test that Redis and Sentinel start up on machine boot in gstg after removing downfiles

## Production Change

### Change Summary
Some time ago we had an incident that was exacerbated by the fact that the Redis and Sentinel services controlled by Omnibus were not started on machine boot, so recovery took longer than expected.

Subsequent investigation and changes are tracked in this CA. However, despite applying the suggested Chef config changes that were supposed to remediate the problem, testing in `gstg` revealed that both services still failed to start on machine boot, which caused an incident in staging.
Further investigation revealed that the services most likely fail to start on boot because of the presence of `down` files for the services: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17075#note_1377784222. These files were created by Chef but, for some unknown reason, are no longer controlled by Chef.
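For context, Omnibus GitLab supervises its services with runit, and runit will not auto-start a service whose directory contains a `down` file. A minimal way to confirm the mechanism on a node (the Sentinel path is an assumption; only the Redis path appears in the linked issue):

```shell
# A runit service directory containing a "down" file is skipped at boot;
# the service stays down until started explicitly.
ls -l /opt/gitlab/service/redis/down      # path cited in the linked issue
ls -l /opt/gitlab/service/sentinel/down   # assumed analogous Sentinel path

# Supervision state as reported by Omnibus:
sudo gitlab-ctl status redis
sudo gitlab-ctl status sentinel
```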
We want to test that, once the `down` files are removed by way of an Ansible playbook, Redis and Sentinel start on machine boot as expected. The test will be done in `gstg` - if it succeeds there, that gives us confidence that the fix will work in `gprd` as well, without having to incur possible downtime in production. If the test does not succeed, we'll know that further investigation is needed.
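The `redis-downfiles` playbook in the misc-playbooks repository is the source of truth; functionally, the change it applies should amount to something like the following on each Redis host (a sketch, assuming the paths above):

```shell
# Illustrative only - remove the runit down files so Redis and Sentinel
# auto-start on the next boot. -f keeps the operation idempotent when a
# file has already been removed.
sudo rm -fv /opt/gitlab/service/redis/down
sudo rm -fv /opt/gitlab/service/sentinel/down   # Sentinel path assumed
```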
### Change Details

- Services Impacted - ~"Service::Redis" ~"Service::RedisCache" ~"Service::RedisCacheSentinel" ~"Service::RedisDbLoadBalancing" ~"Service::RedisRegistryCache" ~"Service::RedisRepositoryCache" ~"Service::RedisSessions" ~"Service::RedisSidekiq" ~"Service::RedisTraceChunks" (I do NOT expect to incur downtime on ALL of these services as testing on one Redis cluster will be enough, but they will ALL have changes applied to them)
- Change Technician - @ayeung
- Change Reviewer - @gsgl
- Time tracking - 25 minutes
- Downtime Component - YES
### Detailed steps for the change

#### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes
- Ensure you've cloned https://gitlab.com/gitlab-com/gl-infra/ansible-workloads/misc-playbooks and pulled the latest changes on `main`.
- Coordinate with `@release-managers` to ensure that the CR can be started. There should be no ongoing deployments in `gstg` - delay starting the CR until this is the case.
- Set label ~change::in-progress: `/label ~change::in-progress`
- Change the `hosts` key in the `redis-downfiles` playbook to `hosts: gstg` so that the playbook is applied to the `gstg` hosts in the inventory. Save.
- Run `ansible-playbook playbook.yml -i hosts` and leave the output as a comment on this CR.
- Check a subset of hosts to ensure that the down file has been removed by running `bundle exec knife ssh --no-host-key-verify roles:gstg-base-db-redis "file /opt/gitlab/service/redis/down"` (the expected output is sketched after this list).
- Reboot test on a secondary node in the main Redis cluster:
  - Log into the primary (master) and one of the secondary (slave) nodes in the cluster. The shell prompt should helpfully state `Staging PRIMARY-REDIS` or `Staging secondary-redis` when you log in, but to be sure run `sudo gitlab-redis-cli` then `info`. Under the Replication section, `role` should be `master` or `slave` respectively (see the sketch after this list).
  - Ensure `master_link_status` is `up`, then exit `gitlab-redis-cli`.
  - Do another Chef run to ensure that the `down` file is not replaced by Chef. If it is, abort the CR.
  - Tail the Redis log on the master node: `tail -f /var/log/gitlab/redis/current`.
  - Run `sudo reboot` on the slave. The log on the master node should show something like `2023-05-16_02:01:59.81570 4150366:M 16 May 2023 02:01:59.815 # Connection with replica redis-repository-cache-02-db-pre.c.gitlab-pre.internal:6379 lost.`
  - Some time later the log should show something like:
    `2023-05-16_02:03:34.56877 4150366:M 16 May 2023 02:03:34.568 * Replica redis-repository-cache-02-db-pre.c.gitlab-pre.internal:6379 asks for synchronization`
    `2023-05-16_02:03:34.67781 4150366:M 16 May 2023 02:03:34.677 * Synchronization with replica redis-repository-cache-02-db-pre.c.gitlab-pre.internal:6379 succeeded`
  - Wait 10 minutes for the state to stabilise.
- Reboot test on the master node in the main Redis cluster:
  - On the primary (master) node of the main cluster, run `gitlab-redis-cli` and then `info`. Ensure that `slave0` and `slave1` both say `state=online` and `lag=0`. Exit `gitlab-redis-cli`.
  - Tail the Redis log on one of the slave nodes: `tail -f /var/log/gitlab/redis/current`.
  - Run `sudo reboot` on the master. The log on the slave node should show something like `2023-05-16_05:45:30.25956 3361342:S 16 May 2023 05:45:30.259 # Connection with master lost.`
  - Some time later the log should show something like:
    `2023-05-16_05:45:40.66328 3361342:M 16 May 2023 05:45:40.663 * Discarding previously cached master state.`
    `2023-05-16_05:45:40.66332 3361342:M 16 May 2023 05:45:40.663 # Setting secondary replication ID to 2ae18bca918b075c28633fea9ed8a9ea261db2da, valid up to offset: 4017516749. New replication ID is 8f512a4e2280d49d61db9ce351600c358de55a07`
    `2023-05-16_05:45:40.66332 3361342:M 16 May 2023 05:45:40.663 * MASTER MODE enabled (user request from 'id=878076 addr=10.232.6.102:35307 laddr=10.232.6.103:6379 fd=10 name=sentinel-98972655-cmd age=13341 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=202 qbuf-free=40752 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')`
    `2023-05-16_05:47:11.24504 3361342:M 16 May 2023 05:47:11.245 * Synchronization with replica redis-repository-cache-01-db-pre.c.gitlab-pre.internal:6379 succeeded`
- Set label ~change::complete: `/label ~change::complete`
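As referenced in the steps above, here is an illustrative sketch of the healthy outputs to look for. Hostnames and exact values will differ; it also assumes `gitlab-redis-cli` forwards its arguments to the underlying `redis-cli`.

```shell
# Down-file check: once the file has been removed, file(1) reports that
# it cannot open the path. Run on any Redis node (or via knife ssh).
file /opt/gitlab/service/redis/down
#   /opt/gitlab/service/redis/down: cannot open
#   `/opt/gitlab/service/redis/down' (No such file or directory)

# Replication health on a secondary - just the fields the steps care about:
sudo gitlab-redis-cli info replication | grep -E '^(role|master_link_status):'
#   role:slave
#   master_link_status:up
```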
### Rollback

#### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5 minutes

If either of the nodes does NOT come up with Redis and Sentinel within 5 minutes of rebooting:

- For the slave, log into the machine and start the services manually by running `gitlab-ctl start redis` and `gitlab-ctl start sentinel`.
- For the master, log into one of the slaves, run `sudo gitlab-redis-cli` and then `slaveof no one` to promote it to master. Then log into the machine that was just rebooted (the old master) and start the services manually by running `gitlab-ctl start redis` and `gitlab-ctl start sentinel`. (A sketch of the promotion follows this list.)
- Set label ~change::aborted: `/label ~change::aborted`
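For reference, the manual promotion above looks roughly like this (a sketch; it assumes `gitlab-redis-cli` forwards arguments to `redis-cli`, and note that Sentinel may already have performed this failover on its own):

```shell
# On one of the surviving slaves: stop replicating and become master.
sudo gitlab-redis-cli slaveof no one
#   OK

# On the rebooted old master: start the Omnibus-supervised services by hand.
sudo gitlab-ctl start redis
sudo gitlab-ctl start sentinel
```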
I don't think it's necessary to replace the deleted files, as they serve no purpose other than preventing the services from starting on boot.
### Monitoring

#### Key metrics to observe

- Metric: `gstg` Redis client exceptions
  - Location: https://thanos.gitlab.net/graph?g0.expr=rate(gitlab_redis_client_exceptions_total%7Benv%3D%22gstg%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  - What changes to this metric should prompt a rollback: We expect to see a large increase in exceptions immediately following the reboot of the master node, then recovery when one of the slaves is promoted. If the metric spikes when the slave is rebooted, or recovery takes longer than 5 minutes, do not proceed with the CR.
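Decoded from the URL above, the underlying query is `rate(gitlab_redis_client_exceptions_total{env="gstg"}[1m])`. It can also be spot-checked from a terminal during the change window (a sketch; assumes the standard Prometheus-compatible query API is exposed at thanos.gitlab.net and that you have access):

```shell
# Spot-check the Redis client exception rate from the CLI. The jq path
# follows the standard Prometheus API response shape.
curl -sG 'https://thanos.gitlab.net/api/v1/query' \
  --data-urlencode 'query=rate(gitlab_redis_client_exceptions_total{env="gstg"}[1m])' \
  | jq '.data.result[] | {instance: .metric.instance, rate: .value[1]}'
```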
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (if needed - cases include DB changes) prior to change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.