# 2023-05-16: test that Redis and Sentinel start up on machine boot in gstg after removing downfiles

## Production Change

### Change Summary
Some time ago we had an incident that was exacerbated by the fact that the Redis and Sentinel services controlled by Omnibus were not started on machine boot, so recovery took longer than expected.

Subsequent investigation and changes are tracked in this CA. However, despite applying the suggested Chef config changes that were supposed to remediate the problem, testing in `gstg` revealed that both services still failed to start on machine boot, which caused an incident in staging.
Further investigation revealed that the services most likely fail to start on boot because of the presence of `down` files for the services: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/17075#note_1377784222. These files were created by Chef but, for some unknown reason, are no longer controlled by Chef.
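For context, Omnibus GitLab supervises its services with runit, and runit will not auto-start a service whose directory contains a `down` file. A minimal way to confirm the mechanism on a node (the Sentinel path is an assumption; only the Redis path appears in the linked issue):

```shell
# A runit service directory containing a "down" file is skipped at boot;
# the service stays down until started explicitly.
ls -l /opt/gitlab/service/redis/down      # path cited in the linked issue
ls -l /opt/gitlab/service/sentinel/down   # assumed analogous Sentinel path

# Supervision state as reported by Omnibus:
sudo gitlab-ctl status redis
sudo gitlab-ctl status sentinel
```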
We want to test that, once the `down` files are removed by way of an Ansible playbook, Redis and Sentinel start on machine boot as expected. The test will be done in `gstg` - if it succeeds there, that gives us confidence that the fix will work in `gprd` as well, without having to incur possible downtime in production. If the test does not succeed, we'll know that further investigation is needed.
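The `redis-downfiles` playbook in the misc-playbooks repository is the source of truth; functionally, the change it applies should amount to something like the following on each Redis host (a sketch, assuming the paths above):

```shell
# Illustrative only - remove the runit down files so Redis and Sentinel
# auto-start on the next boot. -f keeps the operation idempotent when a
# file has already been removed.
sudo rm -fv /opt/gitlab/service/redis/down
sudo rm -fv /opt/gitlab/service/sentinel/down   # Sentinel path assumed
```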
### Change Details

- Services Impacted - ~"Service::Redis" ~"Service::RedisCache" ~"Service::RedisCacheSentinel" ~"Service::RedisDbLoadBalancing" ~"Service::RedisRegistryCache" ~"Service::RedisRepositoryCache" ~"Service::RedisSessions" ~"Service::RedisSidekiq" ~"Service::RedisTraceChunks" (I do NOT expect to incur downtime on ALL of these services as testing on one Redis cluster will be enough, but they will ALL have changes applied to them)
- Change Technician - @ayeung
- Change Reviewer - @gsgl
- Time tracking - 25 minutes
- Downtime Component - YES
### Detailed steps for the change

#### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes
- Ensure you've cloned https://gitlab.com/gitlab-com/gl-infra/ansible-workloads/misc-playbooks and pulled the latest changes on `main`.
- Coordinate with `@release-managers` to ensure that the CR can be started. There should be no ongoing deployments in `gstg` - delay starting the CR until this is the case.
- Set label ~change::in-progress: `/label ~change::in-progress`
- Change the `hosts` key in the `redis-downfiles` playbook to `hosts: gstg` so that the playbook is applied to the `gstg` hosts in the inventory. Save.
- Run `ansible-playbook playbook.yml -i hosts` and leave the output as a comment on this CR.
- Check a subset of hosts to ensure that the down file has been removed by running `bundle exec knife ssh --no-host-key-verify roles:gstg-base-db-redis "file /opt/gitlab/service/redis/down"` (the expected output is sketched after this list).
- Reboot test on a secondary node in the main Redis cluster:
  - Log into the primary (master) and one of the secondary (slave) nodes in the cluster. The shell prompt should helpfully state `Staging PRIMARY-REDIS` or `Staging secondary-redis` when you log in, but to be sure run `sudo gitlab-redis-cli` then `info`. Under the Replication section, `role` should be `master` or `slave` respectively (see the sketch after this list).
  - Ensure `master_link_status` is `up`, then exit `gitlab-redis-cli`.
  - Do another Chef run to ensure that the `down` file is not replaced by Chef. If it is, abort the CR.
  - Tail the Redis log on the master node: `tail -f /var/log/gitlab/redis/current`.
  - Run `sudo reboot` on the slave. The log on the master node should show something like `2023-05-16_02:01:59.81570 4150366:M 16 May 2023 02:01:59.815 # Connection with replica redis-repository-cache-02-db-pre.c.gitlab-pre.internal:6379 lost.`
  - Some time later the log should show something like:
    `2023-05-16_02:03:34.56877 4150366:M 16 May 2023 02:03:34.568 * Replica redis-repository-cache-02-db-pre.c.gitlab-pre.internal:6379 asks for synchronization`
    `2023-05-16_02:03:34.67781 4150366:M 16 May 2023 02:03:34.677 * Synchronization with replica redis-repository-cache-02-db-pre.c.gitlab-pre.internal:6379 succeeded`
  - Wait 10 minutes for the state to stabilise.
- Reboot test on the master node in the main Redis cluster:
  - On the primary (master) node of the main cluster, run `gitlab-redis-cli` and then `info`. Ensure that `slave0` and `slave1` both say `state=online` and `lag=0`. Exit `gitlab-redis-cli`.
  - Tail the Redis log on one of the slave nodes: `tail -f /var/log/gitlab/redis/current`.
  - Run `sudo reboot` on the master. The log on the slave node should show something like `2023-05-16_05:45:30.25956 3361342:S 16 May 2023 05:45:30.259 # Connection with master lost.`
  - Some time later the log should show something like:
    `2023-05-16_05:45:40.66328 3361342:M 16 May 2023 05:45:40.663 * Discarding previously cached master state.`
    `2023-05-16_05:45:40.66332 3361342:M 16 May 2023 05:45:40.663 # Setting secondary replication ID to 2ae18bca918b075c28633fea9ed8a9ea261db2da, valid up to offset: 4017516749. New replication ID is 8f512a4e2280d49d61db9ce351600c358de55a07`
    `2023-05-16_05:45:40.66332 3361342:M 16 May 2023 05:45:40.663 * MASTER MODE enabled (user request from 'id=878076 addr=10.232.6.102:35307 laddr=10.232.6.103:6379 fd=10 name=sentinel-98972655-cmd age=13341 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=202 qbuf-free=40752 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')`
    `2023-05-16_05:47:11.24504 3361342:M 16 May 2023 05:47:11.245 * Synchronization with replica redis-repository-cache-01-db-pre.c.gitlab-pre.internal:6379 succeeded`
- Set label ~change::complete: `/label ~change::complete`
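As referenced in the steps above, here is an illustrative sketch of the healthy outputs to look for. Hostnames and exact values will differ; it also assumes `gitlab-redis-cli` forwards its arguments to the underlying `redis-cli`.

```shell
# Down-file check: once the file has been removed, file(1) reports that
# it cannot open the path. Run on any Redis node (or via knife ssh).
file /opt/gitlab/service/redis/down
#   /opt/gitlab/service/redis/down: cannot open
#   `/opt/gitlab/service/redis/down' (No such file or directory)

# Replication health on a secondary - just the fields the steps care about:
sudo gitlab-redis-cli info replication | grep -E '^(role|master_link_status):'
#   role:slave
#   master_link_status:up
```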
### Rollback

#### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5 minutes

If either of the nodes does NOT come up with Redis and Sentinel within 5 minutes of rebooting:

- For the slave, log into the machine and start the services manually by running `gitlab-ctl start redis` and `gitlab-ctl start sentinel`.
- For the master, log into one of the slaves, run `sudo gitlab-redis-cli` and then `slaveof no one` to promote it to master. Then log into the machine that was just rebooted (the old master) and start the services manually by running `gitlab-ctl start redis` and `gitlab-ctl start sentinel`. (A sketch of the promotion follows this list.)
- Set label ~change::aborted: `/label ~change::aborted`
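For reference, the manual promotion above looks roughly like this (a sketch; it assumes `gitlab-redis-cli` forwards arguments to `redis-cli`, and note that Sentinel may already have performed this failover on its own):

```shell
# On one of the surviving slaves: stop replicating and become master.
sudo gitlab-redis-cli slaveof no one
#   OK

# On the rebooted old master: start the Omnibus-supervised services by hand.
sudo gitlab-ctl start redis
sudo gitlab-ctl start sentinel
```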
I don't think it's necessary to replace the deleted files, as they serve no purpose other than preventing the services from starting on boot.
### Monitoring

#### Key metrics to observe

- Metric: `gstg` Redis client exceptions
  - Location: https://thanos.gitlab.net/graph?g0.expr=rate(gitlab_redis_client_exceptions_total%7Benv%3D%22gstg%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  - What changes to this metric should prompt a rollback: We expect to see a large increase in exceptions immediately following the reboot of the master node, then recovery when one of the slaves is promoted. If the metric spikes when the slave is rebooted, or recovery takes longer than 5 minutes, do not proceed with the CR.
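Decoded from the URL above, the underlying query is `rate(gitlab_redis_client_exceptions_total{env="gstg"}[1m])`. It can also be spot-checked from a terminal during the change window (a sketch; assumes the standard Prometheus-compatible query API is exposed at thanos.gitlab.net and that you have access):

```shell
# Spot-check the Redis client exception rate from the CLI. The jq path
# follows the standard Prometheus API response shape.
curl -sG 'https://thanos.gitlab.net/api/v1/query' \
  --data-urlencode 'query=rate(gitlab_redis_client_exceptions_total{env="gstg"}[1m])' \
  | jq '.data.result[] | {instance: .metric.instance, rate: .value[1]}'
```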
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed (if needed - cases include DB changes) prior to change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.