In production#2570 (comment 401672637) we saw that disabling chef-client with disable-chef-client wasn't preserved over a reboot, which unexpectedly brought the DB back into rotation with high replication delay.
We need to figure out why it isn't preserved over reboots (the node was in an unhealthy state, so maybe it just failed to write changes to disk in this one case and works in general; we need to test) and fix it.
The simplest direct approach is to make the bootstrap script aware of the disabler (check for the disabled binary in /usr/bin/chef-client-alias-when-disabled perhaps?) and not install chef in that case. However, the shutdown script also de-registers the node from chef, so rebooting the node would remove it from chef's database until we get around to re-enabling chef properly. That may be acceptable; these situations tend to be edge cases where, after the reboot, we'll be performing manual actions with the goal of either re-enabling chef and the node in due course, or destroying the node permanently. There's a small risk we could end up with orphaned nodes, but that can already happen for a number of other, similar reasons, so it's probably not a huge problem.
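To make that concrete, here is a minimal sketch of what the guard at the top of the bootstrap script could look like. The shim path is the one suggested above, and the structure of the real script is an assumption:

```bash
#!/usr/bin/env bash
# Sketch of a guard near the top of the bootstrap script (hypothetical):
# if the disable shim is present, skip the chef install/run entirely.
CHEF_DISABLED_SHIM="/usr/bin/chef-client-alias-when-disabled"

if [[ -e "${CHEF_DISABLED_SHIM}" ]]; then
  echo "chef-client was disabled before this boot; skipping chef install and run" >&2
  exit 0
fi

# ... existing chef install and initial chef-client run ...
```

Exiting early keeps the rest of the bootstrap untouched for the normal, chef-enabled case.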
@hphilipps @cmiskell Are we not running chef-client as a systemd service? If so, we can use ConditionPathExists with the negation so that the service does not start when the disable file is present at boot.
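For illustration, a drop-in along these lines could express that condition. The unit name (chef-client.service) and the marker path are assumptions based on the discussion above:

```bash
# Sketch only: a systemd drop-in that prevents chef-client from starting while
# the disable marker exists. Unit name and marker path are assumptions.
sudo mkdir -p /etc/systemd/system/chef-client.service.d
sudo tee /etc/systemd/system/chef-client.service.d/disable-guard.conf > /dev/null <<'EOF'
[Unit]
# The leading "!" negates the condition: skip starting the unit if the path exists.
ConditionPathExists=!/usr/bin/chef-client-alias-when-disabled
EOF
sudo systemctl daemon-reload
```

This only covers the systemd-managed runs, though; as noted below, the bootstrap script can still install and invoke chef-client directly.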
So yes, it does run as a systemd service, but that still doesn't prevent the bootstrap script from manually installing and then invoking chef-client at startup.
@cmiskell I think the deregister dance we do on shutdown is problematic in this case. If we did not have that, we could simply omit the install after the initial bootstrap. But the node no longer being registered in chef whenever the client was disabled and the node was shut down makes this a tough nut to crack.
I am still not sure why we de-register automatically anyway. We're not auto-scaling, and when deleting a node, we should just make this a scripted step. The auto-deregister is nice but also does not work all the time (it has failed a couple of times already), so we still need to check and clean up manually. Removing it would allow us to at least easily implement a proper way of persisting a disabled chef-client across reboots.
Especially because those nodes can turn into time bombs. All it takes is for chef to be disabled for some reason, and then GCP detecting a hypervisor failure and rebooting the machine on another host. We had those regularly in the past, although admittedly they have become a bit rarer. Uncontrolled chef changes like these can cause problems that might be very hard to debug.
This is potentially a big-scope issue. We're going to time-box an investigation into its feasibility. This issue will be complete when we have determined how feasible it is and how much work it will involve. If it is feasible with reasonable effort, we will create a new issue to capture that work.
@cmcfarland It is true that chef-client stops the default Ubuntu postgresql service (because postgres needs to be managed by Patroni) and that chef also does not set the patroni service to be enabled.
But chef restarts the Patroni service if it needs to change patroni.yml, and to prevent this we disabled chef-client (as we had manually set the noloadbalance tag in patroni.yml to take the node out of rotation for the reboot). So it was an unpleasant surprise that chef-client was enabled again after the reboot.
Speaking generally: disabling chef-client is not really safe, as any (intended or unintended) reboot will enable it again.
The narrowest change I can think of is essentially what was mentioned above:
Update the teardown script to detect the shim and take no chef action because of it (see the sketch after this list).
Update the bootstrap script to detect the shim and take no chef action because of it.
Update the scripts on patroni nodes and test them.
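For the teardown side, the guard would be the mirror image of the bootstrap one. Again a sketch, with the shim path taken from the earlier suggestion and the real de-registration steps elided:

```bash
#!/usr/bin/env bash
# Sketch of a teardown-script guard: if chef was deliberately disabled,
# leave the node registered in chef rather than de-registering it on shutdown.
CHEF_DISABLED_SHIM="/usr/bin/chef-client-alias-when-disabled"

if [[ -e "${CHEF_DISABLED_SHIM}" ]]; then
  echo "chef-client is disabled; skipping chef de-registration on shutdown" >&2
  exit 0
fi

# ... existing de-registration / cleanup steps ...
```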
This would mean that a disabled chef node could be restarted with little impact aside from the following:
Since a restarted/stopped node is still in the chef inventory, there may be alerts about it sent to EOC.
A deleted node after having chef disabled would not be purged from the inventory and chef client and nodes would need to be deleted by hand after such an event.
The only alternative I can imagine is to have the chef disable shim create a lockfile that the patroni recipe can detect after a reboot, so it can decide not to start postgresql while that lockfile is present.
As a long-tail follow-up, we should watch for "stale" chef clients that haven't checked in for longer than X days, and send an alert to prompt cleanup/investigation.
@craig What do you think of a chef-repo scheduled job that just notifies about nodes that have not converged in a set amount of time? I considered a prometheus check for this, but if a node is provisioned and chef never runs, it never shows up in our metrics.
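A rough sketch of what such a scheduled job might do, keying off the ohai_time attribute with an arbitrary threshold (both choices are illustrative, not what the chef-repo currently has):

```bash
#!/usr/bin/env bash
# Sketch of a scheduled "stale node" report. The threshold, the knife query,
# and where the notification ends up are illustrative assumptions.
MAX_AGE_DAYS=3
CUTOFF=$(( $(date +%s) - MAX_AGE_DAYS * 86400 ))

# ohai_time is (roughly) when the node last completed a chef-client run;
# anything older than the cutoff has not converged recently.
STALE=$(knife search node "ohai_time:[* TO ${CUTOFF}]" -i 2>/dev/null \
  | grep -v 'items found' | sed '/^$/d')

if [[ -n "${STALE}" ]]; then
  echo "Nodes that have not converged in the last ${MAX_AGE_DAYS} days:"
  echo "${STALE}"
  # e.g. post this list to Slack or open an issue from the scheduled job
fi
```

One caveat: a node object that was registered but has never converged may have no ohai_time at all, so the job would probably also want to flag nodes missing that attribute entirely, which covers the "provisioned but never ran" case mentioned above.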
My most recent attempt to test this had good results for the teardown script, but the startup script had some bash issues. I will be attempting the change issue again to test in staging: production#5715 (closed)
This change is currently only applied to a single fleet (console) in gstg, but we need to test more fleets before widening our deployment of this change. The testing would involve updating the module and performing a single, chef-enabled reboot of a node from the fleet. Which fleets should we test with?
Fleets to test with:
generic-stor: file-hdd?
generic-stor-with-group: patroni?
generic-stor-redis: redis-cache (this is the only option)
Should we pick the one that is lower impact, or the one more likely to be used in production? In this case, a gitaly node in GSTG is about as impactful on a reboot test as thanos-compact might be. I suppose we would need to conduct the reboot when a deploy is not occurring, though.
Oh you are right, a Gitaly reboot in gstg is not going to be impactful as long as a deployment is not happening. We should do gitaly then, since it's the more critical of the two, and it will give us more confidence for applying the change in gprd.
The job for GSTG failed due to, I think, the volume of requests to update compute instances, so I'm going to have to break this up into two MRs.
@alejandro I see that you provisioned patroni-zfs in this commit about 7 months ago. Was there a specific reason you chose to use bootscript version 9 instead of 10 or the default variable for the environment?
Is this just copy-paste from the file-zfs module from 2 years ago?
@cmcfarland The bootstrap script MR was merged a couple of weeks before the zfs MR. I probably pinned it because all other envs and storage nodes were still using version 8 of the script, and I didn't want to hold up testing zfs until the script had been rolled out across the entire gstg fleet.
If we have zfs nodes that successfully use the latest version of the bootstrap script, then we can probably test unpinning this.
Ok, I think I am going to need to review the changes to the script and make sure I can properly note any changes before applying them in staging. Once the net difference is understood, I can proceed.