Hook onto GCP instance maintenance events
During production#3891 (closed) we discovered that we can react to imminent maintenance events by hooking into a GCP API.
I wrote a small script that long-polls that API and listens for those events.
If everything is okay, we run all executable files in /etc/gcp_watchdog/hooks/up. If there is a problem or maintenance, we run those in /etc/gcp_watchdog/hooks/maint.
Every hook receives the event (or UP when everything is okay) as argv1. Hooks can also be symlinks.
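To illustrate the hook contract, here is a minimal sketch of a hook. The file name 10-log-event and the log_line helper are made up for this example. The only thing taken from the watchdog script is that the event arrives as the first argument.

```bash
#!/usr/bin/env bash
# Hypothetical hook, e.g. /etc/gcp_watchdog/hooks/maint/10-log-event
# (the name is an assumption). It only reports the event it was given.
set -u

# The watchdog passes the event (or UP) as argv1; fall back to "unknown"
# so the hook can also be started by hand without arguments.
event="${1:-unknown}"

log_line () {
  printf 'gcp_watchdog hook: received event %s\n' "$1"
}

log_line "${event}"
```

Because hooks may be symlinks, the same file could be linked into both the up and the maint directory and branch on the value of the event.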
/dev/shm/gcp_watchdog.state contains the current state: the last event, or UP.
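Other processes could read the state file instead of installing a hook. The following is a rough sketch of such a consumer. The instance_is_up function and the STATE_FILE variable are assumptions for illustration and are not part of the watchdog script.

```bash
#!/usr/bin/env bash
# Sketch of a consumer of the watchdog state file.
STATE_FILE="${STATE_FILE:-/dev/shm/gcp_watchdog.state}"

# Succeeds only when the last recorded state is UP.
# A missing or empty file counts as "not up", because the watchdog
# truncates the file on startup before it has seen any event.
instance_is_up () {
  local state
  state="$(cat "${1:-${STATE_FILE}}" 2>/dev/null)"
  [ "${state}" = "UP" ]
}

if instance_is_up; then
  echo "no maintenance pending"
else
  echo "maintenance pending or state unknown"
fi
```

Treating an empty file as "not up" is a conservative choice. A consumer that prefers to keep working while the state is unknown would need to check for the empty case separately.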
If the metadata API is unavailable and no state has been saved during the current run, maintenance is assumed and metadata_unavailable_without_state is passed as the event. If a state has already been saved during the current run, that state is kept instead.
Only a change in state will cause hooks to run.
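The rules for the unavailable API and for state changes can be condensed into one pure function. This is only a sketch for reasoning about the behaviour: decide is a hypothetical helper that does not exist in the script, and it takes the metadata response and the saved state as plain arguments instead of reading them itself.

```bash
#!/usr/bin/env bash
# decide RESPONSE SAVED_STATE
# Prints "<event> <hook subdir>" when hooks should run, nothing otherwise.
decide () {
  local response="$1" state="$2" event subdir
  if [ -z "${response}" ]; then
    # Metadata API unavailable: keep any saved state,
    # otherwise assume maintenance.
    [ -n "${state}" ] && return 0
    event="metadata_unavailable_without_state"; subdir="maint"
  elif [ "${response}" = "NONE" ]; then
    event="UP"; subdir="up"
  else
    event="${response}"; subdir="maint"
  fi
  # Only a change in state causes hooks to run.
  [ "${event}" = "${state}" ] && return 0
  echo "${event} ${subdir}"
}

decide "NONE" ""                            # prints "UP up"
decide "NONE" "UP"                          # prints nothing
decide "" ""                                # prints "metadata_unavailable_without_state maint"
decide "" "UP"                              # prints nothing
decide "TERMINATE_ON_HOST_MAINTENANCE" "UP" # prints "TERMINATE_ON_HOST_MAINTENANCE maint"
```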
We should aim to deploy this everywhere. We can then add a few cheap scripts to check that it triggers as expected before hooking it into production services.
#!/usr/bin/env bash
set -uf
STATE=/dev/shm/gcp_watchdog.state
HOOK_DIR=/etc/gcp_watchdog/hooks
# Reset the state on startup. Others might consume this file, but starting empty ensures we always act on the first event we see.
touch "${STATE}";
truncate -s0 "${STATE}"
handle_event () {
  local event="$1"
  local subdir="$2"
  if does_state_match "${event}"; then
    # We are already in this state.
    return 0
  fi
  echo "Got new event: ${event}"
  # Find all executable hooks (skipping directories) and run up to 16 at a time, passing the event as argv1.
  # OR-ing true prevents xargs from quitting on error. -n1 is dropped because it conflicts with -I.
  # The hook path and the event are handed to sh as positional parameters so that sh does not re-parse them.
  find "${HOOK_DIR}/${subdir}" ! -type d -executable -print0 | xargs -0r -P16 -I% -- sh -c '"$1" "$2" || true' sh % "${event}"
  set_state "${event}"
}
set_state () {
  local event="$1"
  echo "${event}" > "${STATE}"
}
does_state_match () {
  # Succeeds when the saved state equals the given event.
  [ "$(cat "${STATE}" 2>/dev/null)" = "$1" ]
}
response="$(curl -s http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H 'Metadata-Flavor: Google')"
while true; do
  if [ -z "${response}" ]; then
    # Request failed.
    echo "request failed"
    if [ -z "$(cat "${STATE}" 2>/dev/null)" ]; then
      echo "no state registered. Assuming maintenance mode"
      handle_event "metadata_unavailable_without_state" "maint"
    fi
    # Do not wait for changes, but fetch the current value ASAP.
    response="$(curl -s http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H 'Metadata-Flavor: Google')"
    sleep 1
    continue
  elif [ "NONE" == "${response}" ]; then
    # All clear, poll again.
    handle_event "UP" "up"
  else
    handle_event "${response}" "maint"
  fi
  sleep 1
  # For subsequent requests: long-poll until the value changes.
  response="$(curl -s 'http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event?wait_for_change=true' -H 'Metadata-Flavor: Google')"
done
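
To check that hooks fire as expected before wiring in production services, it might help to dry-run a hook directory with a fake event instead of waiting for real maintenance. The sketch below is not part of the watchdog. dry_run_hooks, the scratch directory and the 10-record hook are all invented for this example, and hooks run one after another here rather than 16 in parallel.

```bash
#!/usr/bin/env bash
# Run every executable file in one hook directory with a fake event.
dry_run_hooks () {
  local dir="$1" event="$2" hook
  for hook in "${dir}"/*; do
    # Skip subdirectories and non-executables; symlinks to files are followed.
    if [ -f "${hook}" ] && [ -x "${hook}" ]; then
      "${hook}" "${event}" || echo "hook failed: ${hook}" >&2
    fi
  done
}

# Demo in a scratch directory so nothing under /etc is touched.
scratch="$(mktemp -d)"
mkdir -p "${scratch}/maint"
cat > "${scratch}/maint/10-record" <<'EOF'
#!/bin/sh
echo "hook saw: $1"
EOF
chmod +x "${scratch}/maint/10-record"

dry_run_hooks "${scratch}/maint" "TERMINATE_ON_HOST_MAINTENANCE"
# prints "hook saw: TERMINATE_ON_HOST_MAINTENANCE"
```

Pointing dry_run_hooks at /etc/gcp_watchdog/hooks/maint would exercise the real hooks, so that is only advisable once the hooks themselves are known to be harmless.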