Hook onto GCP instance maintenance events

During production#3891 (closed) we discovered that we can react to imminent maintenance events by hooking into a GCP API.

I wrote a little script that registers itself and listens for those events.

If everything is okay, we run all executable files in /etc/gcp_watchdog/hooks/up. If there is a problem or maintenance, we run the ones in /etc/gcp_watchdog/hooks/maint instead.
Every hook gets the event (or UP for okay events) as argv1. Hooks can also be symlinks.
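As an illustration of the argv1 contract, here is a minimal sketch of a hook that branches on the event it receives. The file name and the messages are hypothetical placeholders. MIGRATE_ON_HOST_MAINTENANCE and TERMINATE_ON_HOST_MAINTENANCE are the values the metadata endpoint is documented to return besides NONE, but a real hook should not assume the list is complete.

```shell
#!/usr/bin/env bash
# Hypothetical hook, e.g. /etc/gcp_watchdog/hooks/maint/10-describe (name is illustrative).
# The watchdog passes the event as the first argument.

describe_event () {
  case "$1" in
    UP)
      echo "instance is healthy, resuming normal operation" ;;
    MIGRATE_ON_HOST_MAINTENANCE)
      echo "live migration pending, expect a short pause" ;;
    TERMINATE_ON_HOST_MAINTENANCE)
      echo "instance will be terminated, draining now" ;;
    metadata_unavailable_without_state)
      echo "metadata API unreachable, treating as maintenance" ;;
    *)
      # Unknown events should still be visible rather than silently dropped.
      echo "unhandled event: $1" ;;
  esac
}

describe_event "${1:-}"
```

A real hook would replace the echo lines with whatever the service needs, such as draining connections or pausing batch jobs.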

/dev/shm/gcp_watchdog.state contains the current state: the last event, or UP.
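Other processes can read that file without registering a hook. The following is a rough sketch of such a consumer. The function name is_instance_up and the STATE_FILE override are assumptions made for illustration, not part of the watchdog.

```shell
#!/usr/bin/env bash
# Hypothetical consumer: succeed only when the watchdog last recorded UP.
# STATE_FILE can be overridden for testing; the default is the watchdog's state file.
STATE_FILE="${STATE_FILE:-/dev/shm/gcp_watchdog.state}"

is_instance_up () {
  local state
  # A missing or empty file means the watchdog has not recorded anything yet.
  state="$(cat "${STATE_FILE}" 2>/dev/null)"
  [ "${state}" = "UP" ]
}

if is_instance_up; then
  echo "ok to start long-running work"
else
  echo "maintenance or unknown state, holding off"
fi
```

Treating an empty or missing file as "not up" is a conservative choice. A consumer that prefers to carry on when the watchdog is not running would need to handle that case separately.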

If the metadata API is unavailable and no state has been saved during the current run, maintenance is assumed and metadata_unavailable_without_state is passed as the event. If a state has already been saved during the current run, it is kept and no hooks run.

Only a change in state will cause hooks to run.
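The rules above can be condensed into one pure function, which makes them easy to check without touching the metadata API. This is only a sketch: decide_action is a hypothetical helper that does not exist in the script, and it prints the event and hook subdirectory instead of running anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide_action RESPONSE SAVED_STATE
# Prints "<event> <subdir>" when hooks should run, prints nothing otherwise.
decide_action () {
  local response="$1" saved="$2" event subdir

  if [ -z "${response}" ]; then
    # Metadata API unavailable: keep any saved state, otherwise assume maintenance.
    [ -n "${saved}" ] && return 0
    event="metadata_unavailable_without_state"; subdir="maint"
  elif [ "${response}" = "NONE" ]; then
    event="UP"; subdir="up"
  else
    event="${response}"; subdir="maint"
  fi

  # Only a change in state triggers hooks.
  [ "${event}" = "${saved}" ] && return 0
  echo "${event} ${subdir}"
}

decide_action "NONE" ""                            # prints: UP up
decide_action "NONE" "UP"                          # prints nothing
decide_action "MIGRATE_ON_HOST_MAINTENANCE" "UP"   # prints: MIGRATE_ON_HOST_MAINTENANCE maint
decide_action "" ""                                # prints: metadata_unavailable_without_state maint
decide_action "" "UP"                              # prints nothing
```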

We should seek to deploy this everywhere. We can then add some cheap scripts to see if it triggers as expected before hooking into production services.
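One possible "cheap script" for that first rollout is a hook that only records what it was called with. This is a sketch, and the LOG_FILE location is an assumption rather than something the watchdog defines.

```shell
#!/usr/bin/env bash
# Hypothetical smoke-test hook: append one timestamped line per event.
# LOG_FILE is an assumed location; override it to log elsewhere.
LOG_FILE="${LOG_FILE:-/tmp/gcp_watchdog_hook.log}"

log_event () {
  # UTC timestamp followed by the event name, e.g. "2024-01-01T00:00:00Z UP".
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" >> "${LOG_FILE}"
}

log_event "${1:-no_event_given}"
```

Because hooks can be symlinks, the same file can be linked into both the up and maint directories, so the log shows every transition in order.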

#!/usr/bin/env bash

set -uf

STATE=/dev/shm/gcp_watchdog.state
HOOK_DIR=/etc/gcp_watchdog/hooks

# Reset the state on startup. Others might read this file, but we always want hooks to fire for the first event we see.
touch "${STATE}";
truncate -s0 "${STATE}"

handle_event () {
  local event="$1"
  local subdir="$2"

  if does_state_match "${event}"; then
    # We are already there.
    return 0;
  fi

  echo "Got new event: ${event}"

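  # Find all executable hooks (regular files or symlinks to them) and run up to 16 in parallel,
  # passing the event as argv1. OR-ing true prevents xargs from quitting on error.
  # The path and event are passed as arguments to sh instead of being spliced into the command string,
  # so spaces or shell metacharacters in either cannot break the invocation.
  find "${HOOK_DIR}/${subdir}" -xtype f -executable -print0 | xargs -0 -r -n1 -P16 -- sh -c '"$2" "$1" || true' hook "${event}"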

  set_state "${event}"
}

set_state () {
  local event="$1"

  echo "${event}" > "${STATE}"
}

does_state_match () {
  # Succeeds when the saved state equals the given event.
  [ "$(cat "${STATE}" 2>/dev/null)" == "$1" ]
}

METADATA_URL='http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event'

# Initial request: fetch the current value immediately.
# -f makes curl print nothing on HTTP errors, so an error page is not mistaken for an event.
response="$(curl -sf "${METADATA_URL}" -H 'Metadata-Flavor: Google')"
while true; do
  if [ -z "${response}" ]; then
    # Request failed (or returned an empty body).
    echo "request failed"
    if [ -z "$(cat "${STATE}" 2>/dev/null)" ]; then
      echo "no state registered. Assuming maintenance mode"
      handle_event "metadata_unavailable_without_state" "maint"
    fi
    # Do not wait for changes, but fetch the value ASAP.
    response="$(curl -sf "${METADATA_URL}" -H 'Metadata-Flavor: Google')"
    sleep 1
    continue
  elif [ "NONE" == "${response}" ]; then
    # All clear.
    handle_event "UP" "up"
  else
    handle_event "${response}" "maint"
  fi
  sleep 1
  # Subsequent requests: long-poll until the value changes.
  response="$(curl -sf "${METADATA_URL}?wait_for_change=true" -H 'Metadata-Flavor: Google')"
done
Edited by Hendrik Meyer (xLabber)