Operational issues with pages releases: healthcheck, alerting, NFS

A bit of a broad issue, but:

On 2020-02-19, chef upgraded gitlab-ee on the pages nodes, and pages restarted on each node in turn (luckily splayed in time, not coordinated). Starting at 00:34:24, the healthcheck returned a 503 on one node at a time, for about 5 minutes each, while the pages daemon restarted and read ~100K directories to build the domain configuration.

This triggered the "The web-pages service (main stage) has an error-ratio exceeding SLO" alert in #alerts-general, as it appears we are not excluding the healthcheck from the error counts. The graph looked like this:

Also, most curiously, I was on pages-01-stor-gprd at the time, looking at an osquery issue (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9237), and noticed that during that period the load on pages-01-stor-gprd dropped strangely, and significantly:

Note that this trails the start and completion of the upgrades (when the healthchecks stopped failing), only starting to waver after 5 nodes had upgraded, and returning to baseline at 01:42, nearly 10 minutes after the last failed healthcheck.

There are three followups required here:

1. Exclude the healthcheck from error-ratio recording rules
2. Look into whether we need to make upgrades on pages more controlled now (i.e. by the deployment process), rather than relying on chef splaying them conveniently
3. Deeper dive into the NFS issue; it is very strange and somewhat concerning absent any clear explanation.
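
To illustrate the first followup, here is a rough Python sketch of how healthcheck 503s can inflate an error ratio, and what excluding them does to the number. It is not the actual recording rule: the `/-/healthcheck` path, the `RequestSample` type, and all request counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """Aggregated request count for one (path, status) pair."""
    path: str
    status: int
    count: int


# Hypothetical path: the real healthcheck endpoint is not named in this issue.
HEALTHCHECK_PATHS = frozenset({"/-/healthcheck"})


def error_ratio(samples, exclude_paths=frozenset()):
    """Return 5xx requests divided by total requests, skipping excluded paths."""
    total = 0
    errors = 0
    for sample in samples:
        if sample.path in exclude_paths:
            continue
        total += sample.count
        if sample.status >= 500:
            errors += sample.count
    # With no traffic there is nothing to divide by; report no errors.
    return errors / total if total else 0.0


# Made-up numbers: one node failing its healthcheck while it restarts.
samples = [
    RequestSample("/index.html", 200, 9900),
    RequestSample("/index.html", 502, 5),
    RequestSample("/-/healthcheck", 503, 95),
]

raw = error_ratio(samples)
filtered = error_ratio(samples, exclude_paths=HEALTHCHECK_PATHS)
print(f"including healthcheck: {raw:.4%}")
print(f"excluding healthcheck: {filtered:.4%}")
```

In this invented sample, nearly all of the errors come from the healthcheck, so excluding it leaves only the errors that real user traffic saw.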

/cc @jarv @skarbek for the delivery/deployment question (2) /cc @krasio for interest.