Operational issues with pages releases: healthcheck, alerting, NFS
A bit of a broad issue, but:
On 2020-02-19 gitlab-ee got upgraded on the pages nodes by chef, and pages restarted on each node in turn (luckily splayed in time, not co-ordinated). Starting at 00:34:24 the healthcheck was returning a 503 on one node at a time, for about 5 minutes each, as the pages daemon restarted and read ~100K directories to build the domain configuration.
This caused the "The web-pages service (main stage) has an error-ratio exceeding SLO" alert to fire
in #alerts-general, as it appears we are not excluding the healthcheck from the error counts, and the graph looked like this:
Also, most curiously, I was on pages-01-stor-gprd at the time, looking at an osquery issue (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9237), and noticed that during the upgrades the load on pages-01-stor-gprd dropped strangely, and significantly:
Note that this trails the start and completion of the upgrades (when the healthchecks stopped failing), only starting to waver after 5 nodes had upgraded, and returning to baseline at 01:42, nearly 10 minutes after the last failed healthcheck.
There are three followups required here:
1. Exclude the healthcheck from error-ratio recording rules
2. Look into whether we need to make upgrades on pages more controlled now (i.e. by the deployment process), rather than relying on chef splaying them conveniently
3. Deeper dive into the NFS issue; it is very strange and somewhat concerning absent any clear explanation.
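
To illustrate why the first followup matters, here is a rough sketch of the error-ratio arithmetic with and without healthcheck requests. It is plain Python rather than an actual recording rule, and the `/healthcheck` path, the `Request` type, and the traffic numbers are all made up for illustration; the real path and request mix on the pages nodes will differ.

```python
from dataclasses import dataclass

# Hypothetical healthcheck path; the real one probed on the pages nodes may differ.
HEALTHCHECK_PATHS = frozenset({"/healthcheck"})


@dataclass(frozen=True)
class Request:
    path: str
    status: int


def error_ratio(requests, exclude_paths=frozenset()):
    """Return the fraction of 5xx responses, ignoring requests to excluded paths."""
    counted = [r for r in requests if r.path not in exclude_paths]
    if not counted:
        return 0.0
    errors = sum(1 for r in counted if r.status >= 500)
    return errors / len(counted)


# Made-up traffic for a single restarting node: real requests are served by the
# other nodes, while the healthcheck on this node keeps returning 503.
sample = [Request("/index.html", 200)] * 95 + [Request("/healthcheck", 503)] * 5

ratio_with_healthcheck = error_ratio(sample)
ratio_without_healthcheck = error_ratio(sample, HEALTHCHECK_PATHS)
```

With these invented numbers the healthcheck failures alone account for the whole error ratio, and excluding them brings it to zero. The same filtering would presumably be expressed as a label matcher in the recording rules, but which label identifies the healthcheck is something to confirm against the actual metrics.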
/cc @jarv @skarbek for the delivery/deployment question (2) /cc @krasio for interest.