health is down

Summary

All times UTC.

2020-04-25

00:09 - @ggillies received PagerDuty alert https://gitlab.pagerduty.com/incidents/P8P88HF reporting that snowplow.trx.gitlab.net was down
00:40 - @ggillies spent time locating the machines in AWS, obtaining access to one (SSH key in 1Password), and logging in to the instance to debug the problem
00:42 - @ggillies found the problem in the cloud-init user-data.sh script which sets up the instance. He manually fixed one node so that it would report healthy
00:44 - @ggillies alert https://gitlab.pagerduty.com/incidents/P8P88HF resolved
00:47 - @ggillies opened MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1640 to fix the issue
00:49 - PagerDuty alert https://gitlab.pagerduty.com/incidents/P70ITHQ fired as all instances in the autoscaling group were restarting due to failing health checks (overloaded)
00:49 - Incident declared from Slack
01:00 - @ggillies merged the MR and ran terraform apply to roll it out to the running configuration
01:08 - @cindy opened MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1641 to correct the issue in more places (and because @ggillies's original MR wasn't correct). @ggillies approved
01:12 - @cindy applied MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1641 via terraform
01:14 - PagerDuty alert https://gitlab.pagerduty.com/incidents/P70ITHQ?utm_source=slack&utm_campaign=channel resolved
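
For reference, the time each alert stayed open can be derived from the timestamps in the timeline above. The snippet below is a minimal sketch of that arithmetic only: the `ALERTS` mapping and the `minutes_between` helper are illustrative names, not part of any incident tooling, and the times are copied by hand from the timeline.

```python
from datetime import datetime

# Fire and resolve times (UTC, 2020-04-25), copied by hand from the timeline.
# The dictionary layout is an assumption made for this illustration.
ALERTS = {
    "P8P88HF": ("00:09", "00:44"),
    "P70ITHQ": ("00:49", "01:14"),
}


def minutes_between(start, end):
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = {alert: minutes_between(*times) for alert, times in ALERTS.items()}

for alert, minutes in durations.items():
    print(f"{alert}: {minutes} min")
# P8P88HF: 35 min
# P70ITHQ: 25 min
```

From the first page at 00:09 to the final resolution at 01:14, the same helper gives 65 minutes.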

Incident declared by ggillies in Slack via the /incident declare command.

If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).

Edited Apr 24, 2020 by Graeme Gillies