Pingdom check check:https://snowplow.trx.gitlab.net/health is down
Summary
Pingdom check check:https://snowplow.trx.gitlab.net/health is down
Timeline
All times UTC.
2020-04-25
- 00:09 - @ggillies received PagerDuty alert https://gitlab.pagerduty.com/incidents/P8P88HF saying snowplow.trx.gitlab.net was down
- 00:40 - @ggillies spent time locating the machines in AWS, obtaining access to one (SSH key in 1Password), and logging onto the instance to debug the problem
- 00:42 - @ggillies found the problem in the cloud-init user-data.sh script which sets up the instance. He manually fixed one node so that it would report healthy
- 00:44 - @ggillies alert https://gitlab.pagerduty.com/incidents/P8P88HF resolved
- 00:47 - @ggillies opened MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1640 to fix the issue
- 00:49 - PagerDuty alert https://gitlab.pagerduty.com/incidents/P70ITHQ fired as all instances in the autoscaling group were restarting due to failing health checks (overloaded)
- 00:49 - Incident declared from Slack
- 01:00 - @ggillies merged the MR and ran a terraform apply to apply it to the running configuration
- 01:08 - @cindy opened MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1641 to correct the issue in more places (and because @ggillies's original MR wasn't correct). @ggillies approved
- 01:12 - @cindy applied MR https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/1641 via Terraform
- 01:14 - Pagerduty alert https://gitlab.pagerduty.com/incidents/P70ITHQ?utm_source=slack&utm_campaign=channel resolved
Details
Incident for Pagerduty Alert https://gitlab.pagerduty.com/incidents/P8P88HF
Source
Incident declared by ggillies in Slack via /incident declare command.
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)
Edited by Graeme Gillies