Increased CPU load on web nodes
- Main Slack thread in #backend: https://gitlab.slack.com/archives/C8HG8D9MY/p1558080883153200
- Slack thread in #g_verify: https://gitlab.slack.com/archives/C0SFP840G/p1558084380233200
Summary
Service(s) affected:
Team attribution:
Minutes downtime or degradation:
- Post-deployment patch: https://ops.gitlab.net/gitlab-com/gl-infra/patcher/merge_requests/88
Timeline
2019-05-16
- 20:27 UTC deployer: Marin Jankovski is starting a deploy pipeline of 11.11.0-rc2.ee.0 on gprd
2019-05-17
- 00:03 UTC patcher: Alex Hanselka is starting a deploy pipeline of post-deployment-patch on gstg
- 00:12 UTC spike in CPU utilization on all web nodes in gprd
- 00:19 UTC patcher: Alex Hanselka finished a deploy of post-deployment-patch on gstg
- 00:19 UTC patcher: Alex Hanselka is starting a deploy pipeline of post-deployment-patch on cny
- 00:23 UTC patcher: Alex Hanselka finished a deploy of post-deployment-patch on cny
- 00:25 UTC patcher: Alex Hanselka is starting a deploy pipeline of post-deployment-patch on gprd
- 00:51 UTC deployer: Marin Jankovski finished a deploy of 11.11.0-rc2.ee.0 on gprd
- 01:27 UTC patcher: Alex Hanselka finished a deploy of post-deployment-patch on gprd
- 07:37 UTC HighCPU alerts on web nodes
- 07:50 UTC GitLabComLatencyWebCritical alerts
- 08:20 UTC status.io incident opened
- 08:53 UTC blocked all paths ending with deploy_keys.json in HAProxy, to no effect
- 10:34 UTC deployer: John Jarvis is starting a deploy pipeline of 11.11.0-rc1.ee.0 on gprd (rollback)
- 14:35 UTC ha-ctl process killed manually to unblock the rollback deployment pipeline
- 14:50 UTC GitLabComLatencyWebCritical resolved
- 15:26 UTC status.io incident resolved