Increased CPU load on web nodes

Please note: if the incident relates to sensitive data or is security-related, consider labeling this issue with security and marking it confidential.


  • Main Slack thread in #backend: https://gitlab.slack.com/archives/C8HG8D9MY/p1558080883153200
  • Slack thread in #g_verify: https://gitlab.slack.com/archives/C0SFP840G/p1558084380233200

Summary

A brief summary of what happened. Try to make it as executive-friendly as possible.

Service(s) affected:
Team attribution:
Minutes downtime or degradation:

Screen_Shot_2019-05-17_at_10.04.38_AM

  • Post deployment patch - https://ops.gitlab.net/gitlab-com/gl-infra/patcher/merge_requests/88

Timeline

2019-05-16

  • 20:27 UTC deployer: Marin Jankovski is starting a deploy pipeline of 11.11.0-rc2.ee.0 on gprd

2019-05-17

  • 00:03 UTC patcher: Alex Hanselka is starting a deploy pipeline of post-deployment-patch on gstg
  • 00:12 UTC spike in CPU utilization on all web nodes in gprd
  • 00:19 UTC patcher: Alex Hanselka finished a deploy of post-deployment-patch on gstg
  • 00:19 UTC patcher: Alex Hanselka is starting a deploy pipeline of post-deployment-patch on cny
  • 00:23 UTC patcher: Alex Hanselka finished a deploy of post-deployment-patch on cny
  • 00:25 UTC patcher: Alex Hanselka is starting a deploy pipeline of post-deployment-patch on gprd
  • 00:51 UTC deployer: Marin Jankovski finished a deploy of 11.11.0-rc2.ee.0 on gprd
  • 01:27 UTC patcher: Alex Hanselka finished a deploy of post-deployment-patch on gprd
  • 07:37 UTC HighCPU alerts on web nodes
  • 07:50 UTC GitLabComLatencyWebCritical alerts
  • 08:20 UTC status.io incident opened
  • 08:53 UTC blocked all paths ending in deploy_keys.json in HAProxy, to no effect
  • 10:34 UTC deployer: John Jarvis is starting a deploy pipeline of 11.11.0-rc1.ee.0 on gprd (Rollback)
  • 14:35 UTC ha-ctl process killed manually to make the rollback deployment pipeline move again
  • 14:50 UTC GitLabComLatencyWebCritical resolved
  • 15:26 UTC status.io incident resolved
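
The intervals between timeline entries can be worked out mechanically. The following is a minimal sketch in Python, assuming the timestamps listed above; the helper names (`utc`, `minutes_between`) and the choice of which events to pair are illustrative only and are not an official measure of downtime or degradation.

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timeline stamp as a UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline stamps."""
    return int((utc(end) - utc(start)).total_seconds() // 60)


# Timestamps copied from the timeline; the pairings are an assumption
# made for illustration, not a statement of the degradation window.
INTERVALS = {
    "CPU spike to HighCPU alerts": ("2019-05-17 00:12", "2019-05-17 07:37"),
    "Latency alert to latency resolved": ("2019-05-17 07:50", "2019-05-17 14:50"),
    "status.io opened to resolved": ("2019-05-17 08:20", "2019-05-17 15:26"),
}

if __name__ == "__main__":
    for label, (start, end) in INTERVALS.items():
        print(f"{label}: {minutes_between(start, end)} min")
```

Which pair of events best represents the user-facing degradation is a judgment for the incident owners; the sketch only does the arithmetic.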
Edited Aug 03, 2020 by 🤖 GitLab Bot 🤖