Increased error rates in the pages service

Summary

All times UTC.

2020-04-25

10:53 - Incident declared from Slack
11:00 - Incident determined to be similar to #1973 (closed)
11:05 - Team begins pulling diagnostic information from the node, to allow for root cause analysis after the node is drained of traffic and re-added to the load balancer.
11:36 - Draining of traffic begins.
11:43 - The two affected nodes, web-pages-06 and web-pages-03, are fully drained of traffic, and the error rate is back to normal.
11:50 - Decision made to re-add web-pages-06 and web-pages-03.
11:57 - web-pages-03 was re-added to the haproxy pool
12:00 - web-pages-06 was re-added to the haproxy pool
12:01 - Changes in https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3233 begin to be applied to all nodes, two at a time.
12:46 - Changes fully applied and incident call ended.

Incident declared by mwasilewski in Slack via the /incident declare command.

If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).

Edited Apr 24, 2020 by Brent Newton