increased error rates in the pages service
Summary
increased error rates in the pages service
Timeline
All times UTC.
2020-04-25
- 10:53 - Incident declared from Slack
- 11:00 - Incident determined to be similar to #1973 (closed)
- 11:05 - Team begins pulling diagnostic information from the node, to allow for root cause analysis after the node is drained of traffic and re-added to the load balancer.
- 11:36 - Draining of traffic begins.
- 11:43 - The two affected nodes, web-pages-06 and web-pages-03, are fully drained of traffic and the error rate is back to normal.
- 11:50 - web-pages-06 and web-pages-03 are to be re-added.
- 11:57 - web-pages-03 was re-added to the haproxy pool.
- 12:00 - web-pages-06 was re-added to the haproxy pool.
- 12:01 - The changes in https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3233 begin to be applied to all nodes, two by two.
- 12:46 - Changes fully applied and incident call ended.
Details
increased error rates in the pages service
Source
Incident declared by mwasilewski in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)
Edited by Brent Newton