increased error rates on web-pages-06
Summary
increased error rates on web-pages-06
Timeline
All times UTC.
2020-04-22
- 11:10 - EOC is paged
- 11:30 - the log field that indicates the issue is identified ( https://log.gprd.gitlab.net/goto/e69122cf0faad671b14827e762782380 ); work starts on determining whether the service is stateless, which is a requirement for draining the nodes in HAProxy
- 11:37 - Incident declared from Slack
- 11:38 - web-pages-06 is drained
- 11:47 - web-pages-03 is drained
- 11:49 - error rates drop to levels similar to those before the incident started
- 12:07 - gitlab-pages on web-pages-03 is restarted and the node is added to the load balancer
- 12:10 - gitlab-pages on web-pages-06 is not restarted and the node is added to the load balancer
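The drain and re-add steps in the timeline can be sketched with the HAProxy Runtime API, whose `set server <backend>/<server> state drain|ready` command changes a server's state. This is a minimal illustration, not the tooling used during the incident: the backend name `pages_https` and the admin socket path `/run/haproxy/admin.sock` are assumptions, and the helper function names are hypothetical.

```python
import socket


def drain_command(backend: str, server: str) -> str:
    """Build the Runtime API command that stops new traffic to a server."""
    return f"set server {backend}/{server} state drain\n"


def ready_command(backend: str, server: str) -> str:
    """Build the Runtime API command that returns a server to rotation."""
    return f"set server {backend}/{server} state ready\n"


def send_command(command: str, socket_path: str = "/run/haproxy/admin.sock") -> str:
    """Send one command to the HAProxy admin socket and return the reply.

    The socket path is an assumption; it must match the `stats socket`
    line in the HAProxy configuration, with `level admin` set.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode())
        chunks = []
        # In non-interactive mode HAProxy closes the connection after replying.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()
```

On a load balancer host, `send_command(drain_command("pages_https", "web-pages-06"))` would drain the node, and the matching `ready_command` call would add it back after the restart.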
Details
web-pages-06 is reporting the following error: "could not fetch domain information from a source", which suggests it is having a problem connecting to the API nodes.
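To see which nodes are affected, occurrences of this error can be counted per host. The sketch below assumes JSON-formatted log lines with `hostname` and `msg` fields; those field names are assumptions for illustration and may not match the actual gitlab-pages log schema.

```python
import json
from collections import Counter

ERROR_TEXT = "could not fetch domain information from a source"


def count_errors_by_host(lines):
    """Count log lines containing ERROR_TEXT, grouped by host.

    Assumes one JSON object per line with hypothetical `hostname`
    and `msg` fields; lines that are not JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if ERROR_TEXT in str(entry.get("msg", "")):
            counts[entry.get("hostname", "unknown")] += 1
    return counts
```

Passing an open log file, for example `count_errors_by_host(open("pages.log"))` with a hypothetical file name, returns a `Counter` keyed by host, which makes it easy to check whether the errors are confined to one or two nodes.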
Source
Incident declared by mwasilewski in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Edited by Michal Wasilewski