2020-06-30: #22117: Firing 2 - IncreasedErrorRateOtherBackends
/label incident IncidentActive
Summary
#22117: Firing 2 - IncreasedErrorRateOtherBackends
PagerDuty alert https://gitlab.pagerduty.com/incidents/POFZ8EK, solely affecting api-04
Timeline
All times UTC.
2020-06-30
- 00:25 - @ggillies notices that api-04 is not showing any state in haproxy and, being unable to connect to the box, reboots it through the gcloud console
- 00:29 - Alert https://gitlab.pagerduty.com/incidents/POFZ8EK is fired
- 00:35 - @ggillies uses `set-server-state` to set api-04 to drain for the api and ci-api frontends (see the command sketch after the server state output below)
- 00:37 - Alert https://gitlab.pagerduty.com/incidents/PRVHF9N is fired
- 00:56 - @ggillies declares an incident in Slack using the `/incident declare` command.
- 00:57 - Alert https://gitlab.pagerduty.com/incidents/P3ULFNP is fired. At this point monitoring of api-04 reveals that haproxy is still sending traffic to it, despite it being marked as drain (confirmed multiple times)
- 00:57 - @ggillies does a `gitlab-ctl stop` on api-04 as no other method is causing haproxy to stop sending traffic to it
- 00:59 - All alerts are resolved
- 01:29 - @ggillies puts api-04 into `maint` mode in haproxy using `bundle exec ./bin/set-server-state gprd maint api-04`, confirms multiple times it is in maint, then does `gitlab-ctl start` on api-04
- 01:30 - Alert https://gitlab.pagerduty.com/incidents/P5SZHPS fires and @ggillies notices that api-04 is somehow receiving traffic and throwing errors. @ggillies immediately does a `gitlab-ctl stop` on the node again (see the haproxy runtime API sketch below for a direct way to cross-check what haproxy reports)
- 01:35 - Alert https://gitlab.pagerduty.com/incidents/P5SZHPS resolves
Server is still apparently in `maint` mode
```
$ bundle exec ./bin/get-server-state gprd api-04
Fetching server state...
8 fe api/api-04-sv-gprd: MAINT
3 fe-ci ci_api/api-04-sv-gprd: MAINT
```
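For reference, the drain at 00:35 would have been issued with the same wrapper script that was later used for `maint`. The exact drain invocation is not captured above, so the sketch below is an assumption inferred from the `maint` command in the timeline:

```shell
# Inferred invocation, not a verbatim copy of the session; run from the checkout
# that provides ./bin/set-server-state (path assumed).
bundle exec ./bin/set-server-state gprd drain api-04   # stop sending new connections to api-04
bundle exec ./bin/get-server-state gprd api-04         # confirm api and ci_api backends report DRAIN
```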
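When the wrapper script and the observed traffic disagree, as they did here, one cross-check is to ask haproxy itself through its runtime API. This is a minimal sketch, not the procedure used during the incident: the admin socket path is an assumption, the commands would need to be run on each haproxy front-end node, and the backend/server names are taken from the `get-server-state` output above.

```shell
# Sketch only: socket path is assumed; run on each haproxy front-end node.
# Show the state haproxy currently holds for servers in the api backend.
echo "show servers state api" | sudo socat stdio UNIX-CONNECT:/run/haproxy/admin.sock

# Set the server state directly through the runtime API.
echo "set server api/api-04-sv-gprd state maint" | sudo socat stdio UNIX-CONNECT:/run/haproxy/admin.sock
```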
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines