2019-11-06: increased latency and error rates across the fleet
Summary
At 10:30 UTC on 2019-11-06, we observed a dip in the Apdex score and a spike in error ratios for web and API requests to GitLab.com. By 11:00 UTC, systems were operating normally again.
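For context, Apdex here refers to the standard Application Performance Index: the share of requests that complete within a satisfactory latency threshold, with "tolerating" requests counted at half weight. The specific thresholds GitLab.com uses for web and API requests are not given in these notes.

```math
\mathrm{Apdex} = \frac{\text{satisfied} + \tfrac{1}{2}\,\text{tolerating}}{\text{total samples}}
```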
We are investigating an increase in slow queries that is affecting front-end performance, resulting in higher error rates and causing queues to back up.
The spike in slow queries appeared at 10:30, during which all of GitLab.com was unavailable; we are still seeing slow queries, which continue to result in increased error rates.
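As a minimal sketch of how such slow queries might be surfaced on a PostgreSQL database like GitLab.com's, one can rank statements by mean execution time via the `pg_stat_statements` extension. The connection string is hypothetical, and the column name assumes PostgreSQL 12 or earlier (later versions renamed `mean_time` to `mean_exec_time`); the incident notes do not describe the actual tooling used.

```python
# Sketch: list the ten slowest statements by mean execution time.
# Assumes the pg_stat_statements extension is installed and enabled.
import psycopg2

conn = psycopg2.connect("dbname=gitlabhq_production")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # mean_time / total_time are reported in milliseconds.
    cur.execute("""
        SELECT query, calls, mean_time, total_time
        FROM pg_stat_statements
        ORDER BY mean_time DESC
        LIMIT 10;
    """)
    for query, calls, mean_ms, total_ms in cur.fetchall():
        print(f"{mean_ms:10.1f} ms avg  {calls:8d} calls  {query[:80]}")
```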
Timeline
All times UTC.
2019-11-06
- 10:30 - dip begins in Apdex with increase in Errors
- 10:38 - first pages for API service come in
- 10:45 - first pages for increased error rate across fleet come in
- 10:56 - end of observed dip
- 11:03 - manager on call paged
- 11:22 - disabled a feature flag that may have been related (see the sketch after this timeline)
- 14:50 - reverted a configuration change for the metrics endpoint that was made around the same time the incident started, though we are fairly sure it is unrelated: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2109
- 15:09 - canary switched back to Unicorn: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2111
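For the 11:22 step, the notes do not identify which feature flag was disabled. A minimal sketch of disabling a flag through GitLab's admin Features API (`POST /api/v4/features/:name`) follows; the instance URL, token, and flag name are hypothetical placeholders.

```python
# Sketch: fully disable a feature flag via the admin Features API.
# Requires an admin-scoped personal access token.
import requests

GITLAB_URL = "https://gitlab.example.com"   # hypothetical instance
ADMIN_TOKEN = "REDACTED"                    # admin personal access token

resp = requests.post(
    f"{GITLAB_URL}/api/v4/features/some_suspect_flag",  # hypothetical flag name
    headers={"PRIVATE-TOKEN": ADMIN_TOKEN},
    data={"value": "false"},  # "false" disables the flag for everyone
)
resp.raise_for_status()
print(resp.json())
```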