2019-11-06: increased latency and error rates across the fleet
Summary
At 10:30 UTC on 2019-11-06, we observed a dip in the Apdex score and a spike in error ratios for web and API requests to GitLab.com. By 11:00 UTC, systems were operating normally again.
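For context, Apdex here refers to the standard Application Performance Index: the share of requests that complete within a satisfactory latency threshold, with "tolerating" requests counted at half weight. The specific thresholds GitLab.com uses for web and API requests are not given in these notes.

```math
\mathrm{Apdex} = \frac{\text{satisfied} + \tfrac{1}{2}\,\text{tolerating}}{\text{total samples}}
```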
We are investigating an increase in slow queries that is affecting front-end performance, resulting in higher error rates and causing queues to back up.
The spike in slow queries appeared at 10:30, during which all of GitLab.com was unavailable; we are still seeing slow queries, which continue to result in increased error rates.
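As a minimal sketch of how such slow queries might be surfaced on a PostgreSQL database like GitLab.com's, one can rank statements by mean execution time via the `pg_stat_statements` extension. The connection string is hypothetical, and the column name assumes PostgreSQL 12 or earlier (later versions renamed `mean_time` to `mean_exec_time`); the incident notes do not describe the actual tooling used.

```python
# Sketch: list the ten slowest statements by mean execution time.
# Assumes the pg_stat_statements extension is installed and enabled.
import psycopg2

conn = psycopg2.connect("dbname=gitlabhq_production")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # mean_time / total_time are reported in milliseconds.
    cur.execute("""
        SELECT query, calls, mean_time, total_time
        FROM pg_stat_statements
        ORDER BY mean_time DESC
        LIMIT 10;
    """)
    for query, calls, mean_ms, total_ms in cur.fetchall():
        print(f"{mean_ms:10.1f} ms avg  {calls:8d} calls  {query[:80]}")
```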
Timeline
All times UTC.
2019-11-06
- 10:30 - dip begins in Apdex with increase in Errors
- 10:38 - first pages for API service come in
- 10:45 - first pages for increased error rate across fleet come in
- 10:56 - end of observed dip
- 11:03 - manager on call paged
- 11:22 - disabled a feature flag that may have been related (see the sketch after this timeline)
- 14:50 - reverted a configuration change for the metrics endpoint that was made around the same time the incident started, though we are fairly sure it is unrelated: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2109
- 15:09 - canary switched back to Unicorn: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2111
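For the 11:22 step, the notes do not identify which feature flag was disabled. A minimal sketch of disabling a flag through GitLab's admin Features API (`POST /api/v4/features/:name`) follows; the instance URL, token, and flag name are hypothetical placeholders.

```python
# Sketch: fully disable a feature flag via the admin Features API.
# Requires an admin-scoped personal access token.
import requests

GITLAB_URL = "https://gitlab.example.com"   # hypothetical instance
ADMIN_TOKEN = "REDACTED"                    # admin personal access token

resp = requests.post(
    f"{GITLAB_URL}/api/v4/features/some_suspect_flag",  # hypothetical flag name
    headers={"PRIVATE-TOKEN": ADMIN_TOKEN},
    data={"value": "false"},  # "false" disables the flag for everyone
)
resp.raise_for_status()
print(resp.json())
```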