2019-12-05: Several apdex SLO violation alerts encountered; users reporting slow-downs

Summary

We encountered multiple apdex SLO violation alerts, along with reports from several users, including Cynthia and Gerir, that the GitLab web application was slow.
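For readers unfamiliar with the metric behind these alerts, the following is a minimal sketch of how an apdex score is commonly computed and compared against an SLO. The latency thresholds (`satisfied_s`, `tolerated_s`) and the SLO target (`slo`) are illustrative assumptions, not the values configured for GitLab.com alerting.

```python
def apdex(latencies_s, satisfied_s=1.0, tolerated_s=4.0):
    """Standard Apdex: (satisfied + tolerating / 2) / total samples.

    The threshold defaults are hypothetical, chosen only for illustration.
    """
    if not latencies_s:
        raise ValueError("apdex needs at least one latency sample")
    # Requests at or under the satisfied threshold count in full.
    satisfied = sum(1 for t in latencies_s if t <= satisfied_s)
    # Requests between the two thresholds count as half.
    tolerating = sum(1 for t in latencies_s if satisfied_s < t <= tolerated_s)
    # Anything slower than tolerated_s contributes nothing to the score.
    return (satisfied + tolerating / 2) / len(latencies_s)


def violates_slo(score, slo=0.95):
    """Return True when the score falls below the (assumed) SLO target."""
    return score < slo
```

Under these assumed thresholds, a window in which a few requests become slow is enough to pull the score below the target, which is consistent with a short burst of alerts that resolves once latencies recover.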

Working hypothesis:

EOCs have been seeing canary stage apdex SLO violations throughout the week. The alerts associated with this incident included a main stage gitaly alert. Examining the load for gitaly, we can see that file-02 continues to exhibit higher load than any other gitaly node.

[Screenshot: Screen_Shot_2019-12-05_at_10.06.42_AM]

In fact, file-02 has been a concern for many of our engineers lately, and we are working under the premise that the gitlab-org/gitlab project is acting as a noisy neighbor. Work is underway to migrate that repository to its own gitaly canary node.

In summary, we hypothesize that high load on file-02 has been particularly disruptive for some of the canary stage web components. We expect such disruptions to continue at least until the gitlab-org/gitlab project has been migrated to a new node, after which we will reassess the situation with our primary suspect eliminated.

Alternative hypothesis:

@craigf notes:

ah it looks like the bad-cny alert is correlated with a deployment. Could the problem described in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8507 affect deployments too? That would be troubling, as I thought app tier nodes were removed from LB consideration and drained before the server process is bounced / reloaded

Timeline

All times UTC.

2019-12-05

  • 15:56 - Multiple apdex SLO violation alerts incoming
  • 15:58 - All apdex SLO violation alerts resolved
Edited Dec 05, 2019 by Nels Nelson