2019-12-05: Several apdex SLO violation alerts encountered; users reporting slow-downs
Summary
Multiple apdex SLO violation alerts fired, and several users, including Cynthia and Gerir, reported slowness in the GitLab web application.
Working hypothesis:
EOCs have been seeing canary stage apdex SLO violations throughout the week. The alerts associated with this incident included a main stage gitaly alert, and examining the load for gitaly, we can see that file-02 continues to exhibit higher load than any other gitaly node.
In fact, file-02 has been the subject of some concern for many of our engineers lately, and we are working under the premise that the gitlab-org/gitlab project is being a noisy neighbor. Work is underway to migrate that repository to its own gitaly canary node.
In summary, we hypothesize that high load on file-02 has been particularly disruptive for some of the canary stage web components. We expect such disruptions to continue at least until we have completed the migration of the gitlab-org/gitlab project to a new node, after which we will reassess the situation, having eliminated our primary suspect.
Alternative hypothesis:
@craigf notes:
ah it looks like the bad-cny alert is correlated with a deployment. Could the problem described in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8507 affect deployments too? That would be troubling, as I thought app tier nodes were removed from LB consideration and drained before the server process is bounced / reloaded
Timeline
All times UTC.
2019-12-05
- 15:56 - Multiple apdex SLO violation alerts incoming
- 15:58 - All apdex SLO violation alerts resolved
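
For readers unfamiliar with the metric behind these alerts, the following is a minimal sketch of the standard Apdex calculation: requests at or under a threshold T count as satisfied, requests between T and 4T count as half, and the rest count as zero. The function names, the 1-second threshold, and the 0.9 SLO below are illustrative assumptions, not the values or the implementation used by our monitoring, which may derive the score differently (for example, from histogram buckets).

```python
def apdex(latencies_s, threshold_s):
    """Standard Apdex score for a list of request latencies in seconds.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T (counted at half weight)
    frustrated: latency > 4T (counted as zero)
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    satisfied = sum(1 for x in latencies_s if x <= threshold_s)
    tolerating = sum(1 for x in latencies_s if threshold_s < x <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(latencies_s)


def violates_slo(score, slo):
    """True when the Apdex score has dropped below the SLO target."""
    return score < slo


# Hypothetical samples: three fast requests, one slow, one very slow.
samples = [0.1, 0.2, 0.5, 1.5, 5.0]
score = apdex(samples, threshold_s=1.0)  # assumed threshold
print(score, violates_slo(score, slo=0.9))  # assumed SLO target
```

A short burst of slow requests, such as the two-minute window in the timeline above, can be enough to pull a score like this below its target and then recover once latencies return to normal.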