2019-12-05: Several apdex SLO violation alerts encountered; users reporting slow-downs
Encountered multiple apdex SLO violation alerts. Multiple reports from users including Cynthia and Gerir about slowness of the GitLab web application.
EOCs have been seeing canary stage apdex SLO violations throughout the week. The alerts associated with this incident included a main stage gitaly alert, and examining the load for gitaly, we can see that file-02 continues to exhibit higher load than any other gitaly node.
file-02 has been the subject of some concern for many of our engineers lately, and we are working under the premise that the gitlab-org/gitlab project is being a noisy neighbor. Work is underway to migrate that repository to its own gitaly canary node.
In summary, we hypothesize that high load on file-02 has been particularly disruptive for some of the canary stage web components. We expect such disruptions to continue at least until we have completed the migration of the gitlab-org/gitlab project to a new node, after which we will re-assess the situation, having eliminated our primary suspect.
ah it looks like the bad-cny alert is correlated with a deployment. Could the problem described in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8507 affect deployments too? That would be troubling, as I thought app tier nodes were removed from LB consideration and drained before the server process is bounced / reloaded
All times UTC.
- 15:56 - Multiple apdex SLO violation alerts incoming
- 15:58 - All apdex SLO violation alerts resolved
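
For readers less familiar with the apdex score behind these alerts, here is a minimal sketch of the standard Apdex calculation and a threshold check. The request counts and the 0.95 SLO value are hypothetical, chosen only for illustration; they are not taken from this incident or from our actual alerting configuration.

```python
def apdex(satisfied: int, tolerating: int, total: int) -> float:
    """Standard Apdex score: (satisfied + tolerating / 2) / total.

    satisfied  - requests that completed within the target latency
    tolerating - requests slower than the target but within the tolerable limit
    total      - all requests in the window, including frustrated ones
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (satisfied + tolerating / 2) / total


def violates_slo(score: float, slo: float) -> bool:
    """Return True when the apdex score falls below the SLO threshold."""
    return score < slo


# Hypothetical window: 1000 requests, 900 satisfied, 60 tolerating, 40 frustrated.
score = apdex(satisfied=900, tolerating=60, total=1000)

# Hypothetical SLO threshold of 0.95 (an assumption, not our real setting).
alerting = violates_slo(score, slo=0.95)
```

A brief violation like the one in the timeline above would correspond to the score dipping below the threshold for a couple of minutes and then recovering.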