2020-11-06: merge requests dashboard timing out for some users
Summary
We're seeing more query timeouts on the merge requests dashboard for some users (for instance, https://gitlab.com/dashboard/merge_requests?assignee_username=smcgivern for me). Logs: https://log.gprd.gitlab.net/goto/828433d40a4fc1cd0057724cd2d4694f
The most affected users are listed at https://log.gprd.gitlab.net/goto/f10f31cfc140642746e8e1d4be29c7aa, but some people can load their dashboard fine.
Context will be added here as we investigate.
Timeline
All times UTC.
2020-11-06
- 10:10 - Sean asks in #is-this-known and Nick creates an issue about handling the timeouts here: gitlab-org/gitlab#277355
- 11:35 - Sean asks in #database due to some odd query plans: https://gitlab.slack.com/archives/C3NBYFJ6N/p1604662538407500
- 13:16 - Sean creates an issue based on that thread: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11840
- 13:39 - smcgivern declares incident in Slack.
- 14:00 - In a call, @bjk-gitlab demonstrates that the query from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11840 does not time out on any replica.
- 14:07 - @grzesiek creates gitlab-org/gitlab#277386 (closed) linking to the code change that caused this.
- 14:18 - Sean creates a revert MR: gitlab-org/gitlab!47074 (merged)
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (e.g. external customers, internal customers)
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?