2020-11-06: merge requests dashboard timing out for some users

added IncidentActive Source::IMAIncidentDeclare incident severity3 labels

assigned to @bjk-gitlab and @brentnewton

changed the description

Possible corrective action: it would also be very helpful if statement timeouts from load balancing hosts included the host somewhere in the exception. Right now we can't tell from Sentry or our logs if this is one host, all of them, or something in between.
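As a rough illustration of that corrective action, here is a minimal Python sketch of tagging a statement timeout with the host it came from. Everything in it is hypothetical: `StatementTimeout`, `HostTaggedTimeout`, `run_on_host`, and the host name are stand-ins, not GitLab's load balancing code or any driver's real API. It only shows the idea of re-raising with the host in the message so Sentry and logs can tell replicas apart.

```python
class StatementTimeout(Exception):
    """Hypothetical stand-in for a database driver's statement-timeout error."""


class HostTaggedTimeout(StatementTimeout):
    """Statement timeout that records which replica host raised it."""

    def __init__(self, host, original):
        # Put the host in the message so it shows up in error trackers and logs.
        super().__init__(f"{original} (host: {host})")
        self.host = host
        self.original = original


def run_on_host(host, execute, sql):
    """Run `sql` through `execute`; re-raise timeouts tagged with `host`."""
    try:
        return execute(sql)
    except HostTaggedTimeout:
        # Already tagged further down the stack; do not wrap twice.
        raise
    except StatementTimeout as exc:
        raise HostTaggedTimeout(host, exc) from exc


def _always_times_out(sql):
    # Simulates a replica that cancels the statement.
    raise StatementTimeout("canceling statement due to statement timeout")


def demo():
    """Return the (host, message) pair recorded on the tagged exception."""
    try:
        run_on_host("replica-03.example", _always_times_out, "SELECT 1")
    except HostTaggedTimeout as exc:
        return exc.host, str(exc)
```

Grouping such errors by the `host` attribute would answer the "one host, all of them, or something in between" question directly.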

We also don't have the actual query in our logs or Sentry. This meant I made a fatal mistake that, thankfully, @grzesiek fixed: the query was different on canary and main. The query I posted was much slower on one replica, but it didn't time out, and the slowness was due to not having data in cache. That was a red herring as far as this issue was concerned.

We actually have it now because of gitlab-org/gitlab#235161 (closed). It didn't show up for this particular exception though 🤔

Thanks, I thought I'd seen something related. I'll look into it more on Monday.

Hi @ops-gitlab-net,

This incident issue does not have any service attribution. Please add one or more of the appropriate service labels, which are prefixed with Service:.

Please also add a group:: scoped label to help trace to a correct engineering group.

Thanks for your help! 🖤

You are welcome to help improve this comment.

added auto updated label

added ServicePostgres label

It seems we had 2 events today correlated with ~~slow queries~~ time spent in queries (mainly for the raw controller and merge_requests::content_controller) and 503s:

Grafana

Correction - the above is not "slow queries" but "time spent in queries".

@grzesiek pointed out gitlab-org/gitlab!45738 (merged).

Here are the timeouts for this on the main stage: https://log.gprd.gitlab.net/goto/1214fb28c57714a41f107c3225b8845e

The production deploy corresponding to this MR started at 09:46 UTC, so that's almost certainly it: https://gitlab.slack.com/archives/C8PKBH3M5/p1604655960184900

@bjk-gitlab ran the query in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11840 on all Patroni nodes and it didn't time out anywhere, so the problem is probably the change from gitlab-org/gitlab!45738 (merged).

I'll create a revert MR.

mentioned in merge request gitlab-org/gitlab!47074 (merged)

added ServiceWeb label and removed ServicePostgres label

changed the description

mentioned in issue on-call-handovers#1111 (closed)

mentioned in issue gitlab-org/release/tasks#1806 (closed)

mentioned in issue on-call-handovers#1112 (closed)

gitlab-org/release/tasks#1806 (comment 443445293) should allow us to mark this as IncidentMitigated, but I wasn't involved in the triage so hopefully someone here can confirm first.

@rspeicher we cannot deploy the merge request with the fix because there are active incidents 🤔

mentioned in issue on-call-handovers#1113 (closed)

mentioned in issue on-call-handovers#1114 (closed)

mentioned in issue on-call-handovers#1115 (closed)

mentioned in issue on-call-handovers#1116 (closed)

mentioned in issue on-call-handovers#1117 (closed)

mentioned in issue on-call-handovers#1119 (closed)

Status update: This seems to work now, but we are not sure why.

Leaving the incident active for now until the revert has been fully deployed. The deploy is in progress but still blocked by another QA-related incident: #2997 (closed).

mentioned in issue on-call-handovers#1120 (closed)

mentioned in issue on-call-handovers#1121 (closed)

One of our merge requests couldn't be accessed yesterday. This morning we could access it, but the MR showed 0 commits (there should be about 30). Pushing another commit to the branch seems to have fixed it.

In case anyone has a similar problem, just create another commit and push it to the remote branch.

mentioned in issue on-call-handovers#1122 (closed)

I think that gitlab-org/release/tasks#1806 (comment 444389353) explains why it works, but the MR gitlab-org/gitlab!47074 (merged) seems to be stuck on staging. Applying the Pick into auto-deploy label resulted in the change being cherry-picked into a deploy branch, so the merge commit from the merge request itself might not be deployed. The revert, however, was deployed immediately and fixed the problem described in this issue.

I wonder if we should fix our tooling for tracking deployments of MRs when the Pick into auto-deploy label is used 🤔 /cc @smcgivern @albertoramos

Yeah, thanks for tracking that down @grzesiek! @albertoramos I think we can close this incident.

mentioned in issue gitlab-org/gitlab#277438 (closed)

added IncidentResolved label and removed IncidentActive label

closed

mentioned in issue reliability-reports#117 (closed)
