2020-08-30: Error rate increase for web and api fleet due to CPU and memory pressure on Gitaly node file-49
Summary
One of the Gitaly nodes was targeted by a workload that puts excessive CPU and memory pressure on the node. This caused slowness and some intermittent errors for other users whose repos are colocated on the same storage node.
The abusive workload has been mitigated, and normal performance has resumed.
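The causal chain from node saturation to user-facing errors is worth spelling out: when a Gitaly node is starved for CPU and memory, gRPC calls to it slow down, and calls that exceed the caller's deadline fail outright, which upstream services surface as intermittent 5xx responses. The following Go sketch simulates that mechanism with a hypothetical slow operation and a client-side deadline; the function name and all timings are invented for illustration and are not taken from Gitaly's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// slowRPC stands in for a gRPC call to a Gitaly node. Under CPU and
// memory pressure its latency varies widely; the durations here are
// purely illustrative.
func slowRPC(ctx context.Context) error {
	latency := time.Duration(rand.Intn(300)) * time.Millisecond
	select {
	case <-time.After(latency):
		return nil // the call completed before the deadline
	case <-ctx.Done():
		return ctx.Err() // deadline exceeded: surfaces upstream as a 5xx
	}
}

func main() {
	rand.Seed(42)
	var failures int
	for i := 0; i < 20; i++ {
		// Callers impose a deadline; on a healthy node this is ample,
		// but on a saturated node some calls will blow past it.
		ctx, cancel := context.WithTimeout(context.Background(), 150*time.Millisecond)
		err := slowRPC(ctx)
		cancel()
		if errors.Is(err, context.DeadlineExceeded) {
			failures++
		}
	}
	fmt.Printf("%d of 20 simulated calls exceeded their deadline\n", failures)
}
```

Only a fraction of the simulated calls fail, which matches the intermittent character of the errors seen during the incident: the node was slow, not down.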
Timeline
All times UTC.
2020-08-30
At least one Gitaly node also showed a jump in error rate, suggesting the workload was concentrated on a single specific repository.
- 15:42 - The rate of HTTP 5xx errors jumped abruptly to 1/second for web and 3/second for api.
- 15:51 - PagerDuty alert that Gitaly node `file-49` has an elevated error rate: "Firing 1 - Gitaly error rate is too high: 11.21"
- 15:57 - msmiley declares an incident in Slack using the `/incident declare` command.
- 16:10 - Determined that severe CPU and memory pressure on `file-49` is leading to slowness and intermittent errors for gRPC calls to that Gitaly node. Identified this as another occurrence of a known pathology; existing planned corrective actions should prevent it in the near future.
- 16:23 - Identified the specific project involved in the pathological workload. Collected diagnostic data and confirmed it matches the known pattern that the proposed fix will mitigate.
- 16:44 - PagerDuty alert from the multi-burn-rate heuristic, covering the overall error rate for the Workhorse service tier. This was another side effect of the same root cause, which had already been identified shortly after the first alert about the Gitaly node's error rate. (A sketch of the multi-burn-rate heuristic follows the timeline.)
- 17:15 - Recovery. Error rate returns to normal after mitigation is applied.
- 17:25 - CPU and memory usage return to normal after the clean-up steps finish.
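For context on the 16:44 alert: a multi-burn-rate (multiwindow) alert compares how fast the error budget is being consumed over a long window and a short window, and fires only when both exceed a threshold, so it catches sustained fast burns without flapping on brief spikes. The Go sketch below shows the arithmetic; the window sizes, threshold, and request counts are invented for illustration, and this is not the actual alerting rule used in production.

```go
package main

import "fmt"

// burnRate expresses how fast the error budget is being consumed:
// a burn rate of 1.0 means errors arrive exactly at the rate the
// SLO permits; higher values exhaust the budget proportionally faster.
func burnRate(errorCount, totalCount, sloErrorRatio float64) float64 {
	if totalCount == 0 {
		return 0
	}
	return (errorCount / totalCount) / sloErrorRatio
}

func main() {
	// Assumed SLO: 99.9% of requests succeed, i.e. a 0.1% error budget.
	const sloErrorRatio = 0.001

	// Hypothetical request/error counts for two observation windows;
	// a real rule would read these from the metrics system.
	longErrors, longTotal := 60_000.0, 3_200_000.0 // e.g. a 1h window
	shortErrors, shortTotal := 6_000.0, 260_000.0  // e.g. a 5m window

	long := burnRate(longErrors, longTotal, sloErrorRatio)
	short := burnRate(shortErrors, shortTotal, sloErrorRatio)

	// Fire only when BOTH windows burn fast: the long window shows the
	// problem is sustained, the short window shows it is still ongoing.
	const threshold = 14.4 // an example fast-burn threshold
	fmt.Printf("long-window burn rate:  %.2f\n", long)
	fmt.Printf("short-window burn rate: %.2f\n", short)
	if long > threshold && short > threshold {
		fmt.Println("alert: multi-window burn-rate threshold exceeded")
	} else {
		fmt.Println("no alert")
	}
}
```

With these example counts both windows burn at well over the threshold, so the alert fires, mirroring how the Workhorse-tier alert trailed the Gitaly-specific one once the elevated error rate had been sustained long enough.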