2020-08-27: High memory usage by workhorse on API nodes
Summary
We discovered an issue with large LSIF uploads from CI jobs. Workhorse processes these before handing them off to Rails, and buffers the whole request body in memory.
Because we don't yet know how large it will be, this can result in huge allocations for large file uploads. In this case we were getting LSIF artifacts of 12 GB and larger.
This allocated memory cannot easily be returned to the OS. As a result, the workhorse process stays bloated and also steals page cache from Rails, which runs on the same box.
This was exacerbated by retry logic in the client re-attempting the upload 3 times, leading to up to 4 workhorse processes getting "infected" with memory bloat each time this happens.
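To illustrate the failure mode, here is a minimal Go sketch of the difference between buffering an entire request body in memory and spooling it to a temporary file. This is not workhorse's actual code; the function names are hypothetical.

```go
package upload

import (
	"bytes"
	"io"
	"net/http"
	"os"
)

// bufferInMemory mirrors the problematic pattern: the whole request body is
// copied into a growing in-memory buffer, so a 12 GB upload costs roughly
// 12 GB of heap that the Go runtime will not readily hand back to the OS.
func bufferInMemory(r *http.Request) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	if _, err := io.Copy(buf, r.Body); err != nil {
		return nil, err
	}
	return buf, nil
}

// spoolToDisk streams the body to a temporary file instead, keeping memory
// usage roughly constant regardless of the upload size.
func spoolToDisk(r *http.Request) (*os.File, error) {
	tmp, err := os.CreateTemp("", "upload-*")
	if err != nil {
		return nil, err
	}
	if _, err := io.Copy(tmp, r.Body); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, err
	}
	// Rewind so the caller can read the spooled body from the start.
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, err
	}
	return tmp, nil
}
```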
Timeline
All times UTC.
2020-08-27
- 09:13 - iwiedler declares incident in Slack using the `/incident declare` command.
- 09:27 - workhorse is identified as the memory hog.
- 09:37 - Continuous profiling informs hypothesis of LSIF code allocating too much.
- 10:20 - We have a smoking gun, large LSIF requests in the workhorse logs.
- 10:33 - Source of allocation is found, remediation issue filed.
Incident Review
Summary
- Service(s) affected: workhorse, API
- Team attribution: n/a
- Minutes downtime or degradation: A few bursts of 500s, but we mostly stayed within SLO.
Metrics
https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&from=1598512885848&to=1598572701233
Customer Impact
Minimal.
Incident Response Analysis
- How was the event detected? PagerDuty alert: https://gitlab.pagerduty.com/incidents/PNX6KBT
- How could detection time be improved? It was solid.
- How did we reach the point where we knew how to mitigate the impact? It took an hour or so of investigation (see the timeline above).
- How could time to mitigation be improved? Pretty solid IMHO.
Post Incident Analysis
- How was the root cause diagnosed? Zoom call between @jacobvosmaer-gitlab, @nolith, @igor.drozdov, and myself. We used continuous profiling and Kibana.
- How could time to diagnosis be improved? We tracked it down pretty quickly.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Yes, there was https://gitlab.com/gitlab-org/gitlab/-/issues/227862.
- Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)? No.
5 Whys
I'll try to cover the intent of this in the Lessons Learned section below instead.
Lessons Learned
- Ultimately, this is an Application Limits problem. We had some logic for limiting upload size, but it wasn't working as intended. We need to be especially careful with in-memory buffering on unbounded inputs (see the sketch after this list).
- Continuous profiling saved the day! The fact that we have continuous profiling for Go services, and that Go has a decent memory profiler, enabled us to diagnose this issue very quickly. The ability to select the "worst" buckets in Cloud Profiler helps narrow down the search.
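As a rough illustration of the kind of guard we want around unbounded inputs, here is a minimal sketch using a hypothetical maxUploadBytes limit; it is not the actual workhorse change.

```go
package upload

import (
	"io"
	"net/http"
)

// maxUploadBytes is a hypothetical application limit used for illustration.
const maxUploadBytes = 100 << 20 // 100 MiB

// limitedBody wraps the request body so that reads past the limit return an
// error, instead of letting an unbounded input grow an in-memory buffer.
func limitedBody(w http.ResponseWriter, r *http.Request) io.ReadCloser {
	// http.MaxBytesReader returns an error once more than maxUploadBytes
	// bytes have been read from the body.
	return http.MaxBytesReader(w, r.Body, maxUploadBytes)
}
```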
Corrective Actions
As per @nolith:
- gitlab-org/gitlab-workhorse#284 (closed) - lsif parser should not buffer zip files in memory
- gitlab-org/gitlab#241720 (closed) - When a lsif artifact is too large, we should return 413 and not 500
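For the second item, the intended behaviour is roughly the following. This is a hypothetical middleware sketch with a made-up limit, not the actual fix.

```go
package upload

import "net/http"

// maxLSIFBytes is a hypothetical limit used only for this sketch.
const maxLSIFBytes = 100 << 20 // 100 MiB

// rejectOversize answers oversize uploads with 413 Request Entity Too Large
// up front, rather than letting them fail later with a 500.
func rejectOversize(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ContentLength is -1 when the client did not declare a length;
		// such requests still need a streaming limit further down the stack.
		if r.ContentLength > maxLSIFBytes {
			http.Error(w, "LSIF artifact too large", http.StatusRequestEntityTooLarge)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```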