2020-08-27: High memory usage by workhorse on API nodes
Summary
We discovered an issue with large LSIF uploads from CI jobs. Workhorse processes these before handing them off to Rails, and buffers the whole request body in memory.
Because we don't yet know how large it will be, this can result in huge allocations for large file uploads. In this case we were getting LSIF artifacts of 12 GB and larger.
This allocated memory cannot easily be returned to the OS. As a result, the workhorse process stays bloated and also steals page cache from Rails, which runs on the same box.
This was exacerbated by retry logic in the client re-attempting the upload 3 times, leading to up to 4 workhorse processes getting "infected" with memory bloat each time this happens.
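To illustrate the failure mode, here is a minimal Go sketch of the difference between buffering an entire request body in memory and spooling it to a temporary file. This is not workhorse's actual code; the function names are hypothetical.

```go
package upload

import (
	"bytes"
	"io"
	"net/http"
	"os"
)

// bufferInMemory mirrors the problematic pattern: the whole request body is
// copied into a growing in-memory buffer, so a 12 GB upload costs roughly
// 12 GB of heap that the Go runtime will not readily hand back to the OS.
func bufferInMemory(r *http.Request) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	if _, err := io.Copy(buf, r.Body); err != nil {
		return nil, err
	}
	return buf, nil
}

// spoolToDisk streams the body to a temporary file instead, keeping memory
// usage roughly constant regardless of the upload size.
func spoolToDisk(r *http.Request) (*os.File, error) {
	tmp, err := os.CreateTemp("", "upload-*")
	if err != nil {
		return nil, err
	}
	if _, err := io.Copy(tmp, r.Body); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, err
	}
	// Rewind so the caller can read the spooled body from the start.
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, err
	}
	return tmp, nil
}
```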
Timeline
All times UTC.
2020-08-27
- 09:13 - iwiedler declares incident in Slack using the `/incident declare` command.
- 09:27 - workhorse is identified as the memory hog.
- 09:37 - Continuous profiling informs hypothesis of LSIF code allocating too much.
- 10:20 - We have a smoking gun, large LSIF requests in the workhorse logs.
- 10:33 - Source of allocation is found, remediation issue filed.
Incident Review
Summary
- Service(s) affected: workhorse, API
- Team attribution: n/a
- Minutes downtime or degradation: A few bursts of 500s, but we mostly stayed within SLO.
Metrics
https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&from=1598512885848&to=1598572701233
Customer Impact
Minimal.
Incident Response Analysis
- How was the event detected? PagerDuty alert: https://gitlab.pagerduty.com/incidents/PNX6KBT
- How could detection time be improved? It was solid.
- How did we reach the point where we knew how to mitigate the impact? It took an hour or so of investigation (see the timeline above).
- How could time to mitigation be improved? Pretty solid IMHO.
Post Incident Analysis
- How was the root cause diagnosed? Zoom call between @jacobvosmaer-gitlab, @nolith, @igor.drozdov, and myself. We used continuous profiling and Kibana.
- How could time to diagnosis be improved? We tracked it down pretty quickly.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Yes, there was https://gitlab.com/gitlab-org/gitlab/-/issues/227862.
- Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)? No.
5 Whys
I'll try to cover the intent of this in the Lessons Learned section below instead.
Lessons Learned
- Ultimately, this is an Application Limits problem. We had some logic for limiting upload size, but it wasn't working as intended. We need to be especially careful with in-memory buffering on unbounded inputs (see the sketch after this list).
- Continuous profiling saved the day! The fact that we have continuous profiling for Go services, and that Go has a decent memory profiler, enabled us to diagnose this issue very quickly. The ability to select the "worst" buckets in Cloud Profiler helps narrow down the search.
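As a rough illustration of the kind of guard we want around unbounded inputs, here is a minimal sketch using a hypothetical maxUploadBytes limit; it is not the actual workhorse change.

```go
package upload

import (
	"io"
	"net/http"
)

// maxUploadBytes is a hypothetical application limit used for illustration.
const maxUploadBytes = 100 << 20 // 100 MiB

// limitedBody wraps the request body so that reads past the limit return an
// error, instead of letting an unbounded input grow an in-memory buffer.
func limitedBody(w http.ResponseWriter, r *http.Request) io.ReadCloser {
	// http.MaxBytesReader returns an error once more than maxUploadBytes
	// bytes have been read from the body.
	return http.MaxBytesReader(w, r.Body, maxUploadBytes)
}
```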
Corrective Actions
As per @nolith:
- gitlab-org/gitlab-workhorse#284 (closed) - lsif parser should not buffer zip files in memory
- gitlab-org/gitlab#241720 (closed) - When a lsif artifact is too large, we should return 413 and not 500
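For the second item, the intended behaviour is roughly the following. This is a hypothetical middleware sketch with a made-up limit, not the actual fix.

```go
package upload

import "net/http"

// maxLSIFBytes is a hypothetical limit used only for this sketch.
const maxLSIFBytes = 100 << 20 // 100 MiB

// rejectOversize answers oversize uploads with 413 Request Entity Too Large
// up front, rather than letting them fail later with a 500.
func rejectOversize(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// ContentLength is -1 when the client did not declare a length;
		// such requests still need a streaming limit further down the stack.
		if r.ContentLength > maxLSIFBytes {
			http.Error(w, "LSIF artifact too large", http.StatusRequestEntityTooLarge)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```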