2020-04-30 Thanos OOM (cgroup limit) on production
Summary
The thanos process on gprd prometheus instances (prometheus-01-inf-gprd at least, possibly others) is hitting its configured cgroup memory limit (512MB) and being killed. This has been happening for several days, but today it happened 3 times in less than an hour, resulting in a page.
This is an S4 only; there is no service outage to GitLab.com. It might be affecting observability, but probably only historical data, and the effect is likely to be intermittent/minimal, as thanos is running successfully most of the time.
More information will be added as we investigate the issue.
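One way to confirm that the process is running into the cgroup limit described above is to read the memory controller counters for its cgroup. The following is a minimal sketch, assuming the host uses cgroup v1 and that thanos runs in its own cgroup directory. The function names and the example path in the usage note are illustrative assumptions, not taken from the gprd configuration.

```python
from pathlib import Path

# cgroup v1 memory controller files (standard kernel interface names).
COUNTER_FILES = (
    "memory.limit_in_bytes",      # configured hard limit
    "memory.usage_in_bytes",      # current usage
    "memory.max_usage_in_bytes",  # peak usage recorded so far
    "memory.failcnt",             # number of times usage hit the limit
)


def read_cgroup_memory(cgroup_dir):
    """Read the cgroup v1 memory counters in cgroup_dir and return them as ints."""
    base = Path(cgroup_dir)
    return {name: int((base / name).read_text().strip()) for name in COUNTER_FILES}


def summarize(stats):
    """Convert raw counters into MiB and a peak-to-limit ratio."""
    limit = stats["memory.limit_in_bytes"]
    peak = stats["memory.max_usage_in_bytes"]
    return {
        "limit_mib": limit / 2**20,
        "usage_mib": stats["memory.usage_in_bytes"] / 2**20,
        "peak_mib": peak / 2**20,
        "peak_fraction": peak / limit,
        # failcnt counts limit hits, not kills; a rising value means the
        # cgroup keeps running into its ceiling.
        "limit_hits": stats["memory.failcnt"],
    }
```

As a usage example, `summarize(read_cgroup_memory("/sys/fs/cgroup/memory/system.slice/thanos.service"))` would report the counters for a systemd-managed unit at that (hypothetical) path. A `peak_fraction` at or near 1.0 together with a growing `limit_hits` would be consistent with the kills described above.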
Timeline
All times UTC.
2020-04-30
- 00:04 - Page received, investigations began