Alert when KAS crashes or is killed
I've just found a couple of stack traces in the KAS logs in Elastic. They went unnoticed, which should NOT happen!
- Why didn't we notice them?
- What's currently in place? Metrics?
- Look at https://docs.sentry.io/platforms/go/usage/panics/ and consider using it (probably not: it's goroutine-local and wouldn't go all the way).
- ...
Current Situation
By analysing logs, we noticed that KAS is crashing (see #586 (comment 1932686126)). None of the metrics or alerts we have in place reported those crashes.
We do have the Containers Termination graph in Kube Container Details (see https://dashboards.gitlab.net/d/kas-kube-containers/kas3a-kube-containers-detail?orgId=1&viewPanel=784136764&from=1717712520000&to=1717734179999). However, the aforementioned crashes have not been recorded there.
Expectations
We want metrics, graphs, and alerts for fundamental KAS events such as application crashes.
Edited by Timo Furrer