docs: how to monitor Pods killed by Kubernetes due to resource limits
From the customer inquiry:
We are using GitLab runners configured with the Kubernetes executor, with default CPU, memory, and ephemeral storage requests/limits configured. If a job exceeds one of those limits, Kubernetes reaps the Pod. Unfortunately, the job log gives the developer no indication that their job died because it exceeded these limits. Is there any way to configure our runners to be aware of this and report it in the job log?
and from @ggeorgiev_gitlab:
We don't. I think this falls under the space of "This should be done in Kubernetes": when a Pod is killed, it is killed with a specific exit code, so I imagine a Kubernetes monitoring tool can detect that. I have no practical example, but as a whole this is a Kubernetes problem, not a Runner problem.
I don't think we would particularly want to add it, simply because it won't work reliably. To know the exit code of a Pod, we need to query the API for that specific Pod, but if the Pod is already gone we get no information. You basically need permission to plug into the Kubernetes control plane's event stream to reliably know when and why an event happened.
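For reference, the API query in question looks roughly like the following. This is a minimal sketch assuming the official Python `kubernetes` client and a hypothetical Pod name and `gitlab-runner` namespace; it shows both the happy path (a container killed for exceeding its memory limit reports `reason=OOMKilled` with exit code 137) and the failure mode described above (a 404 once the Pod object is gone):

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

try:
    # Pod name and namespace are hypothetical placeholders.
    pod = v1.read_namespaced_pod(name="runner-abc123", namespace="gitlab-runner")
except ApiException as e:
    # If the Pod has already been reaped, the API returns 404 and the
    # exit code is no longer recoverable -- the reliability problem above.
    print(f"cannot query Pod: {e.status} {e.reason}")
    raise SystemExit(1)

for cs in pod.status.container_statuses or []:
    # A container killed for exceeding its memory limit reports
    # reason=OOMKilled and exit_code=137 in its terminated state.
    term = cs.state.terminated or (cs.last_state.terminated if cs.last_state else None)
    if term:
        print(f"{cs.name}: reason={term.reason}, exit_code={term.exit_code}")
```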
A docs entry on how to monitor these Pods externally might help.
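Such an entry could start from watching the Kubernetes event stream, which keeps kill and eviction events around for a while even after the Pod object itself is deleted. A minimal sketch, again assuming the official Python `kubernetes` client and a hypothetical `gitlab-runner` namespace; the event `reason` strings matched here are illustrative and vary by Kubernetes version and component, and the watching identity needs RBAC permission to list and watch Events:

```python
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

# Illustrative reasons only; confirm against the cluster's actual events.
LIMIT_REASONS = {"OOMKilling", "Evicted"}

w = watch.Watch()
for item in w.stream(v1.list_namespaced_event, namespace="gitlab-runner"):
    ev = item["object"]
    if ev.reason in LIMIT_REASONS:
        # Events can outlive the Pod, so this catches kills that a
        # query against the Pod API would miss.
        print(f"{ev.involved_object.kind}/{ev.involved_object.name}: "
              f"{ev.reason} - {ev.message}")
```

Note that events themselves are only retained for a limited time (one hour by default, controlled by the kube-apiserver `--event-ttl` flag), so reliable reporting needs a long-running watcher or an event exporter rather than ad-hoc queries.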