Add option to disable K8s warning logs
MR 4211 (shipped in 16.2) added a feature to surface the latest, highest-severity event related to a failed pod. The idea is that if an image pull failure or something similar caused the error, the user can now actually see it.
However, this is wasting our devs' time. We have a persistent K8s warning on our pods that basically says "it took slightly longer to bring up the pod because we waited for the volume to be created". The specific warning doesn't matter here, only that there is one.
Now, whenever a job fails, the dev sees something like this at the end of the log:
---job logs---
<ACTUAL JOB FAILURE REASON>
WARNING: Event retrieved from the cluster: 0/36 nodes are available: 36 waiting for ephemeral volume controller to create the persistentvolumeclaim "runner-x-project-1-concurrent-0-x-builds". preemption: 0/36 nodes are available: 36 Preemption is not helpful for scheduling.
Cleaning up project directory and file based variables
ERROR: Job failed: command terminated with exit code 1
Looking at this, it's super easy to miss the <ACTUAL JOB FAILURE REASON> at the top because of the big K8s warning below it. As a result, many devs ask whether there's an infrastructure problem when nothing is wrong at all.
Even when it's not the Preemption warning, we sometimes get a different event logged simply because the pod doesn't come up instantly. There is always some warning for the system to print, and it confuses everyone. I've had to explain to dozens of devs that their real error is slightly further up so they can actually troubleshoot.
TL;DR: We'd like a configuration option to disable the additional K8s logs on job failure.
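For illustration only, here is one hypothetical shape this could take in the runner's config.toml. The option name `print_pod_warning_events` is invented for this sketch and is not an existing setting; the point is just that it would live alongside the other Kubernetes executor options:

```toml
concurrent = 1

[[runners]]
  name = "example-runner"
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "ci"
    # Hypothetical option (name made up for this request):
    # suppress the "WARNING: Event retrieved from the cluster: ..."
    # lines appended to the log when a job fails.
    print_pod_warning_events = false
```

An environment variable or per-job CI variable equivalent would work just as well for us, as long as it can be set once at the runner level.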