
WIP: use k8s jobs instead of pod+exec

Chet Lemon requested to merge (removed):shield-ai-11-10-1 into 11-10-stable

What does this MR do?

This WIP MR shows what we did to solve the problem described in issue #3814 (closed).

I have set the target branch to the point where we originally branched off.

I have altered the CI YAML to work with our infrastructure and, to get this working for our use case, have broken other functionality (the non-Kubernetes executors). Unfortunately, I do not have the bandwidth to spend much time on this, so I'm hoping this can be a useful platform for discussion.

High-level changes:

  • use of K8s jobs instead of direct pod creation
  • helper and build commands are placed into a ConfigMap that is deployed and cleaned up along with the jobs
    • the job mounts the ConfigMap as a volume and runs the scripts
  • added GPU support
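
To make the list above concrete, here is a minimal sketch of what a ConfigMap plus Job pair could look like, built as plain Python dicts that can be dumped to JSON and applied. It is not the manifest this MR generates: the function names, the `/scripts` mount path, the `build` container name, and running the script with `sh` are all assumptions for illustration. The `nvidia.com/gpu` resource name is the one used by the NVIDIA device plugin; other GPU setups may differ.

```python
import json


def build_configmap(name, scripts):
    """Return a ConfigMap manifest holding one key per script file (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": dict(scripts),
    }


def build_job(name, image, configmap_name, entrypoint, gpu_count=0):
    """Return a Job manifest that mounts the ConfigMap and runs one script (hypothetical helper)."""
    container = {
        "name": "build",  # assumed container name
        "image": image,
        # Assumption: scripts are mounted at /scripts and run with sh.
        "command": ["sh", f"/scripts/{entrypoint}"],
        "volumeMounts": [{"name": "scripts", "mountPath": "/scripts"}],
    }
    if gpu_count > 0:
        # GPUs are requested through resource limits.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpu_count}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Job pod templates must use Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [container],
                    "volumes": [
                        {"name": "scripts", "configMap": {"name": configmap_name}}
                    ],
                }
            }
        },
    }


if __name__ == "__main__":
    cm = build_configmap("build-scripts", {"build.sh": "echo building"})
    job = build_job("build-1", "alpine:3", "build-scripts", "build.sh", gpu_count=1)
    print(json.dumps([cm, job], indent=2))
```

Because the scripts live in a ConfigMap rather than being streamed over an exec session, a replacement pod can pick them up again without the runner re-sending anything.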

Benefits:

  • Huge savings on cloud costs (we use GKE)
    • we can enable autoscaling and preemptible node types
    • builds tolerate the node being shut down at any time; they simply restart (thanks to k8s jobs!)
  • Higher pipeline reliability
    • we no longer see broken exec pipes on our long-running (>60 min) builds; see issue #3814 (closed)
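
The restart behaviour described above comes from the Job controller replacing pods that fail or disappear with their node. A rough sketch of the Job fields involved follows; it is an illustration under assumptions, not the settings used in this MR. The helper name `with_preemption_tolerance` is hypothetical, the `backoffLimit` value is arbitrary, and pinning builds to preemptible nodes via the `cloud.google.com/gke-preemptible` node label is one possible choice on GKE rather than something this MR is known to do.

```python
import copy

# Node label that GKE sets on preemptible nodes.
PREEMPTIBLE_LABEL = "cloud.google.com/gke-preemptible"


def with_preemption_tolerance(job, backoff_limit=6):
    """Return a copy of a Job manifest tuned for preemptible nodes (hypothetical helper)."""
    patched = copy.deepcopy(job)  # leave the caller's manifest untouched
    # backoffLimit caps how many retries happen before the Job is marked failed.
    patched["spec"]["backoffLimit"] = backoff_limit
    pod_spec = patched["spec"]["template"]["spec"]
    # Assumption: schedule builds only onto preemptible nodes.
    pod_spec.setdefault("nodeSelector", {})[PREEMPTIBLE_LABEL] = "true"
    return patched


if __name__ == "__main__":
    base = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "build-1"},
        "spec": {"template": {"spec": {"restartPolicy": "Never", "containers": []}}},
    }
    print(with_preemption_tolerance(base, backoff_limit=10))
```

A build that restarts from scratch loses its progress, so this trade-off suits pipelines where the cost savings outweigh the occasional rerun.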

Why was this MR needed?

Refactoring to use k8s Job objects instead of the existing pod + exec strategy made our pipelines much more stable overall.

What are the relevant issue numbers?

#3814 (closed)

Edited by Chet Lemon
