Add support for Nvidia docker runtime to the Kubernetes executor
- Some customers need to run the GitLab Runner Kubernetes executor on GPU-enabled nodes that use the Nvidia Docker runtime. However, this does not work today, and customers report that GPU jobs fail.
Notes from Jean-Pierre Huynh's initial issue description on running with an AWS EKS GPU Optimized AMI:
I am trying to migrate our GPU workload, currently running on docker+machine executors, to Kubernetes to unify our GitLab Runner stack.
The non-GPU workload works fine on our Kubernetes (1.12) cluster running on EKS, but GPU jobs fail with a generic error that doesn't really help us debug:
ERROR: Job failed (system failure): pod already succeeded before it begins running
Note that we tried many different scenarios and narrowed it down to something going wrong between GitLab Runner (potentially in the helper container of the pod) and the AMI we are using, ami-0fbc930681258db86, which configures Docker to use the nvidia runtime.
The underlying infrastructure doesn't seem to be the problem here, because manually deploying pods on those nodes, with or without a GPU request, works as expected.
It's only through gitlab-runner that we see the problem.
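For anyone who wants to repeat the manual check described above, here is a minimal sketch of a helper that builds a test pod manifest with or without a GPU request. It assumes the nodes expose GPUs through the NVIDIA device plugin under the `nvidia.com/gpu` resource name. The function name `gpu_test_pod`, the pod name, and the image `example/cuda-image:latest` are hypothetical placeholders and not taken from the original report.

```python
import json


def gpu_test_pod(name, image, gpu_count=1):
    """Build a minimal Pod manifest; request GPUs only when gpu_count > 0."""
    container = {
        "name": "main",
        "image": image,  # placeholder image, assumed to ship nvidia-smi
        "command": ["nvidia-smi"],
    }
    if gpu_count > 0:
        # Assumption: GPUs are advertised as the extended resource nvidia.com/gpu.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpu_count}}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {"restartPolicy": "Never", "containers": [container]},
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
    pod = gpu_test_pod("gpu-smoke-test", "example/cuda-image:latest")
    print(json.dumps(pod, indent=2))
```

Calling it with `gpu_count=0` produces the same pod without the GPU request, which makes it easy to compare both cases on the same node.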
Steps to reproduce
Run a gitlab-ci job on a Kubernetes node that is using the following AMI from Amazon:
The job is scheduled and the pod is created, but it completes within a second with the following error (GitLab Runner running with DEBUG log level):
Container "helper" exited with error: pod already succeeded before it begins running job=334848 project=241 runner=Kz9oKHks
ERROR: Job failed (system failure): pod already succeeded before it begins running duration=5.056888263s job=334848 project=241 runner=Kz9oKHks
Appending trace to coordinator... ok code=202 job=334848 job-log=0-771 job-status=running runner=Kz9oKHks sent-log=522-770 status=202 Accepted
Submitting job to coordinator... ok code=200 job=334848 job-status= runner=Kz9oKHks
ERROR: Failed to process runner builds=0 error=pod already succeeded before it begins running executor=kubernetes runner=Kz9oKHks
We are using Amazon EKS
Used GitLab Runner version
This has been discussed with @aciciu in support ticket 121165; I am raising it here for visibility.