Add support for Nvidia docker runtime to the Kubernetes executor
Proposal
- Add support for Nvidia docker runtime to the Kubernetes executor.
Background
- Some customers need to use the GitLab Runner Kubernetes executor on GPU-enabled nodes that use the Nvidia docker runtime. However, this does not currently work, and customers report that their GPU jobs fail.
Notes from Jean-Pierre Huynh's initial issue description on running with an AWS EKS GPU Optimized AMI:
I am trying to migrate our GPU workload, currently running on docker+machine executors, to Kubernetes to unify our GitLab Runner stack.
The non-GPU workload works fine on our Kubernetes (1.12) cluster running on EKS, but when we try to run the GPU jobs, they fail with a generic error that doesn't really help us debug:
ERROR: Job failed (system failure): pod already succeeded before it begins running
Note that we tried a lot of different scenarios and narrowed it down to something going wrong between GitLab Runner (potentially the helper container of the pod) and the AMI we are using, ami-0fbc930681258db86, which configures Docker to use the nvidia runtime.
The underlying infrastructure doesn't seem to be the problem here, because manually deploying pods on those nodes, with or without a GPU request, works as expected.
It's only through gitlab-runner that we see the problem.
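For anyone trying to repeat the manual check described above, here is a minimal sketch of how such a test pod could be generated. It is not the reporter's actual manifest. It assumes the cluster runs the NVIDIA device plugin, which advertises the `nvidia.com/gpu` resource. The function name `build_pod_manifest`, the pod name, and the image name are illustrative placeholders.

```python
import json


def build_pod_manifest(name, image, gpus=0):
    """Return a minimal Pod manifest; a GPU limit is added only when gpus > 0."""
    container = {
        "name": "main",
        "image": image,
        # nvidia-smi lists the GPUs visible inside the container.
        "command": ["nvidia-smi"],
    }
    if gpus > 0:
        # Assumption: the NVIDIA device plugin exposes GPUs as "nvidia.com/gpu".
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {"restartPolicy": "Never", "containers": [container]},
    }


if __name__ == "__main__":
    # "your-cuda-image:tag" is a placeholder; substitute an image that ships nvidia-smi.
    manifest = build_pod_manifest("gpu-check", "your-cuda-image:tag", gpus=1)
    print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` creates the pod; calling the function with `gpus=0` gives the variant without a GPU request.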
Steps to reproduce
Run a gitlab-ci job on a Kubernetes node that is using the following AMI from Amazon: ami-0fbc930681258db86
Actual behavior
The job is scheduled and the pod is created, but it completes within a second with the following error (GitLab Runner running with DEBUG log level):
Container "helper" exited with error: pod already succeeded before it begins running job=334848 project=241 runner=Kz9oKHks
ERROR: Job failed (system failure): pod already succeeded before it begins running duration=5.056888263s job=334848 project=241 runner=Kz9oKHks
Appending trace to coordinator... ok code=202 job=334848 job-log=0-771 job-status=running runner=Kz9oKHks sent-log=522-770 status=202 Accepted
Submitting job to coordinator... ok code=200 job=334848 job-status= runner=Kz9oKHks
ERROR: Failed to process runner builds=0 error=pod already succeeded before it begins running executor=kubernetes runner=Kz9oKHks
Environment description
We are using Amazon EKS
Used GitLab Runner version
Version 11.11.2
Additional information
This has been discussed with @aciciu in support ticket number 121165, but I am raising it here for visibility.