Epic: New architecture for pipeline creation
Problem to solve
Due to complications with CUDA version compatibility in DinD when supporting GPU and TensorFlow, we were forced to keep the base Docker image (publishing) and the 'experiment' image in sync. This introduces a large amount of complexity and is not sustainable.
The DinD approach was used to stack multiple data operations in one job. However, the same CUDA library version must be present in epf and in the publishing base image of each data operation. This makes it impossible to run arbitrary published images (any CUDA version, or CPU-only), because we would need to guarantee compatibility between every data operation image and the experiment image.
User experience goal
Therefore, if we keep using DinD we cannot guarantee that a "stacked data operation job" will run unless all operations use the same publishing base image.
DinD also adds extra steps to run the models, which could instead be executed directly in the published image, yielding a cleaner implementation.
According to what we discussed, there are enough reasons to redefine how we create jobs, specifically by removing DinD. The new implementation goes back to what we designed before. The main changes we want to implement are:
- Remove DinD: run the job directly in the published image. This also removes the 'experiment' image.
- Offer a couple of base images for publishing as a first iteration. We will start with the one we have now, epf, which is based on `tensorflow:2.1.0-gpu-py3`.
- Create a .yml template for each job type (DataOps, models, visualizations), supporting different data-sharing options (branch, artifacts, external storage).
- MLReef will generate the whole pipeline orchestration (.yml), where we can order each job as a stage, for example: first DataOps, then a model, finally a visualization or deployment. We will call this the Pipeline Builder.
- We plan to let users publish their models with their own Docker images, so responsibility for package version compatibility will rest entirely in the users' hands.
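To make the orchestration idea concrete, here is a minimal sketch of the kind of pipeline .yml the Pipeline Builder could generate. All job names, image names, paths, and script commands below are hypothetical placeholders, not the final format; the key point is that each stage runs directly in its published image with no DinD.

```yaml
# Hypothetical pipeline generated by the Pipeline Builder (sketch).
stages:
  - dataops
  - model
  - visualization

augment-images:
  stage: dataops
  image: registry.example.com/user/augment:latest   # published DataOps image (hypothetical)
  script:
    - python /app/augment.py --input data/ --output output/
  artifacts:
    paths:
      - output/        # share processed data with the next stage

train-model:
  stage: model
  image: registry.example.com/user/model:latest     # published model image (hypothetical)
  script:
    - python /app/train.py --data output/

plot-metrics:
  stage: visualization
  image: registry.example.com/user/visualize:latest # published visualization image (hypothetical)
  script:
    - python /app/plot.py --data output/
```

Ordering the jobs as stages gives exactly the DataOps → model → visualization/deployment sequence described above, with artifacts as one of the possible data-sharing options between stages.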
Proposal for Technical Solution
Pipeline builder
Build the stages of the ML cycle as jobs, where each job has its own .yml template with its own options.
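One such per-job-type template could look roughly like the following sketch. The template name, variable names, and entrypoint path are assumptions for illustration only; the real templates will define the concrete options for each job type.

```yaml
# Hypothetical .yml template for a DataOps job (sketch).
# The data-sharing mechanism (branch, artifacts, external storage)
# would be a per-pipeline option, shown here as a variable.
.dataops-template:
  image: $PUBLISHED_IMAGE        # the user's published DataOps image (placeholder)
  variables:
    DATA_SHARING: "artifacts"    # or "branch" / "external-storage" (assumed values)
  script:
    - python /epf/entrypoint.py --params "$OPERATION_PARAMS"
  artifacts:
    paths:
      - output/
```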
First iteration
- Remove DinD from "mlreef-template.yml". Execute the job directly in the container of the published image.
- Adapt DataOps "stacking" so it works as long as the stacked operations were published with the same base image.
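In terms of the template itself, removing DinD essentially means replacing the nested Docker daemon with a direct `image:` reference. A hedged before/after sketch (the exact keys and commands in the real "mlreef-template.yml" may differ):

```yaml
# Before (DinD, sketch): the job runs in a generic image and pulls
# the published image inside a nested Docker daemon.
# job:
#   image: docker:latest
#   services:
#     - docker:dind
#   script:
#     - docker run $PUBLISHED_IMAGE python /app/run.py

# After (sketch): the job runs directly in the published image.
job:
  image: $PUBLISHED_IMAGE   # placeholder for the published image reference
  script:
    - python /app/run.py    # hypothetical entrypoint
```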