Epic: New architecture for pipeline creation
Problem to solve
Due to complications with CUDA version compatibility in DinD when supporting GPU and TensorFlow, we were forced to keep the base Docker image (publishing) and the 'experiment' image in sync. This introduces a large amount of complexity and is not sustainable.
The DinD approach was used to stack multiple data operations in one job. However, the same CUDA library version must be present in epf and in the publishing base image of each data operation. This makes it impossible to run arbitrary published images (any CUDA version, or CPU-only), because we would need to guarantee compatibility between every data operation image and the experiment image.
User experience goal
Therefore, if we keep using DinD we cannot guarantee that a "stacked data operation job" will run unless all operations use the same publishing base image.
DinD also adds extra steps to run the models, which could instead be executed directly in the published image, yielding a cleaner implementation.
According to what we discussed, there are enough reasons to redefine how we create jobs, specifically by removing DinD. The new implementation goes back to what we designed before. The main changes we want to implement are:
- Remove DinD: run the job directly in the published image. This also removes the 'experiment' image.
- Offer a couple of base images for publishing as a first iteration. We will start with the one we have now, epf, which is based on `tensorflow:2.1.0-gpu-py3`.
- Create a .yml template for each job type (DataOps, models, visualizations), supporting different data-sharing options (branch, artifacts, external storage).
- MLReef will generate the whole pipeline orchestration (.yml), where we can order each job as a stage, for example: first DataOps, then a model, finally a visualization or deployment. We will call this the Pipeline Builder.
- We plan to let users publish their models with their own Docker images, so responsibility for package version compatibility will rest entirely in the users' hands.
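To make the orchestration idea concrete, here is a minimal sketch of the kind of pipeline .yml the Pipeline Builder could generate. All job names, image names, paths, and script commands below are hypothetical placeholders, not the final format; the key point is that each stage runs directly in its published image with no DinD.

```yaml
# Hypothetical pipeline generated by the Pipeline Builder (sketch).
stages:
  - dataops
  - model
  - visualization

augment-images:
  stage: dataops
  image: registry.example.com/user/augment:latest   # published DataOps image (hypothetical)
  script:
    - python /app/augment.py --input data/ --output output/
  artifacts:
    paths:
      - output/        # share processed data with the next stage

train-model:
  stage: model
  image: registry.example.com/user/model:latest     # published model image (hypothetical)
  script:
    - python /app/train.py --data output/

plot-metrics:
  stage: visualization
  image: registry.example.com/user/visualize:latest # published visualization image (hypothetical)
  script:
    - python /app/plot.py --data output/
```

Ordering the jobs as stages gives exactly the DataOps → model → visualization/deployment sequence described above, with artifacts as one of the possible data-sharing options between stages.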
Proposal for Technical Solution
Pipeline builder
Build the stages of the ML cycle as jobs, where each job has its own .yml template with its own options.
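One such per-job-type template could look roughly like the following sketch. The template name, variable names, and entrypoint path are assumptions for illustration only; the real templates will define the concrete options for each job type.

```yaml
# Hypothetical .yml template for a DataOps job (sketch).
# The data-sharing mechanism (branch, artifacts, external storage)
# would be a per-pipeline option, shown here as a variable.
.dataops-template:
  image: $PUBLISHED_IMAGE        # the user's published DataOps image (placeholder)
  variables:
    DATA_SHARING: "artifacts"    # or "branch" / "external-storage" (assumed values)
  script:
    - python /epf/entrypoint.py --params "$OPERATION_PARAMS"
  artifacts:
    paths:
      - output/
```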
First iteration
- Remove DinD from "mlreef-template.yml". Execute the job directly in the container of the published image.
- Adapt DataOps "stacking" so it works as long as the stacked operations were published with the same base image.
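In terms of the template itself, removing DinD essentially means replacing the nested Docker daemon with a direct `image:` reference. A hedged before/after sketch (the exact keys and commands in the real "mlreef-template.yml" may differ):

```yaml
# Before (DinD, sketch): the job runs in a generic image and pulls
# the published image inside a nested Docker daemon.
# job:
#   image: docker:latest
#   services:
#     - docker:dind
#   script:
#     - docker run $PUBLISHED_IMAGE python /app/run.py

# After (sketch): the job runs directly in the published image.
job:
  image: $PUBLISHED_IMAGE   # placeholder for the published image reference
  script:
    - python /app/run.py    # hypothetical entrypoint
```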