Continuous Delivery for Machine Learning (#4517) · Epics · GitLab.org

Continuous Delivery for Machine Learning

Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles.

![image](/uploads/6b7db7090f219dd896b477b64aa706fb/image.png)

Dev steps: train a prediction model using the labeled input data, integrate that model into a simple web application, and then deploy the application to a production environment.

Once deployed, our web application (Figure 3) allows users to select a product and a date in the future, and the model outputs its prediction of how many units of that product will be sold on that day.

Different teams (personas) collaborate on this model:

![image](/uploads/a410ae93d9cb3f3351f941f8e1bde693/image.png)

### New terminology

**Data Pipelines**

Pipeline is an overloaded term, especially in ML applications. We define a "data pipeline" as the process that takes input data through a series of transformation stages, producing data as output. Both the input and output data can be fetched from and stored in different locations, such as a database, a stream, or a file. The transformation stages are usually defined in code, although some ETL tools allow you to represent them in graphical form. They can be executed either as a batch job or as a long-running streaming application. For the purposes of CD4ML, we treat a data pipeline as an artifact, which can be version controlled, tested, and deployed to a target execution environment.

**Machine Learning Pipelines**

The "machine learning pipeline", also called the "model training pipeline", is the process that takes data and code as input and produces a trained ML model as output. This process usually involves data cleaning and pre-processing, feature engineering, model and algorithm selection, and model optimization and evaluation.
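
The two pipeline definitions above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample sales rows, the stage functions `drop_missing` and `group_by_product`, and the "model", which is just the mean units sold per product rather than a real learning algorithm. The point is only that a data pipeline is an ordered series of transformation stages defined in code, and that an ML pipeline takes data and code in and returns a trained model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw input: (product, day, units_sold); None marks a missing value.
RAW_SALES = [
    ("apples", 1, 10),
    ("apples", 2, None),
    ("apples", 3, 14),
    ("pears", 1, 4),
    ("pears", 2, 6),
]


def drop_missing(rows):
    """Transformation stage: discard rows with no recorded units."""
    return [row for row in rows if row[2] is not None]


def group_by_product(rows):
    """Transformation stage: collect the units sold for each product."""
    grouped = defaultdict(list)
    for product, _day, units in rows:
        grouped[product].append(units)
    return dict(grouped)


def run_data_pipeline(rows, stages):
    """Data pipeline: pass the input through each stage in order."""
    data = rows
    for stage in stages:
        data = stage(data)
    return data


def train_model(features):
    """Stand-in for training: the 'model' is the mean units per product."""
    return {product: mean(units) for product, units in features.items()}


def run_ml_pipeline(rows):
    """ML pipeline: data and code in, trained model out."""
    features = run_data_pipeline(rows, [drop_missing, group_by_product])
    return train_model(features)


model = run_ml_pipeline(RAW_SALES)
print(model)
```

Because each stage is a plain function, the pipeline can be kept in version control and each stage can be unit tested on its own, which is what treating it as an artifact amounts to in this sketch.
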
While developing this process encompasses a major part of a Data Scientist's workflow, for the purposes of CD4ML we treat the ML pipeline as the final automated implementation of the chosen model training process.

**Bringing this into GitLab's pipeline**

In CD4ML, we can model automated and manual ML governance stages into our deployment pipeline, to help detect model bias and fairness issues, or to introduce explainability so that humans can decide whether the model should progress further towards production.

![image](/uploads/246dec056ad33daeedd5fab7c709a3e8/image.png)

![image](/uploads/633075e0a603606bbe55dc2b099b2b52/image.png)

* https://martinfowler.com/articles/cd4ml.html
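
As a rough illustration of an automated governance stage like the one described above, the Python sketch below checks a candidate model's validation metrics against thresholds and reports whether it may progress. The function name `governance_gate`, the group names, the metric values, and the thresholds are all hypothetical. The error gap between groups is used here only as a crude stand-in for a bias check, not as a recommended fairness metric.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list


def governance_gate(metrics, max_error, max_group_gap):
    """Automated governance stage: decide whether a model may be promoted.

    `metrics` maps a group name (hypothetical, e.g. a sales region) to the
    model's mean absolute error on that group's validation data.
    """
    reasons = []
    worst = max(metrics.values())
    gap = worst - min(metrics.values())
    if worst > max_error:
        reasons.append(f"error {worst:.2f} exceeds limit {max_error:.2f}")
    if gap > max_group_gap:
        reasons.append(f"group gap {gap:.2f} exceeds limit {max_group_gap:.2f}")
    return GateResult(passed=not reasons, reasons=reasons)


# Hypothetical metrics for a candidate model.
result = governance_gate(
    {"north": 1.8, "south": 3.1}, max_error=3.0, max_group_gap=1.0
)

# A CI job script could pass this value to sys.exit(); a non-zero exit
# code fails the job and stops the model from moving to later stages.
exit_code = 0 if result.passed else 1
print(result.passed, result.reasons, exit_code)
```

A manual governance stage would complement this check: the collected `reasons` give a human reviewer something concrete to read before deciding whether the model should progress.
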
