Product Vision for Data Engineering/Analytics/Science (DataOps)
I was rewatching the 2018 Product Vision video https://www.youtube.com/watch?v=RmSTLGnEmpQ and some thoughts came to mind around Meltano.
- What if there was a separate tab within GitLab for Data Operations? Basically, a souped-up, better version of the Airflow UI for managing batch (and maybe streaming) jobs, viewing logs, errors, etc.
- It'd enable you to surface alerts like "hey, there's a new field on this Salesforce object. We've automatically mapped it to xyz, but you can override it here."
- You could have aggregate stats on specific jobs and highlight areas for improvement ("Job Y has a 4.3% failure rate; click to see logs").
- Secret variable management could be integrated and tied to specific jobs.
- Schema manifests could be read and interacted with via the UI.
- API limits could be declared and managed in the UI, and the DataOps tab would keep track of calls. You could even declare the API test harness for each source.
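To make the API-limit idea a bit more concrete, here's a minimal sketch of a per-source call budget. Everything here is hypothetical: the `ApiBudget` class, its methods, and the Salesforce numbers are illustrative assumptions, not an existing Meltano or GitLab API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class ApiBudget:
    """Track calls against a declared per-window API limit for one source.

    Hypothetical sketch: this class and its numbers are illustrative,
    not an existing Meltano or GitLab API.
    """
    limit: int               # max calls allowed per window
    window_seconds: float    # length of the rate-limit window
    calls: list = field(default_factory=list)

    def record_call(self) -> None:
        self.calls.append(time.monotonic())

    def calls_in_window(self) -> int:
        cutoff = time.monotonic() - self.window_seconds
        return sum(1 for t in self.calls if t >= cutoff)

    def can_call(self) -> bool:
        return self.calls_in_window() < self.limit

# A DataOps tab could surface one budget per source, e.g. a daily
# Salesforce REST limit (the number is made up):
salesforce = ApiBudget(limit=100_000, window_seconds=24 * 3600)
salesforce.record_call()
print(salesforce.can_call())  # True while under the daily limit
```

The tab could then warn or pause extraction jobs as a source approaches its declared budget, instead of letting jobs fail mid-run.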
CI isn't the right place for moving data
- CI/CD is about testing, not actually moving data around. We could have default, recommended tests for each pipeline that's integrated into GitLab. The tests could be minimal data-integrity checks (like what we're doing with dbt), or they could be large-scale, where ~10% of every table is used as the basis for a new data warehouse and the pipelines are run against that.
- The DataOps tab then becomes the management center for actually moving the data around. Pipelines are kicked off continuously (every ~10 min.), and once tests pass on a new version of a pipeline, the next pipeline run picks up the new version.
- By keeping CI focused on testing everything about pipelines and data movement, we take the pressure off CI to actually run everything all the time.
- We could have a tight integration with dbt and show the transformation DAG that it generates.
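As a rough illustration of the data-integrity idea, here is a minimal sketch of dbt-style checks run against a ~10% sample. The function names (`sample_rows`, `check_not_null`, `check_unique`) are hypothetical helpers meant only to show the shape of such tests, not dbt's actual implementation.

```python
import random

def sample_rows(rows, fraction=0.10, seed=42):
    """Reproducibly sample ~`fraction` of a table, e.g. to build a
    scaled-down test warehouse. Illustrative helper, not a real API."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def check_not_null(rows, column):
    """dbt-style not_null test: return the offending rows."""
    return [row for row in rows if row.get(column) is None]

def check_unique(rows, column):
    """dbt-style unique test: return rows whose value was already seen."""
    seen, duplicates = set(), []
    for row in rows:
        value = row.get(column)
        if value in seen:
            duplicates.append(row)
        seen.add(value)
    return duplicates

# CI would promote a pipeline version only if both checks come back empty:
rows = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": None}]
print(check_not_null(rows, "id"))  # [{'id': None}]
print(check_unique(rows, "id"))    # [{'id': 2}]
```

In this picture, CI runs checks like these against the sampled warehouse, and the DataOps tab only picks up a pipeline version whose checks all pass.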
- This could then translate into versioning ML models and monitoring their performance in production. Similar to how a "gitlab bot" can auto-deploy and auto-revert, we could do the same thing with new versions of ML models as they pass or fail certain thresholds.
- Then you could integrate things like Lore so that in addition to a `git clone` of the project, you could `meltano clone` and get the harness required to do machine learning and to update any pipelines.
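The threshold-gated auto-deploy/auto-revert idea might look something like this sketch. The metric names, threshold values, and the `promote_or_revert` policy (including the three-way "deploy" / "revert" / "hold" outcome) are all illustrative assumptions, not an existing GitLab feature.

```python
def promote_or_revert(candidate_metrics, production_metrics, thresholds):
    """Decide whether a bot should deploy or revert a new model version.

    Hypothetical policy: metric names, thresholds, and the three-way
    outcome ("deploy" / "revert" / "hold") are illustrative assumptions.
    """
    # Any thresholded metric below its floor triggers an auto-revert.
    for name, minimum in thresholds.items():
        if candidate_metrics.get(name, 0.0) < minimum:
            return "revert"
    # Deploy only if the candidate matches or beats production on
    # every thresholded metric; otherwise keep the current version.
    if all(candidate_metrics.get(name, 0.0) >= production_metrics.get(name, 0.0)
           for name in thresholds):
        return "deploy"
    return "hold"

print(promote_or_revert({"auc": 0.91}, {"auc": 0.89}, {"auc": 0.85}))  # deploy
```

A bot watching production metrics could run a policy like this on every new model version, the same way it auto-deploys and auto-reverts application code today.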
I'm a little all over the place with this, but that video got my brain juices flowing. The code of a project declares what the application should be doing, CI tests changes so they don't break the application, and CD deploys new changes to the application. In this case, the application is constantly moving data around, but we could build smart abstractions for that app to make it easier and more integrated!
cc @joshlambert @jschatz1 @tlapiana @emilielimaburke @iroussos @mbergeron @zamai