Support for data validation with Great Expectations
Problem to solve
This started with a conversation with @DouweM: we were wondering whether data validation with Great Expectations could be integrated more tightly into a Meltano workflow. This could be as simple as a `meltano add validator`-style wrapper with integration into ELT runs, or some other form of integration.
It would also be neat for Great Expectations to show up in the Meltano UI, maybe with Data Docs and validation results embedded there, potentially even with a "configure" interface for the `great_expectations.yml` config file (although that is definitely a stretch).
One example for a UI integration of GE validation results is Dagster: https://dagster.io/blog/great-expectations-for-dagster
Target audience
The target audience would be two-fold:
- Data engineers/owners of the Meltano workflow who run the pipelines and want to see whether all source data was correctly extracted and whether data was correctly transformed, and
- Data consumers (stakeholders) who want to use Data Docs for data documentation and to see the most recent validation status.
Further details
Not much to add: data validation is crucial for pipelines, and I think making it more accessible as part of Meltano would be really beneficial for users!
Proposal (older)
I would love to hear from existing GE users how they envision this could work - here are some of my thoughts:
- At a very high level, maybe we could add another key concept like a "validator" that can be added to a Meltano project. This could potentially even be a wrapper for `great_expectations init` to initialize a GE data context in the Meltano project.
- A user would then either configure a GE datasource manually, or all data connections could be inherited from Meltano.
- The user would then create Expectation Suites and Checkpoints. Running `meltano elt` (possibly with a `--run-validation` flag?) could then trigger all configured Checkpoints as part of a pipeline. I'm not quite clear yet on how, or whether, a user would specify when exactly to run validation.
What does success look like, and how can we measure that?
Acceptance criteria: a user can run validation with GE as part of a Meltano pipeline run without having to invoke GE separately, and can see the results either in Data Docs or directly in the Meltano UI.
Success: people actually run validation with GE in Meltano!
Links / references
- GE homepage: https://greatexpectations.io
- Dagster integration: https://dagster.io/blog/great-expectations-for-dagster
- Airflow operator (as an example of how to invoke validation): https://github.com/great-expectations/airflow-provider-great-expectations
Updated Proposal (2022-01-05)
- Using the recently released Meltano `dbt test` and `test` command features, Meltano users should be able to add Great Expectations as a utility plugin.
  - First phase: Great Expectations can be added manually to `meltano.yml` by users familiar with Great Expectations (or in Meltano-owned projects like the Hub).
  - Second phase: Assuming positive results from the first phase above, Great Expectations will be added to `discovery.yml` so that users can add it using the command `meltano add utility great-expectations` (with no `--custom` flag).
- Within `meltano.yml`, users will add `commands` with names starting with a `test*` prefix.
- Tests will be runnable using any of these:
  - `meltano test --all`
  - `meltano test great-expectations`
  - `meltano test great-expectations:test-foo great-expectations:test-bar`
  - `meltano run great-expectations:test-foo great-expectations:test-bar` (because tests are also commands)
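As a rough illustration, a first-phase `meltano.yml` entry might look like the sketch below. The plugin settings, command names, and checkpoint names here are assumptions for illustration, not a confirmed interface; each `test*`-prefixed command maps to the real GE CLI invocation `great_expectations checkpoint run <name>`.

```yaml
plugins:
  utilities:
    - name: great-expectations
      namespace: great_expectations
      pip_url: great_expectations
      executable: great_expectations
      commands:
        # each `test*`-prefixed command runs one GE Checkpoint
        # (checkpoint names are placeholders)
        test-users: checkpoint run users_checkpoint
        test-orders: checkpoint run orders_checkpoint
```

With a definition like this, per the proposal above, `meltano test great-expectations` would run both checkpoints, while `meltano run great-expectations:test-users` would run only one.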
Special commands and functions

Docs/UI: We may also need to come up with a `great-expectations:ui` command which would build and launch Data Docs.

Init: We may also want a predefined `init` command, and perhaps commands for other administrative operations.
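These administrative commands could be sketched in the same plugin definition as the `test*` commands. The mapping to GE CLI subcommands below is an assumption; `docs build` and `init` are existing `great_expectations` CLI subcommands, but whether Meltano would expose them this way is an open question.

```yaml
      commands:
        # hypothetical administrative commands alongside the test* commands
        ui: docs build   # rebuild Data Docs; serving/opening them would be a separate step
        init: init       # wraps `great_expectations init` to create the data context
```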