Draft: Data validation with Pydantic

Edgar R. Mondragón requested to merge (removed):feature/pydantic into main

No code approach

SDK users can already install Pydantic for their taps and swap PropertiesList(...).to_dict() for Model.schema(), where Model is a Pydantic model. However, validation and environment parsing would still be handled by PluginBase._valid_config(), so this wouldn't leverage the full potential of Pydantic.

Note also that this approach doesn't work for streams whose schema has nested objects. Pydantic emits those fields as "$ref" entries and puts the actual schemas under the definitions key, so the nested schemas would be missing from the Singer schema. The SDK could fix this by resolving the full nested schema at runtime and outputting the complete schema in the schema message. We might have to look at https://python-jsonschema.readthedocs.io/en/stable/references/#jsonschema.RefResolver.resolve_fragment.
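To make the "$ref" problem concrete, here is a minimal sketch of what resolving the nested schema could look like. It uses only the standard library rather than jsonschema's RefResolver, handles only local "#/definitions/..." references, and the function name inline_refs and the sample schema are made up for illustration; none of this is existing SDK code.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any


def inline_refs(schema: dict[str, Any], defs_key: str = "definitions") -> dict[str, Any]:
    """Return a copy of `schema` with local "$ref" entries replaced by their targets.

    Hypothetical helper: only "#/<defs_key>/<name>" references are supported.
    """
    definitions = schema.get(defs_key, {})
    prefix = f"#/{defs_key}/"

    def resolve(node: Any, seen: tuple[str, ...] = ()) -> Any:
        if isinstance(node, list):
            return [resolve(item, seen) for item in node]
        if not isinstance(node, dict):
            return node
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(prefix):
            name = ref[len(prefix):]
            if name in seen:
                # A self-referencing model cannot be inlined into a finite schema.
                raise ValueError(f"Recursive reference: {ref}")
            target = resolve(deepcopy(definitions[name]), seen + (name,))
            # Keywords next to "$ref" (e.g. "description") override the target's.
            siblings = {k: resolve(v, seen) for k, v in node.items() if k != "$ref"}
            return {**target, **siblings}
        return {key: resolve(value, seen) for key, value in node.items()}

    # Drop the top-level definitions block; it is not needed once refs are inlined.
    return resolve({k: v for k, v in schema.items() if k != defs_key})


# Shaped like what Pydantic generates for a model with one nested model.
pydantic_style_schema = {
    "title": "Project",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "owner": {"$ref": "#/definitions/Owner"},
    },
    "definitions": {
        "Owner": {
            "title": "Owner",
            "type": "object",
            "properties": {"name": {"type": "string"}},
        }
    },
}

resolved = inline_refs(pydantic_style_schema)
```

After this, resolved["properties"]["owner"] holds the full Owner object schema, so a schema message built from it would be self-contained.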

Pydantic Settings and plugin config

A better approach for plugin config with Pydantic is to extend BaseSettings and take advantage of Pydantic's env and secret parsing capabilities. This can be done while continuing to support config_jsonschema by adding the optional attribute config_model. This is the approach followed in this MR.

With that, the sample GitLab tap can be run like this:

TAP_GITLAB_AUTH_TOKEN=... \
TAP_GITLAB_PROJECT_IDS=[22672923] \
poetry run gitlab --config ENV --config minConfig.json

where minConfig.json is

{
    "group_ids": [
        "2524164"
    ],
    "start_date": "2015-01-01T00:00:00"
}

TODO

  • By default, environment parsing is eager, so we'd have to override the logic to make it depend on the runtime plugin option parse_env_config.
  • Derive the environment prefix from the plugin name instead of expecting the user to specify it in the settings' Config class (see singer_sdk/samples/sample_tap_gitlab/gitlab_tap.py). I think this is best if we want to keep consistency between all the SDK projects, wdyt @aaronsteers?
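A rough sketch of the two TODO items together, using only the standard library so it doesn't depend on a particular Pydantic version. The plugin name "tap-gitlab", the helper names env_prefix and config_from_env, and the rule "JSON-decode values that look like lists or objects" are assumptions for illustration. The rule only approximates how Pydantic treats complex-typed fields, and this is not the code in this MR.

```python
from __future__ import annotations

import json
from typing import Any, Mapping


def env_prefix(plugin_name: str) -> str:
    """Derive the environment prefix from the plugin name (hypothetical helper)."""
    return plugin_name.upper().replace("-", "_") + "_"


def config_from_env(
    plugin_name: str,
    environ: Mapping[str, str],
    parse_env_config: bool = False,
) -> dict[str, Any]:
    """Collect prefixed environment variables, but only when explicitly enabled."""
    if not parse_env_config:
        # Lazy by default: the environment is ignored unless the option is set.
        return {}

    prefix = env_prefix(plugin_name)
    config: dict[str, Any] = {}
    for key, raw in environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].lower()
        value: Any = raw
        # Assumption: only values that look like JSON lists/objects are decoded.
        if raw.lstrip().startswith(("[", "{")):
            try:
                value = json.loads(raw)
            except json.JSONDecodeError:
                value = raw  # Not valid JSON after all; keep the raw string.
        config[name] = value
    return config


# Example environment, mirroring the command shown earlier in this description.
example_env = {
    "TAP_GITLAB_AUTH_TOKEN": "s3cr3t",
    "TAP_GITLAB_PROJECT_IDS": "[22672923]",
    "PATH": "/usr/bin",
}

env_config = config_from_env("tap-gitlab", example_env, parse_env_config=True)
ignored = config_from_env("tap-gitlab", example_env)
```

With the flag enabled, env_config contains auth_token as a string and project_ids as a list of integers. With the default, nothing is read from the environment.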

Out of scope?

  • As mentioned in the "no code" section, currently the SDK does not resolve "$ref"s in the stream schema (which are totally valid in the JSON schema spec). This is currently handled by some targets (see datamill's target-postgres) but it would be great if the SDK managed this for everyone 😄. Fixing this would allow SDK users to reuse definitions in their schema files if they don't use Pydantic or the SDK typing helpers.

    I think this has to be done before _pop_deselected_schema is called as it expects a fully resolved schema.

Edited by Edgar R. Mondragón
