Connection-free testing capability: `--replay` and `--demo`
Background
In my effort to support development on `tap-powerbi-metadata`, I'm realizing that I am overly reliant on the tester and co-developer because I don't have access to valid credentials for the source. It is not trivial to create test creds either, since this requires at minimum: creating a new Azure account with a valid credit card, creating an Active Directory Domain, creating a Service Principal, creating a new Power BI tenant, and finally granting my Service Principal access to the Power BI tenant. In fact, there are still more steps on top of the above, since my new Power BI environment probably doesn't have any reports, workspaces, log history, etc. In other words, even after gaining valid creds, I would not necessarily have valid test data to validate that the data sync methods are working properly.
Proposal:
As a sibling and complement of planned connection tests in #14 (closed), we would like to be able to run "as close as possible" to a full sync test without having any access to the upstream connection. This is difficult by nature to make generic, since there isn't a "dummy data set" which would make sense for every stream. We'd also like to create portable and replayable versions of synced data, so that we get better reproducibility - and to do so in a way that "just works", regardless of which tap or source system we are dealing with.
Use cases
- Find bugs, repro them, and apply fixes locally - without requiring source credentials or network connectivity.
- Provide a sample data output option for tap users.
  - By specifying the path to a "golden set" of jsonl output in the repo, developers can optionally "opt-in" to enabling sample data output for end users. This "demo mode" would allow users (and orchestrators like meltano!) to preview the data and understand the emitted data shape even before specifying a connection. To be widely adopted, the data format for the sample set needs to be generated automatically from the SDK (of course, likely after some amount of anonymizing by the developer).
- Adding replay-capability would vastly expand the number and breadth of tests which could be run - with or without connectivity. Importantly:
  - We could test that `SCHEMA` message generation is correct, based on catalog input and/or predefined tap logic.
  - We could verify `STATE` message emitting behaviors against config inputs.
  - We could ensure stream selection rules are tested, as specified by the optional catalog input, per usual.
  - We can test that streams' `jsonschema` config is compliant with the JSON Schema spec and that data within the sample stream complies with the specified `jsonschema` definition on the stream. (See the validation sketch at the end of this list.)
- Bug reproducibility across teams.
  - We would have a better way to repro errors across teams: as a user experiences an issue specific to their environment, they can save the output and transmit it (securely and/or anonymized, of course) to a tap developer, who can then replay the offending stream, repro the issue, debug and confirm the fix, add unit tests, etc. - all without ongoing input from the user.
- Perform quality control tests en masse.
  - For indexes like the planned `singer-db`, we can more realistically scale to hundreds or thousands of taps, performing automated generic CI/CD testing without having to manage hundreds or thousands of corresponding source credentials. (Having sample data capabilities would likely result in a "badge" of some sort in the index, along with the latest CI/CD test results based on that sample data.)
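To make the `jsonschema` validation use case above concrete, here is a minimal sketch (not SDK code; the helper and file paths are hypothetical) using the third-party `jsonschema` package to assert both that a stream's declared schema is itself valid and that every record in the sample data conforms to it:

```python
import json

from jsonschema import Draft7Validator  # third-party `jsonschema` package


def load_replay_records(path: str, stream_name: str):
    """Yield raw record dicts for one stream from a captured jsonl sync output."""
    with open(path, encoding="utf-8") as replay_file:
        for line in replay_file:
            message = json.loads(line)
            if message.get("type") == "RECORD" and message.get("stream") == stream_name:
                yield message["record"]


def test_replayed_records_match_schema():
    # Hypothetical paths to a committed "golden set" and the stream's declared schema.
    with open("tests/demo/mystream.schema.json", encoding="utf-8") as schema_file:
        schema = json.load(schema_file)

    # 1. The declared schema must itself be valid per the JSON Schema spec.
    Draft7Validator.check_schema(schema)

    # 2. Every record in the sample data must conform to that schema.
    validator = Draft7Validator(schema)
    for record in load_replay_records("tests/demo/replay-file.jsonl", "mystream"):
        validator.validate(record)
```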
Proposed implementation
Internal changes proposed:
Write a new `Stream._replay_records()` method into the SDK base classes as an alternative path to `get_records()`. This function would never need to be overridden by developers, since it would be implemented generically. In order to meet that design goal (i.e. not requiring dev effort), we would require a generic, predefined text file format. The easiest and most generalizable file format is our already-defined `jsonl` output from the tap itself.
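As a rough illustration only (the `replay_path` attribute and the simplified `Stream` shell below are placeholders, not an actual SDK interface), the generic replay path could amount to filtering `RECORD` messages for the current stream out of the captured `jsonl` file:

```python
import json
import typing as t
from pathlib import Path


class Stream:
    """Illustrative stand-in for the SDK's Stream base class."""

    name: str = "mystream"
    replay_path: t.Optional[Path] = None  # hypothetical: set when replay mode is requested

    def _replay_records(self, context: t.Optional[dict] = None) -> t.Iterable[dict]:
        """Yield raw record dicts from a previously captured jsonl sync output.

        SCHEMA and STATE messages in the file are skipped so that the SDK still
        regenerates them through its normal code paths.
        """
        if self.replay_path is None:
            raise ValueError("Replay mode requires a path to a jsonl file.")
        with open(self.replay_path, encoding="utf-8") as replay_file:
            for line in replay_file:
                message = json.loads(line)
                if message.get("type") == "RECORD" and message.get("stream") == self.name:
                    yield message["record"]
```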
Proposed CLI updates:
- Add a new `--replay=path/to/output.jsonl` capability which would then run in dry-run mode using the sample data. The process of creating a source connection would then be skipped.
  - At least initially, `--catalog` would be required whenever `--replay` is set.
  - The `--replay` option should be
- Add a new optional `--demo` capability which is automatically enabled if the tap developer specifies a path to a valid and replayable demo data set, including a catalog file and at least one jsonl file. When the capability is supported, `tap-mysource --demo` is equivalent to `tap-mysource --replay=path/to/demo/replay-file.json --catalog=path/to/demo/catalog.json`. (A rough CLI sketch follows this list.)
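A rough sketch of how the flags might be wired together (shown with `click` purely for illustration; the demo file locations and everything other than the `--replay`, `--demo`, and `--catalog` names are assumptions):

```python
from pathlib import Path

import click

# Hypothetical demo assets a tap developer could ship alongside the package.
DEMO_REPLAY_FILE = Path("demo/replay-file.json")
DEMO_CATALOG_FILE = Path("demo/catalog.json")


@click.command()
@click.option("--catalog", type=click.Path(exists=True), help="Path to a catalog file.")
@click.option(
    "--replay",
    type=click.Path(exists=True),
    help="Replay a captured jsonl sync output instead of connecting to the source.",
)
@click.option("--demo", is_flag=True, help="Replay the bundled demo data set, if one is shipped.")
def cli(catalog: str, replay: str, demo: bool) -> None:
    """Sketch: `--demo` expands to `--replay` plus `--catalog` pointing at bundled files."""
    if demo:
        replay = str(DEMO_REPLAY_FILE)
        catalog = catalog or str(DEMO_CATALOG_FILE)
    if replay and not catalog:
        raise click.UsageError("--catalog is required whenever --replay is set.")
    # ... continue with a normal sync, swapping get_records() for the replay path
    # whenever `replay` is set, and skipping source connection setup entirely.
```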
Why raw `jsonl` sync output as the standard "data replay" format:
After considering several options, I landed on native `jsonl` output as the best storage mechanism I could think of for enabling this functionality across the wide ecosystem of existing taps.
- By definition, this output already describes all the nuances of each diverse data set, which is hard to say for any other data serialization method. I first considered using `target-csv` generically, but experience has shown that CSV doesn't work well for complex and nested data sets. We could consider `target-jsonl` or `target-parquet`, but neither is simpler or offers any significant benefit over simply replaying the raw output data. (See the "out of scope" section below for possible future extensibility options.)
- As a native text file format, `jsonl` is very easy to review for PII and other confidential information, which could then relatively easily be replaced with obfuscated/generic data. (See "out of scope" for thoughts around auto-obfuscation.)
- It's very easy to truncate all but the first 100 or 1000 rows in order to get a smaller data file. (A short truncation sketch follows this list.)
- At least in terms of generating the datasets themselves, no new code or training is needed, since this already comes out of the box with every Singer tap - even those not built on the SDK. (That means we can replay data generated on a pre-SDK version using the SDK version, and then validate the new output against the original.)
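For instance, keeping only the first N `RECORD` messages per stream (while passing `SCHEMA` and `STATE` messages through untouched) takes just a few lines; this is only a sketch of that idea, with hypothetical file names:

```python
import json
from collections import Counter

MAX_RECORDS_PER_STREAM = 100  # keep only the first 100 rows of each stream

record_counts: Counter = Counter()

with open("full-output.jsonl", encoding="utf-8") as source, \
        open("replay-file.jsonl", "w", encoding="utf-8") as truncated:
    for line in source:
        message = json.loads(line)
        if message.get("type") == "RECORD":
            record_counts[message.get("stream")] += 1
            if record_counts[message.get("stream")] > MAX_RECORDS_PER_STREAM:
                continue  # drop records beyond the per-stream cap
        # SCHEMA, STATE, and the kept RECORD messages pass through unchanged.
        truncated.write(line)
```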
Other Notes:
Caveats:
For this to be valuable and effective for testing purposes, we should run through as much of the "real" data flow as possible:
- Since part of what we want to test is that `SCHEMA` and `RECORD` message types are properly generated, we would need to treat `RECORD` messages as "raw data" and not simply echo them.
- Similarly for `SCHEMA` messages: the schema messages stored in the `jsonl` output should either be ignored completely or used as test assertions. We would not simply echo them out, since one of the objectives of the test is to ensure that they are correctly generated by the developer's implementation. (See the assertion sketch after this list.)
- Config values would still need to be parsed or passed as usual, since some of those config values will modify how the output is generated. Credential-based config might still be required (as per the usual validation rules), but dummy values could be passed, since those specific setting values would effectively be ignored.
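As one way to picture the "SCHEMA messages as test assertions" idea, here is a minimal sketch (the `discovered_schemas` mapping is a placeholder for however the tap's current code regenerates its schemas):

```python
import json


def assert_schemas_match(replay_path: str, discovered_schemas: dict) -> None:
    """Compare SCHEMA messages captured in a replay file against freshly generated schemas.

    `discovered_schemas` is a placeholder mapping of stream name to the schema the tap's
    current code produces. Captured SCHEMA messages are used as assertions here rather
    than being echoed back out.
    """
    with open(replay_path, encoding="utf-8") as replay_file:
        for line in replay_file:
            message = json.loads(line)
            if message.get("type") != "SCHEMA":
                continue
            stream_name = message["stream"]
            expected = message["schema"]
            actual = discovered_schemas[stream_name]
            assert actual == expected, f"Schema drift detected for stream '{stream_name}'"
```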
Out-of-scope but worthy of discussion:
- Eventually we could add an auto-anonymization option via something like pyanonymizer.
- We might eventually allow developers to write alternative dummy-data generation methods in addition to the generic, private `_replay_records()` method discussed here.
- We might eventually create a process or toolset for running diffs against successive outputs. For example, this could be built as a CI/CD test to better ensure properly behaving taps and to highlight any changes across releases. (A minimal diff sketch follows this list.)
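As a rough illustration of that diffing idea (file names are hypothetical), a CI job could simply compare the parsed `RECORD` messages of two successive runs:

```python
import json


def records_by_stream(path: str) -> dict:
    """Group the raw record dicts in a jsonl sync output by stream name."""
    grouped: dict = {}
    with open(path, encoding="utf-8") as output_file:
        for line in output_file:
            message = json.loads(line)
            if message.get("type") == "RECORD":
                grouped.setdefault(message["stream"], []).append(message["record"])
    return grouped


# Hypothetical output artifacts from two successive releases of the same tap.
previous = records_by_stream("output-v1.jsonl")
current = records_by_stream("output-v2.jsonl")

for stream_name in sorted(set(previous) | set(current)):
    if previous.get(stream_name) != current.get(stream_name):
        print(f"Stream '{stream_name}' changed between releases.")
```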