Connection-free testing capability: `--replay` and `--demo`
Background
In my effort to support development on `tap-powerbi-metadata`, I'm realizing that I am overly reliant on the tester and co-developer because I don't have access to valid credentials for the source. It is not trivial to create test creds either, since this requires at minimum: creating a new Azure account with a valid credit card, creating an Active Directory Domain, creating a Service Principal, creating a new Power BI tenant, and finally granting my Service Principal access to the Power BI tenant. In fact, there are still more steps on top of the above, since my new Power BI environment probably doesn't have any reports, workspaces, log history, etc. In other words, even after gaining valid creds, I would not necessarily have valid test data to validate that the data sync methods are working properly.
Proposal:
As a sibling and complement of planned connection tests in #14 (closed), we would like to be able to run "as close as possible" to a full sync test without having any access to the upstream connection. This is difficult by nature to make generic, since there isn't a "dummy data set" which would make sense for every stream. We'd also like to create portable and replayable versions of synced data, so that we get better reproducibility - and to do so in a way that "just works", regardless of which tap or source system we are dealing with.
Use cases
- Find bugs, repro them, and apply fixes locally - without requiring source credentials or network connectivity.
- Provide a sample data output option for tap users.
  - By specifying the path to a "golden set" of jsonl output in the repo, developers can optionally "opt-in" to enabling sample data output for end users. This "demo mode" would allow users (and orchestrators like meltano!) to preview the data and understand the emitted data shape even before specifying a connection. To be widely adopted, the data format for the sample set needs to be generated automatically from the SDK (of course, likely after some amount of anonymizing by the developer).
- Adding replay-capability would vastly expand the number and breadth of tests which could be run - with or without connectivity. Importantly:
  - We could test that `SCHEMA` message generation is correct, based on catalog input and/or predefined tap logic.
  - We could verify `STATE` message emitting behaviors against config inputs.
  - We could ensure stream selection rules are tested, as specified by the optional catalog input, per usual.
  - We can test that streams' `jsonschema` config is compliant with the JSON Schema spec and that data within the sample stream complies with the specified `jsonschema` definition on the stream. (See the validation sketch at the end of this list.)
- Bug reproducibility across teams.
  - We would have a better way to repro errors across teams: as a user experiences an issue specific to their environment, they can save the output and transmit it (securely and/or anonymized, of course) to a tap developer, who can then replay the offending stream, repro the issue, debug and confirm the fix, add unit tests, etc. - all without ongoing input from the user.
- Perform quality control tests en masse.
  - For indexes like the planned `singer-db`, we can more realistically scale to hundreds or thousands of taps, performing automated generic CI/CD testing without having to manage hundreds or thousands of corresponding source credentials. (Having sample data capabilities would likely result in a "badge" of some sort in the index, along with the latest CI/CD test results based on that sample data.)
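To make the `jsonschema` validation use case above concrete, here is a minimal sketch (not SDK code; the helper and file paths are hypothetical) using the third-party `jsonschema` package to assert both that a stream's declared schema is itself valid and that every record in the sample data conforms to it:

```python
import json

from jsonschema import Draft7Validator  # third-party `jsonschema` package


def load_replay_records(path: str, stream_name: str):
    """Yield raw record dicts for one stream from a captured jsonl sync output."""
    with open(path, encoding="utf-8") as replay_file:
        for line in replay_file:
            message = json.loads(line)
            if message.get("type") == "RECORD" and message.get("stream") == stream_name:
                yield message["record"]


def test_replayed_records_match_schema():
    # Hypothetical paths to a committed "golden set" and the stream's declared schema.
    with open("tests/demo/mystream.schema.json", encoding="utf-8") as schema_file:
        schema = json.load(schema_file)

    # 1. The declared schema must itself be valid per the JSON Schema spec.
    Draft7Validator.check_schema(schema)

    # 2. Every record in the sample data must conform to that schema.
    validator = Draft7Validator(schema)
    for record in load_replay_records("tests/demo/replay-file.jsonl", "mystream"):
        validator.validate(record)
```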
Proposed implementation
Internal changes proposed:
Write a new `Stream._replay_records()` method into the SDK base classes as an alternative path to `get_records()`. This function would never need to be overridden by developers, since it would be implemented generically. In order to meet that design goal (i.e. not requiring dev effort), we would require a generic, predefined text file format. The easiest and most generalizable file format is our already-defined `jsonl` output from the tap itself.
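As a rough illustration only (the `replay_path` attribute and the simplified `Stream` shell below are placeholders, not an actual SDK interface), the generic replay path could amount to filtering `RECORD` messages for the current stream out of the captured `jsonl` file:

```python
import json
import typing as t
from pathlib import Path


class Stream:
    """Illustrative stand-in for the SDK's Stream base class."""

    name: str = "mystream"
    replay_path: t.Optional[Path] = None  # hypothetical: set when replay mode is requested

    def _replay_records(self, context: t.Optional[dict] = None) -> t.Iterable[dict]:
        """Yield raw record dicts from a previously captured jsonl sync output.

        SCHEMA and STATE messages in the file are skipped so that the SDK still
        regenerates them through its normal code paths.
        """
        if self.replay_path is None:
            raise ValueError("Replay mode requires a path to a jsonl file.")
        with open(self.replay_path, encoding="utf-8") as replay_file:
            for line in replay_file:
                message = json.loads(line)
                if message.get("type") == "RECORD" and message.get("stream") == self.name:
                    yield message["record"]
```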
Proposed CLI updates:
- Add a new `--replay=path/to/output.jsonl` capability which would then run in dry-run mode using the sample data. The process of creating a source connection would then be skipped.
  - At least initially, `--catalog` would be required whenever `--replay` is set.
  - The `--replay` option should be
- Add a new optional `--demo` capability which is automatically enabled if the tap developer specifies a path to a valid and replayable demo data set, including a catalog file and at least one jsonl file. When the capability is supported, `tap-mysource --demo` is equivalent to `tap-mysource --replay=path/to/demo/replay-file.json --catalog=path/to/demo/catalog.json`. (A rough CLI sketch follows this list.)
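A rough sketch of how the flags might be wired together (shown with `click` purely for illustration; the demo file locations and everything other than the `--replay`, `--demo`, and `--catalog` names are assumptions):

```python
from pathlib import Path

import click

# Hypothetical demo assets a tap developer could ship alongside the package.
DEMO_REPLAY_FILE = Path("demo/replay-file.json")
DEMO_CATALOG_FILE = Path("demo/catalog.json")


@click.command()
@click.option("--catalog", type=click.Path(exists=True), help="Path to a catalog file.")
@click.option(
    "--replay",
    type=click.Path(exists=True),
    help="Replay a captured jsonl sync output instead of connecting to the source.",
)
@click.option("--demo", is_flag=True, help="Replay the bundled demo data set, if one is shipped.")
def cli(catalog: str, replay: str, demo: bool) -> None:
    """Sketch: `--demo` expands to `--replay` plus `--catalog` pointing at bundled files."""
    if demo:
        replay = str(DEMO_REPLAY_FILE)
        catalog = catalog or str(DEMO_CATALOG_FILE)
    if replay and not catalog:
        raise click.UsageError("--catalog is required whenever --replay is set.")
    # ... continue with a normal sync, swapping get_records() for the replay path
    # whenever `replay` is set, and skipping source connection setup entirely.
```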
Why raw `jsonl` sync output as the standard "data replay" format:
After considering several options, I landed on native `jsonl` output as the best storage mechanism I could think of for enabling this functionality across the wide ecosystem of existing taps.
- By definition, this output already describes all the nuances of each diverse data set, which is hard to say for any other data serialization method. I first considered using `target-csv` generically, but experience has shown that CSV doesn't work well for complex and nested data sets. We could consider `target-jsonl` or `target-parquet`, but neither is simpler or offers any significant benefit over simply replaying the raw output data. (See the "out of scope" section below for possible future extensibility options.)
- As a native text file format, `jsonl` is very easy to review for PII and other confidential information, which could then relatively easily be replaced with obfuscated/generic data. (See "out of scope" for thoughts around auto-obfuscation.)
- It's very easy to truncate all but the first 100 or 1000 rows in order to get a smaller data file. (A short truncation sketch follows this list.)
- At least in terms of generating the datasets themselves, no new code or training is needed, since this already comes out of the box with every Singer tap - even those not built on the SDK. (That means we can replay data generated on a pre-SDK version using the SDK version, and then validate the new output against the original.)
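For instance, keeping only the first N `RECORD` messages per stream (while passing `SCHEMA` and `STATE` messages through untouched) takes just a few lines; this is only a sketch of that idea, with hypothetical file names:

```python
import json
from collections import Counter

MAX_RECORDS_PER_STREAM = 100  # keep only the first 100 rows of each stream

record_counts: Counter = Counter()

with open("full-output.jsonl", encoding="utf-8") as source, \
        open("replay-file.jsonl", "w", encoding="utf-8") as truncated:
    for line in source:
        message = json.loads(line)
        if message.get("type") == "RECORD":
            record_counts[message.get("stream")] += 1
            if record_counts[message.get("stream")] > MAX_RECORDS_PER_STREAM:
                continue  # drop records beyond the per-stream cap
        # SCHEMA, STATE, and the kept RECORD messages pass through unchanged.
        truncated.write(line)
```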
Other Notes:
Caveats:
For this to be valuable and effective for testing purposes, we should run through as much of the "real" data flow as possible:
- Since part of what we want to test is that `SCHEMA` and `RECORD` message types are properly generated, we would need to treat `RECORD` messages as "raw data" and not simply echo them.
- Similarly for `SCHEMA` messages: the schema messages stored in the `jsonl` output should either be ignored completely or used as test assertions. We would not simply echo them out, since one of the objectives of the test is to ensure that they are correctly generated by the developer's implementation. (See the assertion sketch after this list.)
- Config values would still need to be parsed or passed as usual, since some of those config values will modify how the output is generated. Credential-based config might still be required (as per the usual validation rules), but dummy values could be passed, since those specific setting values would effectively be ignored.
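As one way to picture the "SCHEMA messages as test assertions" idea, here is a minimal sketch (the `discovered_schemas` mapping is a placeholder for however the tap's current code regenerates its schemas):

```python
import json


def assert_schemas_match(replay_path: str, discovered_schemas: dict) -> None:
    """Compare SCHEMA messages captured in a replay file against freshly generated schemas.

    `discovered_schemas` is a placeholder mapping of stream name to the schema the tap's
    current code produces. Captured SCHEMA messages are used as assertions here rather
    than being echoed back out.
    """
    with open(replay_path, encoding="utf-8") as replay_file:
        for line in replay_file:
            message = json.loads(line)
            if message.get("type") != "SCHEMA":
                continue
            stream_name = message["stream"]
            expected = message["schema"]
            actual = discovered_schemas[stream_name]
            assert actual == expected, f"Schema drift detected for stream '{stream_name}'"
```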
Out-of-scope but worthy of discussion:
- Eventually we could add an auto-anonymization option via something like pyanonymizer.
- We might eventually allow developers to write alternative dummy-data generation methods in addition to the generic, private `_replay_records()` method discussed here.
- We might eventually create a process or toolset for running diffs against successive outputs. For example, this could be built as a CI/CD test to better ensure properly behaving taps and to highlight any changes across releases. (A minimal diff sketch follows this list.)
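As a rough illustration of that diffing idea (file names are hypothetical), a CI job could simply compare the parsed `RECORD` messages of two successive runs:

```python
import json


def records_by_stream(path: str) -> dict:
    """Group the raw record dicts in a jsonl sync output by stream name."""
    grouped: dict = {}
    with open(path, encoding="utf-8") as output_file:
        for line in output_file:
            message = json.loads(line)
            if message.get("type") == "RECORD":
                grouped.setdefault(message["stream"], []).append(message["record"])
    return grouped


# Hypothetical output artifacts from two successive releases of the same tap.
previous = records_by_stream("output-v1.jsonl")
current = records_by_stream("output-v2.jsonl")

for stream_name in sorted(set(previous) | set(current)):
    if previous.get(stream_name) != current.get(stream_name):
        print(f"Stream '{stream_name}' changed between releases.")
```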