Preview: Target SDK release
Closes #96. Partially implements #63. Includes and replaces !92.
Summary of implementation, from the docs here in this branch:
Creating targets with singer-sdk requires overriding just two classes:

- The `Target` class. This class governs configuration, validation, and stream discovery.
- The `Sink` class. This class is responsible for writing records to the target and keeping a tally of written records. Each `Sink` implementation may write records immediately in `Sink.load_record()` or in batches during `Sink.drain()`.
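The two-class split can be illustrated with a minimal plain-Python stand-in. This is a hypothetical sketch of the pattern, not the actual singer-sdk API; everything except the names `Target`, `Sink`, `load_record()`, `drain()`, and `get_sink()` is a placeholder:

```python
# Hypothetical sketch of the Target/Sink split; not the real singer-sdk API.
class Sink:
    """Writes records for one stream and tallies what it has written."""

    def __init__(self, stream_name):
        self.stream_name = stream_name
        self.records_written = 0
        self._pending = []

    def load_record(self, record):
        # A batch-style sink buffers here and writes in drain();
        # an immediate-style sink would write right away instead.
        self._pending.append(record)

    def drain(self):
        self.records_written += len(self._pending)
        self._pending.clear()


class Target:
    """Governs configuration, validation, and stream discovery."""

    def __init__(self, config=None):
        self.config = config or {}
        self._sinks = {}

    def get_sink(self, stream_name):
        # Default 1:1 mapping: one sink per incoming stream name.
        if stream_name not in self._sinks:
            self._sinks[stream_name] = Sink(stream_name)
        return self._sinks[stream_name]
```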
This writeup describes the need to decouple the `Sink` and `Stream` classes:

While the default implementation will create one `Sink` per stream, this is not required. The `Stream`:`Sink` mapping behavior can be overridden in the following ways:
- `1:1`: This is the default, where only one sink is active for each incoming stream name.
  - The exception to this rule is when a new STATE message is received for an already-active stream. In this case, the existing sink will be marked to be drained and a new sink will be initialized to receive the next incoming records.
  - In the case that a sink is archived because of a superseding STATE message, all prior version(s) of the stream's sink are guaranteed to be drained in creation order.
  - Example: a database-type target where each stream will land in a dedicated table.
- `1:many`: In this scenario, the target intentionally creates multiple sinks per stream. The developer may override `Target.get_sink()` and use details within the record (or a randomization algorithm) to send records to multiple sinks, all for the same stream.
  - Example: a data lake target where output files should be pre-partitioned according to one or more attributes within the record. Multiple smaller files, named according to their partition key values, are more efficient than fewer larger files.
- `many:1`: In this scenario, the target intentionally sends all records to the same sink, regardless of the stream name. The stream name will likely be made an attribute of the final output, but records do not need to be segregated by stream name.
  - Example: a JSON file writer where the desired output is a single combined JSON file with all records from all streams.
By default, the current implementation triggers a full drain and flush of all sink objects when a STATE message is received. This implementation prioritizes emitting the provided STATE message, and is well tuned for scenarios where only one stream is being sent at a time (the vast majority).
Lower priority: to optimize for randomized streams, we may in the future add an alternate implementation which instead prioritizes fewer drain operations, or a minimum batch size before the drain process is triggered.
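The drain-on-STATE behavior described above can be sketched as follows. This is a hypothetical stand-in for the message loop, not the SDK's actual implementation; the function and class names are invented for the example:

```python
# Hypothetical sketch: drain every sink before emitting a STATE message,
# so the emitted state never gets ahead of the records actually written.
import json


class BufferingSink:
    def __init__(self):
        self.buffer = []
        self.drained = []

    def drain(self):
        self.drained.extend(self.buffer)
        self.buffer.clear()


def process_line(line, sinks, emitted_states):
    msg = json.loads(line)
    if msg["type"] == "RECORD":
        sinks.setdefault(msg["stream"], BufferingSink()).buffer.append(msg["record"])
    elif msg["type"] == "STATE":
        # Full drain and flush of all sink objects, then emit the STATE.
        for sink in sinks.values():
            sink.drain()
        emitted_states.append(msg["value"])
```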
Included samples
I have two samples so far which I'm using for testing the interface. Both use the lazy `drain()` method to write records all at once, rather than writing data during `load_record()`.
How to test during the preview
The easiest way to test is to run the following from your own Target repo:
- Run `poetry remove singer-sdk` within your existing SDK-based repo, or simply comment out the `singer-sdk` line in your repo's `pyproject.toml`.
- According to your preference:
  - Depend on this branch (`feature/target-base`): `poetry add git+https://gitlab.com/meltano/singer-sdk.git#feature/target-base`
  - Clone and make a local dev dependency: `poetry add --dev ../singer-sdk/` (assumes `singer-sdk` is a sibling directory).