Preview: Target SDK release

AJ Steers requested to merge feature/target-base into main

Closes #96. Partially implements #63. Includes and replaces !92.

Summary of implementation, from the docs here in this branch:

Creating targets with singer-sdk requires overriding just two classes:

  1. The Target class. This class governs configuration, validation, and stream discovery.
  2. The Sink class. This class is responsible for writing records to the target and keeping a tally of written records. Each Sink implementation may write records immediately in Sink.load_record() or in batches during Sink.drain().
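
The two overrides can be sketched with self-contained stand-in classes. These mimic the shape of the classes described above for illustration only; they are not the real singer-sdk API, and method signatures may differ in this preview branch:

```python
class Sink:
    """Stand-in for the SDK Sink: writes records and keeps a tally."""

    def __init__(self, stream_name: str) -> None:
        self.stream_name = stream_name
        self.records_written = 0
        self._queue: list[dict] = []

    def load_record(self, record: dict) -> None:
        # Option A: write each record immediately. This sketch instead
        # queues records for a lazy batch write in drain() (option B).
        self._queue.append(record)

    def drain(self) -> None:
        # Option B: flush all queued records in one batch.
        self.records_written += len(self._queue)
        self._queue.clear()


class Target:
    """Stand-in for the SDK Target: config, validation, sink discovery."""

    default_sink_class = Sink

    def __init__(self) -> None:
        self._sinks: dict[str, Sink] = {}

    def get_sink(self, stream_name: str) -> Sink:
        # Default 1:1 mapping: one active sink per incoming stream name.
        if stream_name not in self._sinks:
            self._sinks[stream_name] = self.default_sink_class(stream_name)
        return self._sinks[stream_name]
```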

This writeup describes the need to decouple the Sink and Stream classes:

While the default implementation will create one Sink per stream, this is not required. The Stream:Sink mapping behavior can be overridden in the following ways:

  • 1:1: This is the default, where only one sink is active for each incoming stream name.
    • The exception to this rule is when a new STATE message is received for an already-active stream. In this case, the existing sink will be marked to be drained and a new sink will be initialized to receive the next incoming records.
    • In the case that a sink is archived because of a superseding STATE message, all prior version(s) of the stream's sink are guaranteed to be drained in creation order.
    • Example: a database-type target where each stream will land in a dedicated table.
  • 1:many: In this scenario, the target intentionally creates multiple sinks per stream. The developer may override Target.get_sink() and use details within the record (or a randomization algorithm) to send records to multiple sinks all for the same stream.
    • Example: a data lake target where output files should be pre-partitioned according to one or more attributes within the record. Multiple smaller files, named according to their partition key values, are more efficient than fewer larger files.
  • many:1: In this scenario, the target intentionally sends all records to the same sink, regardless of the stream name. The stream name will likely be made an attribute of the final output, but records do not need to be segregated by the stream name.
    • Example: a JSON file writer where the desired output is a single combined JSON file containing all records from all streams.
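
The 1:many and many:1 modes amount to different `get_sink()` overrides. A hypothetical sketch, using minimal stand-in classes and an illustrative `partition_field` parameter (none of these names are from the actual SDK):

```python
class Sink:
    """Minimal stand-in sink that just collects records."""

    def __init__(self, key) -> None:
        self.key = key
        self.records: list[dict] = []

    def load_record(self, record: dict) -> None:
        self.records.append(record)


class PartitionedTarget:
    """1:many - one sink per (stream, partition value), as in a data
    lake target that pre-partitions output files by a record attribute."""

    def __init__(self, partition_field: str) -> None:
        self.partition_field = partition_field
        self.sinks: dict[tuple, Sink] = {}

    def get_sink(self, stream_name: str, record: dict) -> Sink:
        # Route on a value inside the record, not just the stream name.
        key = (stream_name, record[self.partition_field])
        return self.sinks.setdefault(key, Sink(key))


class CombinedTarget:
    """many:1 - every stream routes to one shared sink, as in a writer
    that emits a single combined JSON file for all streams."""

    def __init__(self) -> None:
        self.sink = Sink("combined")

    def get_sink(self, stream_name: str, record: dict) -> Sink:
        # The stream name becomes an attribute of the output record
        # rather than a reason to segregate records into separate sinks.
        record["_stream"] = stream_name
        return self.sink
```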

By default, the current implementation triggers a full drain and flush of all sink objects when a STATE message is received. This implementation prioritizes emitting the provided STATE message, and is well tuned for scenarios where only one stream is being sent at a time (the vast majority).

Lower priority: To optimize for randomized streams, we may in the future add an alternate implementation which instead prioritizes fewer drain operations, or a minimum batch size before the drain process is triggered.
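
The default STATE handling described above can be sketched as follows. This is a self-contained stand-in for the message loop, not the SDK's actual internals; the real implementation handles SCHEMA messages, validation, and more:

```python
import json


class Sink:
    """Queues records and counts them when drained."""

    def __init__(self) -> None:
        self.queued: list[dict] = []
        self.records_written = 0

    def load_record(self, record: dict) -> None:
        self.queued.append(record)

    def drain(self) -> None:
        self.records_written += len(self.queued)
        self.queued.clear()


class Target:
    """Processes Singer messages; a STATE message drains every sink."""

    def __init__(self) -> None:
        self.sinks: dict[str, Sink] = {}
        self.emitted_states: list[dict] = []

    def process_line(self, line: str) -> None:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            sink = self.sinks.setdefault(msg["stream"], Sink())
            sink.load_record(msg["record"])
        elif msg["type"] == "STATE":
            # Full drain and flush of all sinks before emitting the
            # provided STATE message, so emitted state never runs ahead
            # of durably written records.
            for sink in self.sinks.values():
                sink.drain()
            self.emitted_states.append(msg["value"])
```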

Included samples

I have two samples so far which I'm using for testing the interface. Both use the lazy drain() method to write records all at once, rather than writing data during load_record().

How to test during the preview

The easiest way to test is to run the following from your own Target repo:

  1. Run poetry remove singer-sdk within your existing SDK-based repo, or simply comment out the singer-sdk line in your repo's pyproject.toml.
  2. According to your preference:
    • Depend on this branch (feature/target-base): poetry add git+
    • Clone and make a local dev dependency: poetry add --dev ../singer-sdk/ (assumes singer-sdk is a sibling directory).