Target sink - optimization strategies for when to flush batches
When deciding when to drain or flush a collection of target sinks, priority could be given to a number of different objectives:
- Drain often for the benefit of:
    - reduced memory pressure by flushing stored records
    - reduced latency, at least for earlier-emitted record(s)
    - more frequent checkpoints, i.e. an increased frequency of emitted STATE messages
- Drain less often for the benefit of:
    - Fewer overall batches
    - Efficiency of bulk-loading records at high scale
    - Lower costs on destination platforms which may charge or meter per batch
        - For instance, Snowflake charges less when running 1 minute out of every 15 than when running intermittently for all 15 minutes.
- Other factors to consider:
    - defining 'full' for each sink
        - Each sink should report when it is full, either by writing custom `is_full()` logic or else by specifying a max record count (see the sketch after this list).
    - controlling max per-record latency
        - We may want to provide a max per-record latency threshold - for instance, prioritizing a batch to be loaded if it contains one or more records that have been in the queue for over 120 minutes.
    - draining multiple sinks when one is triggered ('toppling')
        - When one sink is being drained, we may want to opportunistically drain all others at the same time. This could have benefits for metering and platform costs.
            - For instance, in the Snowflake case it is cheaper to flush all sinks at once and spend fewer minutes of each hour running batches.
        - Draining all sinks also allows us to flush the stored state message(s).
    - memory pressure
        - If memory pressure is detected, this might force the flush of one or more streams.
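To make these factors concrete, below is a minimal sketch of how they could combine. This is not an existing SDK interface; the class, method, and parameter names (`Sink`, `is_full()`, `maybe_drain_all()`, `max_record_count`, `max_record_age_secs`) are hypothetical. It shows a sink that reports 'full' based on either a max record count or a max per-record latency, plus a 'toppling' helper that drains all sinks when any one of them triggers and then emits the stored STATE message.

```python
import json
import sys
import time


class Sink:
    """Hypothetical sink that buffers records and reports when it should be drained."""

    def __init__(self, stream_name, max_record_count=10_000, max_record_age_secs=120 * 60):
        self.stream_name = stream_name
        self.max_record_count = max_record_count        # 'full' threshold by record count
        self.max_record_age_secs = max_record_age_secs  # max per-record latency (e.g. 120 minutes)
        self._records = []
        self._oldest_record_ts = None

    def add_record(self, record):
        """Buffer a record, remembering when the oldest buffered record arrived."""
        if not self._records:
            self._oldest_record_ts = time.monotonic()
        self._records.append(record)

    def is_full(self):
        """Report 'full' when either the record-count or per-record-latency threshold is hit.

        A destination-specific sink could override this with custom logic instead.
        """
        if len(self._records) >= self.max_record_count:
            return True
        if self._oldest_record_ts is not None:
            age_secs = time.monotonic() - self._oldest_record_ts
            if age_secs >= self.max_record_age_secs:
                return True
        return False

    def drain(self):
        """Bulk-load the buffered records to the destination, then reset the buffer."""
        # ... destination-specific bulk load of self._records goes here ...
        self._records = []
        self._oldest_record_ts = None


def maybe_drain_all(sinks, latest_state):
    """'Toppling': when any one sink reports full, opportunistically drain every sink.

    Because every sink is now flushed, it is safe to emit the stored STATE message
    as a checkpoint. (Memory pressure could be another trigger for this same path.)
    """
    if any(sink.is_full() for sink in sinks):
        for sink in sinks:
            sink.drain()
        # Singer STATE message format: {"type": "STATE", "value": {...}}
        sys.stdout.write(json.dumps({"type": "STATE", "value": latest_state}) + "\n")
        sys.stdout.flush()
```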
Our strategy for this (broadly) should probably be to have at least two layers:
- The developer provides some default logic or prioritization strategy that is tuned to work well for the destination system.
- The user may optionally override these defaults at runtime using config options.
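A minimal sketch of those two layers, building on the hypothetical `Sink` class above (the `batch_max_records` config option name is likewise an assumption, not an existing setting):

```python
class SnowflakeSink(Sink):
    """Hypothetical destination-specific sink showing the two configuration layers."""

    # Layer 1: developer-provided default, tuned for the destination system.
    DEFAULT_MAX_RECORD_COUNT = 100_000

    def __init__(self, stream_name, config=None):
        config = config or {}
        # Layer 2: optional user override at runtime via a config option.
        max_record_count = config.get("batch_max_records", self.DEFAULT_MAX_RECORD_COUNT)
        super().__init__(stream_name, max_record_count=max_record_count)
```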