
Target sink - optimization strategies for when to flush batches

While draining or flushing a collection of target sinks, priority could be given to a number of competing goals:

  1. Drain often for benefit of:
    1. reduced memory pressure by flushing stored records
    2. reduced latency, at least for earlier-emitted record(s)
    3. more frequent checkpoints, aka increased frequency of STATE messages emitted
  2. Drain less often for benefit of:
    1. Fewer overall batches
    2. Efficiency of bulk loading records at high scale
    3. Lower costs on destination platforms which may charge or meter per batch
      • for instance, Snowflake charges less when running for 1 minute out of every 15 than when running intermittently across all 15 minutes.
  3. Other factors to consider:
    1. defining 'full' for each sink
      • Each sink should report when it is full, either by writing custom is_full() logic or else by specifying a max record count.
    2. controlling max per-record latency
      • We may want to provide a max per-record latency threshold - for instance, prioritizing a batch to be loaded if it contains one or more records that have been queued for over 120 minutes.
    3. draining multiple sinks when one is triggered ('toppling')
      • When one sink is being drained, we may want to opportunistically drain all others at the same time. This could have benefits for metering and platform costs.
      • For instance, it is cheaper in the Snowflake case to flush all at once and have fewer minutes of each hour running batches.
      • Draining all sinks also allows us to flush the stored state message(s).
    4. memory pressure
      • If memory pressure is detected, this might force the flush of one or more streams.
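
The factors above can be sketched together in a small drain loop. This is an illustrative sketch only, not SDK API: the `Sink` class, `is_full()`, `max_record_age`, and `maybe_drain_all()` names are all hypothetical, chosen to show how a max record count, a per-record latency threshold, memory pressure, and 'toppling' could interact.

```python
import time


class Sink:
    """Hypothetical sink; all names here are illustrative, not real SDK API."""

    max_size = 10_000          # default max record count before 'full'
    max_record_age = 120 * 60  # max per-record latency threshold, in seconds

    def __init__(self, name):
        self.name = name
        self.records = []
        self.oldest_record_time = None  # monotonic timestamp of earliest queued record

    def add_record(self, record):
        if self.oldest_record_time is None:
            self.oldest_record_time = time.monotonic()
        self.records.append(record)

    def is_full(self):
        """Developers could override this with custom 'full' logic."""
        return len(self.records) >= self.max_size

    def exceeds_max_latency(self):
        """True if the oldest queued record has waited past the threshold."""
        return (
            self.oldest_record_time is not None
            and time.monotonic() - self.oldest_record_time > self.max_record_age
        )

    def drain(self):
        # Bulk-load self.records to the destination here, then reset the queue.
        self.records = []
        self.oldest_record_time = None


def maybe_drain_all(sinks, memory_pressure=False, topple=True):
    """Drain sinks that are full or past their latency threshold.

    With 'toppling' enabled, one triggered sink opportunistically drains all
    sinks, after which the stored STATE message(s) can be flushed.
    """
    triggered = memory_pressure or any(
        s.is_full() or s.exceeds_max_latency() for s in sinks
    )
    if not triggered:
        return False
    for sink in sinks:
        if topple or sink.is_full() or sink.exceeds_max_latency():
            sink.drain()
    return True  # caller can now emit a STATE message
```

With toppling on, a single full sink empties every sink, which keeps batch-metered platforms like Snowflake running in fewer, denser bursts.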

Our strategy for this (broadly) should probably be to have at least two layers:

  • The developer provides some default logic or prioritization strategy that is tuned to work well for the destination system.
  • The user may optionally have some ability to override at runtime using config options.
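A minimal sketch of that two-layer approach, assuming a plain dict of settings (the key names and the `resolve_batch_settings()` helper are hypothetical, not existing config options):

```python
# Layer 1: developer-tuned defaults for a given destination system.
# Key names are illustrative only.
DEVELOPER_DEFAULTS = {
    "batch_max_records": 100_000,     # tuned for bulk loading at scale
    "batch_max_age_seconds": 15 * 60,
    "drain_all_on_trigger": True,     # the 'toppling' behavior
}


def resolve_batch_settings(user_config):
    """Layer 2: apply user-provided runtime overrides over the defaults.

    Unknown keys are ignored so user config can't introduce settings
    the developer's sink logic doesn't understand.
    """
    settings = dict(DEVELOPER_DEFAULTS)
    settings.update(
        {k: v for k, v in user_config.items() if k in DEVELOPER_DEFAULTS}
    )
    return settings
```

The defaults encode the developer's knowledge of the destination, while the override step gives users a runtime escape hatch without requiring them to understand the full prioritization strategy.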
Edited by AJ Steers