Discussion: Reconsider serialization method for partition-based state
When a tap contains multiple partitions, the state tracking for each partition is stored in a partitions list inside the stream's bookmark. This recently broke, and the partitions entry now appears to be created as a map instead of a list. However, there are advantages to storing this as a map, so I'm opening this issue to discuss the pros and cons and to decide whether we should move permanently to a map-based implementation instead of a list-based one.
Consideration 1: Scalability and random-access
Merging, deduping, and traversing partitions is slower when they are stored as a list rather than a map. Implementations will likely need to convert the list to a map in order to access partition states efficiently in memory, and that map then has to be serialized back into list format every time a STATE message is emitted. This continual conversion of the partition states between list and map formats has a performance cost.
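To make the cost concrete, here is a rough sketch of the round trip a list-based implementation would go through. The bookmark layout (a "context" dict plus a "replication_key_value" per partition) and all function names are assumptions for illustration, not the SDK's actual code.

```python
# Hypothetical list-based bookmark; the field names are assumed for illustration.
list_state = {
    "partitions": [
        {"context": {"region": "west", "city": "Seattle"}, "replication_key_value": "2022-01-01"},
        {"context": {"region": "east", "city": "Boston"}, "replication_key_value": "2022-01-02"},
    ]
}


def find_in_list(state, context):
    """Linear scan: every lookup walks the whole partitions list."""
    for entry in state["partitions"]:
        if entry["context"] == context:
            return entry
    return None


def context_key(context):
    """In-memory key: a hashable tuple of the context's sorted items."""
    return tuple(sorted(context.items()))


def to_map(state):
    """Index the partitions by context so lookups are constant time."""
    return {context_key(e["context"]): e for e in state["partitions"]}


def to_list(partition_map):
    """Convert back to the list layout before emitting a STATE message."""
    return {"partitions": list(partition_map.values())}


# Direct lookup on the list is O(n) per access.
seattle = find_in_list(list_state, {"city": "Seattle", "region": "west"})

# The map gives O(1) lookups, but both conversions are O(n) and would
# have to run around every STATE emission.
partition_map = to_map(list_state)
boston = partition_map[context_key({"region": "east", "city": "Boston"})]
round_tripped = to_list(partition_map)
```

The tuple keys work in memory but cannot be written to JSON as object keys, which is part of why the serialized form stays a list in this sketch.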
Consideration 2: Interop with Meltano and other Singer orchestrators
The default state merge() behavior in Meltano requires that states be traversable as nested maps. More discussion here:
Related to Singer "Composable states" working group item here: https://github.com/MeltanoLabs/Singer-Working-Group/issues/6#issuecomment-1016850184
Consideration 3: Need for deterministic string keys for each partition
One challenge with a map-based implementation is that we'd need to be able to create a deterministic string key for each partition. Currently, by contrast, the state index can be its own dictionary with any number of keys and any data types.
For example, a partition index could be {'region': 'west', 'city': 'Seattle'} or {'project': 'meltano/meltano', 'issue': 12345}.
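One possible way to get a deterministic key, sketched below, is to serialize the partition index as canonical JSON with sorted keys. This is only one option, and partition_key is a hypothetical helper name, not something that exists in the SDK today. It assumes every value in the index is JSON-serializable.

```python
import json


def partition_key(context: dict) -> str:
    """Build a deterministic string key from a partition index.

    Hypothetical helper: sorts the keys and uses compact separators so
    that two equal dicts always serialize to the same string.
    """
    return json.dumps(context, sort_keys=True, separators=(",", ":"))


# Key order in the source dict does not change the result.
key_a = partition_key({"region": "west", "city": "Seattle"})
key_b = partition_key({"city": "Seattle", "region": "west"})

# Non-string values such as integers keep their type in the key.
key_issue = partition_key({"project": "meltano/meltano", "issue": 12345})

# The original index can be recovered from the key.
recovered = json.loads(key_issue)
```

A side benefit of this approach is that the key is reversible, so the partition index would not need to be stored a second time inside the state entry. The trade-off is that the keys are less readable than a hand-built string.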
Consideration 4: Backwards compatibility
If we want to change the schema, we can implement a conversion process in the default implementation of load_state() here: https://gitlab.com/meltano/sdk/blob/main/singer_sdk/tap_base.py#L274
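As a rough idea of what that conversion could look like, the sketch below upgrades a single stream bookmark from the list layout to a map layout. It is a standalone function rather than a patch to load_state(); upgrade_bookmark, the "context" field name, and the JSON-based key are all assumptions made for illustration.

```python
import json


def upgrade_bookmark(bookmark: dict) -> dict:
    """Return a copy of the bookmark with list-style partitions as a map.

    Hypothetical helper. Bookmarks that have no partitions, or that are
    already map-based, are returned unchanged.
    """
    partitions = bookmark.get("partitions")
    if not isinstance(partitions, list):
        return bookmark

    upgraded = dict(bookmark)  # shallow copy; the input is not mutated
    upgraded["partitions"] = {
        # Assumed key scheme: the partition context as sorted-key JSON.
        json.dumps(entry["context"], sort_keys=True): entry
        for entry in partitions
    }
    return upgraded


old_bookmark = {
    "partitions": [
        {"context": {"region": "west"}, "replication_key_value": 5},
    ]
}
new_bookmark = upgrade_bookmark(old_bookmark)
```

Because the function is a no-op on map-based input, it could run on every load without needing a separate schema version flag, although an explicit version marker may still be preferable.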