Discussion: Reconsider serialization method for partition-based state

When a tap contains multiple partitions, the state for each partition is tracked in a partitions list inside the stream's bookmark. This recently broke: the partitions entry now appears to be created as a map instead of a list. However, there are advantages to storing it as a map, so I'm opening this issue to discuss the pros and cons and to decide whether we should move permanently to a map-based implementation instead of a list-based one.

Consideration 1: Scalability and random access

Merging, deduplicating, and traversing partitions is slower when they are stored as a list. Implementations will likely need to convert the list to a map in order to access partition states efficiently in memory, and then serialize it back into list format every time a STATE message is emitted. This repeated conversion of the partition states between list and map formats has a performance cost.
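
To make the access-pattern difference concrete, here is a minimal sketch. The field names ("context", "replication_key_value") and the helper find_in_list() are assumptions for illustration, not necessarily what the SDK uses:

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical list-based layout: every entry carries its own context dict.
partitions_list: List[Dict[str, Any]] = [
    {"context": {"region": "west", "city": "Seattle"}, "replication_key_value": "2022-01-01"},
    {"context": {"region": "east", "city": "Boston"}, "replication_key_value": "2022-01-05"},
]


def find_in_list(partitions: List[Dict[str, Any]], context: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Linear scan: every lookup may touch every partition."""
    for partition in partitions:
        if partition["context"] == context:
            return partition
    return None


# Hypothetical map-based layout: one dict lookup per access, no scan.
# The key here is the context serialized as sorted JSON (an assumption).
partitions_map: Dict[str, Dict[str, Any]] = {
    json.dumps(p["context"], sort_keys=True): p for p in partitions_list
}

boston = {"city": "Boston", "region": "east"}
from_list = find_in_list(partitions_list, boston)
from_map = partitions_map[json.dumps(boston, sort_keys=True)]
```

With a list, a tap that updates N partitions one at a time does on the order of N² comparisons unless it first builds an index like partitions_map, which is exactly the conversion step described above.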

Consideration 2: Interop with Meltano and other Singer orchestrators

The default state merge() behavior in Meltano requires that states be traversable as nested maps. More discussion here:

Related to Singer "Composable states" working group item here: https://github.com/MeltanoLabs/Singer-Working-Group/issues/6#issuecomment-1016850184
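
As a rough illustration of why nested maps matter for merging, the sketch below uses a generic recursive dict merge. merge_states() is a hypothetical stand-in, not Meltano's actual merge() code, and the state shapes are simplified assumptions:

```python
from typing import Any, Dict


def merge_states(base: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge `update` into a copy of `base`.

    Nested dicts are merged key by key; any other value (including a list)
    is replaced wholesale by the value from `update`.
    """
    merged = dict(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_states(merged[key], value)
        else:
            merged[key] = value
    return merged


# Map-based partitions: an update to partition "b" leaves "a" intact.
old_map = {"bookmarks": {"issues": {"partitions": {"a": {"value": 1}, "b": {"value": 2}}}}}
new_map = {"bookmarks": {"issues": {"partitions": {"b": {"value": 3}}}}}
merged_map = merge_states(old_map, new_map)

# List-based partitions: the whole list is replaced, so partition "a" is lost.
old_list = {"bookmarks": {"issues": {"partitions": [
    {"context": {"id": "a"}, "value": 1},
    {"context": {"id": "b"}, "value": 2},
]}}}
new_list = {"bookmarks": {"issues": {"partitions": [
    {"context": {"id": "b"}, "value": 3},
]}}}
merged_list = merge_states(old_list, new_list)
```

Under this kind of merge, a partial STATE message composes cleanly only when partitions are keyed in a map.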

Consideration 3: Need for deterministic string keys for each partition

One challenge with a map-based implementation is that we'd need to be able to create a deterministic string key for each partition. In the current list-based format, by contrast, the partition index can be its own dictionary with any number of keys and any data types.

For example, a partition index could be {'region': 'west', 'city': 'Seattle'} or {'project': 'meltano/meltano', 'issue': 12345}.
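
One possible way to derive such a key, offered only as a sketch, is to serialize the partition index as JSON with sorted keys. partition_key() is a hypothetical helper name, and this assumes every value in the index is JSON-serializable:

```python
import json
from typing import Any, Dict


def partition_key(context: Dict[str, Any]) -> str:
    """Build a deterministic string key from a partition index.

    Sorting keys makes the result independent of dict insertion order;
    compact separators keep the key short.
    """
    return json.dumps(context, sort_keys=True, separators=(",", ":"))


key_a = partition_key({"region": "west", "city": "Seattle"})
key_b = partition_key({"city": "Seattle", "region": "west"})
key_c = partition_key({"project": "meltano/meltano", "issue": 12345})

print(key_a)  # {"city":"Seattle","region":"west"}
print(key_c)  # {"issue":12345,"project":"meltano/meltano"}
```

A JSON-based key preserves data types (12345 and "12345" yield different keys) and can be turned back into the original index with json.loads(). The tradeoff is that the keys are less readable than a hand-built string such as "west/Seattle".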

Consideration 4: Backwards compatibility

If we want to change the schema, we can implement a conversion process in the default implementation of load_state() here: https://gitlab.com/meltano/sdk/blob/main/singer_sdk/tap_base.py#L274
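
A conversion step along these lines might look like the following sketch. upgrade_stream_state() is a hypothetical helper, not existing SDK code, and it assumes each list entry stores its partition index under a "context" field:

```python
import json
from typing import Any, Dict


def upgrade_stream_state(stream_state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a stream's bookmark with list-based partitions
    converted to a map. State that is already map-based, or that has no
    partitions, is returned unchanged.
    """
    partitions = stream_state.get("partitions")
    if not isinstance(partitions, list):
        return stream_state

    upgraded = dict(stream_state)
    # Key each entry by its context serialized as sorted JSON (an assumption);
    # the entry itself is kept whole so no information is lost.
    upgraded["partitions"] = {
        json.dumps(entry["context"], sort_keys=True): entry
        for entry in partitions
    }
    return upgraded


legacy = {"partitions": [
    {"context": {"region": "west", "city": "Seattle"}, "replication_key_value": "2022-01-01"},
]}
upgraded = upgrade_stream_state(legacy)
```

Because the function leaves map-based state untouched, it could be run on every load without needing a schema version flag.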

Edited Jan 20, 2022 by AJ Steers