Zorro mask logic

Created by: fisheggg

Hi, thanks for the work,

I noticed a difference between the Zorro mask logic in the code and the description in the paper. The code does the following:

        # the logic goes
        # every modality, including fusion can attend to self

        zorro_mask = token_types_attend_from == token_types_attend_to

        # fusion can attend to everything

        zorro_mask = zorro_mask | (token_types_attend_from == TokenTypes.FUSION.value)

        # and both specific modalities like audio and video can attend to fusion
        # (parentheses are needed here and above: `|` binds tighter than `==` in Python)

        zorro_mask = zorro_mask | (token_types_attend_to == TokenTypes.FUSION.value)

But in the paper, on page 4 in the section "Masked attention", the authors explain:

... our modality-specific representation does not have access to the global representation, ... Specifically, we set m_{ij}=1 if j is a part of the fusion representation, otherwise we only set m_{ij}=1 if i and j are vectors of the same modality.

This suggests that the third part of the mask, which lets the modality-specific tokens (audio and video) attend to fusion, isn't needed. I hope I understood the paper correctly.
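
To make the difference concrete, here is a minimal pure-Python sketch that builds both variants of the mask for one audio, one video and one fusion token. It is not the repository's implementation: the `TokenTypes` values and the `build_mask` helper are made up for illustration, plain lists stand in for the broadcasted tensors, and I am assuming that rows index the token attending from and columns the token attended to, as in the code above.

```python
from enum import Enum


class TokenTypes(Enum):
    # Hypothetical values, only for this illustration.
    AUDIO = 0
    VIDEO = 1
    FUSION = 2


def build_mask(token_types, allow_attend_to_fusion):
    """Return mask[i][j] == True when token i may attend to token j."""
    fusion = TokenTypes.FUSION
    mask = []
    for src in token_types:      # token attending from (row)
        row = []
        for dst in token_types:  # token attended to (column)
            # every modality, including fusion, can attend to itself
            allowed = src == dst
            # fusion can attend to everything
            allowed = allowed or src == fusion
            # third term from the code: everything can attend to fusion
            if allow_attend_to_fusion:
                allowed = allowed or dst == fusion
            row.append(allowed)
        mask.append(row)
    return mask


tokens = [TokenTypes.AUDIO, TokenTypes.VIDEO, TokenTypes.FUSION]

# Mask as built by the quoted code (all three terms).
code_mask = build_mask(tokens, allow_attend_to_fusion=True)

# Mask under my reading of the paper (third term dropped).
paper_mask = build_mask(tokens, allow_attend_to_fusion=False)

for name, mask in (("code", code_mask), ("paper", paper_mask)):
    print(name)
    for row in mask:
        print("  ", ["x" if allowed else "." for allowed in row])
```

The two masks differ only in the fusion column of the audio and video rows: with the third term those entries are allowed, and without it the modality-specific tokens see only their own modality.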

Best, Arthur