Zorro mask logic
Created by: fisheggg
Hi, thanks for the work,
I noticed a difference between the Zorro mask logic in the code and the description in the paper. The code goes like this:
# the logic goes
# every modality, including fusion can attend to self
zorro_mask = token_types_attend_from == token_types_attend_to
# fusion can attend to everything
zorro_mask = zorro_mask | (token_types_attend_from == TokenTypes.FUSION.value)
# and both specific modalities like audio and video can attend to fusion
zorro_mask = zorro_mask | (token_types_attend_to == TokenTypes.FUSION.value)
But in the paper, on page 4 in the section "Masked attention", the authors explain:
... our modality-specific representation does not have access to the global representation, ... Specifically, we set
m_{ij} = 1 if j is a part of the fusion representation, otherwise we only set m_{ij} = 1 if i and j are vectors of the same modality.
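This suggests that the third part of the mask, the one that lets the modality-specific tokens (audio and video) attend to the fusion tokens, isn't needed. I hope I understood the paper correctly.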
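To make the difference concrete, here is a minimal, framework-free sketch of the two variants. It is not the repository's implementation: the `TokenTypes` values, the `build_mask` helper, and the `modalities_attend_to_fusion` flag are hypothetical names used only for illustration, and it assumes the code's convention that rows are the "attend from" tokens and columns are the "attend to" tokens. Whether the paper's m_{ij} uses the same index order is exactly the open question here.

```python
from enum import Enum


class TokenTypes(Enum):
    # Assumed values for illustration; only FUSION appears in the snippet above.
    AUDIO = 0
    VIDEO = 1
    FUSION = 2


def build_mask(token_types, modalities_attend_to_fusion):
    """Return mask[i][j] == True when token i (attend from) may attend to token j (attend to)."""
    fusion = TokenTypes.FUSION
    mask = []
    for attend_from in token_types:
        row = []
        for attend_to in token_types:
            # every token type, including fusion, can attend to its own type
            allowed = attend_from == attend_to
            # fusion can attend to everything
            allowed = allowed or (attend_from == fusion)
            # the third term under discussion: audio/video attending to fusion
            if modalities_attend_to_fusion:
                allowed = allowed or (attend_to == fusion)
            row.append(allowed)
        mask.append(row)
    return mask


tokens = [TokenTypes.AUDIO, TokenTypes.VIDEO, TokenTypes.FUSION]
with_third_term = build_mask(tokens, modalities_attend_to_fusion=True)
without_third_term = build_mask(tokens, modalities_attend_to_fusion=False)

for name, mask in [("with third term", with_third_term),
                   ("without third term", without_third_term)]:
    print(name)
    for row in mask:
        print("  ", [int(x) for x in row])
```

With one token per type, the two variants differ only in the last column of the audio and video rows: with the third term those rows can see the fusion token, and without it they can only see their own modality.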
Best, Arthur