Cross-attention: should it be between the whole sequence and the smaller sequence?
In your code, you split the sequence into a prefix and a smaller window and compute the cross-attention with respect to it...
However, in the diagram of the method, the whole sequence is used for the keys (K) and values (V)... Could you kindly confirm which one is intended?
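To make sure I am asking about the right thing, here is a minimal sketch of the two variants I have in mind, using plain single-head scaled dot-product attention in PyTorch; the names `full_seq`, `window`, and `prefix` and the shapes are mine for illustration, not taken from your code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, L, W, D = 2, 128, 16, 64            # batch, full seq length, window length, dim
full_seq = torch.randn(B, L, D)        # the whole sequence
window   = full_seq[:, -W:, :]         # the smaller window (here: the last W tokens)
prefix   = full_seq[:, :-W, :]         # the prefix (everything before the window)

q = window                             # queries come from the smaller window

# Variant A: K/V from the prefix only (how I read the code)
attn_prefix = F.scaled_dot_product_attention(q, prefix, prefix)

# Variant B: K/V from the whole sequence (how I read the diagram)
attn_full = F.scaled_dot_product_attention(q, full_seq, full_seq)

print(attn_prefix.shape, attn_full.shape)  # both are (B, W, D)
```

Is variant A or variant B the intended behavior?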
Thank you!