Pair-wise tensor contraction design issue

I would like to clarify the expected behavior of pair-wise tensor contraction. This is related to #201 (closed) and #202 (closed).

Consider two tensors with row indices r1, r2 and column indices c1, c2:

A[ar1, ar2; ac1, ac2]
B[br1, br2; bc1, bc2]

Suppose we contract ar2 with bc1 and ac1 with br2. We first permute A and B so that the contracted indices (X, Y) sit in the columns of A and the rows of B:

A[ar1, ac2; X, Y]
B[Y, X; br1, bc2] (or B[X, Y; br1, bc2]?)

Then we perform a matrix-like tensor trace, which results in

AB[ar1, ac2; br1, bc2]

Then we permute to the natural order:

AB[ar1, br1; ac2, bc2]
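
To make the label bookkeeping above concrete, here is a minimal sketch in plain Python. It only tracks labels, not data, and it is not uni10 code: the function name `contract_labels` and its arguments are hypothetical. It also assumes the contracted labels appear in the same order in A's columns and B's rows (the B[X, Y; ...] variant), which is what a matrix product on row-major flattened indices would need.

```python
def contract_labels(a_rows, a_cols, b_rows, b_cols, pairs):
    """Return the label layouts at each step of a pair-wise contraction.

    Every layout is a (row_labels, col_labels) tuple. `pairs` lists the
    contracted (label of A, label of B) pairs. Hypothetical helper, for
    illustration only.
    """
    a_contracted = [a for a, _ in pairs]
    b_contracted = [b for _, b in pairs]

    # Free (uncontracted) labels keep their original relative order.
    a_free = [l for l in a_rows + a_cols if l not in a_contracted]
    b_free = [l for l in b_rows + b_cols if l not in b_contracted]

    # Step 1: contracted labels go to A's columns and B's rows.
    # Assumption: both sides list the pairs in the same order.
    a_perm = (a_free, a_contracted)
    b_perm = (b_contracted, b_free)

    # Step 2: the matrix-like product leaves A's free labels as rows
    # and B's free labels as columns.
    raw = (a_free, b_free)

    # Step 3: natural order puts labels that were rows first,
    # then labels that were columns.
    merged = a_free + b_free
    natural = ([l for l in merged if l in a_rows + b_rows],
               [l for l in merged if l in a_cols + b_cols])
    return a_perm, b_perm, raw, natural


a_perm, b_perm, raw, natural = contract_labels(
    ["ar1", "ar2"], ["ac1", "ac2"],
    ["br1", "br2"], ["bc1", "bc2"],
    pairs=[("ar2", "bc1"), ("ac1", "br2")],
)
print(raw)      # (['ar1', 'ac2'], ['br1', 'bc2'])
print(natural)  # (['ar1', 'br1'], ['ac2', 'bc2'])
```

Here `raw` is the layout right after the matrix-like trace and `natural` is the layout after the final permute.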

In uni10_v2 we did not perform the last permute (back to natural order), in order to avoid the extra memory swap. Now, if we separate the label permute from the actual memory swap, what is the expected behavior?
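
As one possible reading of "separating the label permute and the actual memory swap", the sketch below keeps the data in place and permutes only labels, shape, and strides, with an explicit call for the real reordering. The class `LazyTensor` and its methods are hypothetical and are not the uni10 API. Under this reading, the result could report labels in natural order right away and defer the memory swap until some operation needs contiguous data.

```python
from itertools import product


def row_major_strides(shape):
    """Strides of a contiguous row-major layout for the given shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.insert(0, step)
        step *= dim
    return strides


class LazyTensor:
    """Hypothetical tensor whose label permute touches metadata only."""

    def __init__(self, data, shape, labels):
        self.data = list(data)        # flat storage, row-major at creation
        self.shape = list(shape)
        self.labels = list(labels)
        self.strides = row_major_strides(self.shape)

    def permute_labels(self, new_labels):
        """Reorder labels, shape, and strides; the stored data is untouched."""
        order = [self.labels.index(label) for label in new_labels]
        self.labels = [self.labels[i] for i in order]
        self.shape = [self.shape[i] for i in order]
        self.strides = [self.strides[i] for i in order]

    def get(self, *index):
        """Read one element, addressed in the current label order."""
        return self.data[sum(i * s for i, s in zip(index, self.strides))]

    def materialize(self):
        """Actual memory swap: rewrite the data to match the label order."""
        ranges = [range(dim) for dim in self.shape]
        self.data = [self.get(*index) for index in product(*ranges)]
        self.strides = row_major_strides(self.shape)


t = LazyTensor(range(6), shape=[2, 3], labels=["r", "c"])
t.permute_labels(["c", "r"])   # cheap: no element moves
data_after_label_permute = list(t.data)
t.materialize()                # expensive: elements are reordered
data_after_memory_swap = list(t.data)
```

Element access gives the same values before and after `materialize()`; only the storage order changes, so the cost of the swap is paid only when it is requested.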