Optimize division operations in TensorVolumePatch.h
What does this implement/fix?
When PacketSize is 1, both entries of indices[] hold the same index, so dividing each of them yields the same result. In that case we can perform one division instead of two.
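A minimal sketch of the idea, not the actual code in TensorVolumePatch.h: `patch_offsets` and `output_depth` are hypothetical names chosen for illustration, and plain integer division stands in for whatever divisor type the real code uses.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not Eigen's API): returns the quotients for the first
// and last coefficient of a packet that starts at `index`.
template <std::size_t PacketSize>
std::array<std::size_t, 2> patch_offsets(std::size_t index, std::size_t output_depth) {
  static_assert(PacketSize >= 1, "a packet holds at least one coefficient");
  const std::size_t indices[2] = {index, index + PacketSize - 1};

  // The first division is always needed.
  const std::size_t first = indices[0] / output_depth;

  // With PacketSize == 1, indices[0] == indices[1], so reuse the quotient
  // instead of dividing again. PacketSize is a compile-time constant, so the
  // compiler can drop the unused branch.
  const std::size_t last = (PacketSize == 1) ? first : indices[1] / output_depth;

  return {first, last};
}
```

For PacketSize greater than 1 the behavior is unchanged: both divisions are still performed.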
Additional information
This should reduce the number of CPU cycles spent on division when PacketSize is 1.