Missing tests for non-square kernel transpose

Describe the feature you would like to be implemented.

Add ptranspose tests for non-square (Nx4) kernels.

Would such a feature be useful for other users? Why?

Any hints on how to implement the requested feature?

Additional resources

Currently ptranspose is tested only with this piece of code:

  internal::PacketBlock<Packet> kernel;
  for (int i = 0; i < PacketSize; ++i) {
    kernel.packet[i] = internal::pload<Packet>(data1 + i * PacketSize);
  }
  ptranspose(kernel);
....

This code creates only square kernels (PacketSize packets of PacketSize elements each), so the specializations for Nx4 kernels, which are present in multiple architectures, are not tested. This can be a problem when testing a new architecture with a wrong implementation of a non-square kernel transpose: the packetmath tests will pass but other tests, for example product_small, will fail, which can be confusing. I volunteer to add this test.
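
To illustrate the gap, here is a minimal sketch that does not use Eigen, showing how a square-only check can miss a broken non-square transpose. ToyPacket, ToyBlock, reference_transpose, square_only_transpose and check_transpose are hypothetical names made up for this example, not Eigen types or functions. The sketch assumes that a block of 4 packets is treated as a 4 x PacketSize row-major matrix whose transpose is written back contiguously into the same 4 packets. That assumption should be checked against the existing Nx4 specializations before turning this into a real test.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for a SIMD packet with PacketSize float lanes.
template <std::size_t PacketSize>
using ToyPacket = std::array<float, PacketSize>;

// Hypothetical stand-in for internal::PacketBlock<Packet, N>: N packets.
template <std::size_t PacketSize, std::size_t N>
struct ToyBlock {
  ToyPacket<PacketSize> packet[N];
};

// Scalar reference. ASSUMPTION: the N packets form an N x PacketSize
// row-major matrix, and the transposed PacketSize x N matrix is written
// back contiguously into the same N packets.
template <std::size_t PacketSize, std::size_t N>
void reference_transpose(ToyBlock<PacketSize, N>& kernel) {
  std::array<float, N * PacketSize> flat{};
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < PacketSize; ++j)
      flat[j * N + i] = kernel.packet[i][j];  // element (i, j) moves to (j, i)
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < PacketSize; ++j)
      kernel.packet[i][j] = flat[i * PacketSize + j];
}

// Deliberately incomplete implementation: transposes only the leading
// N x N sub-block in place, which is correct only when N == PacketSize.
template <std::size_t PacketSize, std::size_t N>
void square_only_transpose(ToyBlock<PacketSize, N>& kernel) {
  static_assert(N <= PacketSize, "sketch expects N <= PacketSize");
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = i + 1; j < N; ++j)
      std::swap(kernel.packet[i][j], kernel.packet[j][i]);
}

// Loads N packets from distinct values, runs the transpose under test and
// compares every lane with the value expected under the assumption above.
template <std::size_t PacketSize, std::size_t N, typename TransposeFn>
bool check_transpose(TransposeFn transpose) {
  std::array<float, N * PacketSize> data1{};
  for (std::size_t k = 0; k < N * PacketSize; ++k)
    data1[k] = static_cast<float>(k);

  ToyBlock<PacketSize, N> kernel;
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < PacketSize; ++j)
      kernel.packet[i][j] = data1[i * PacketSize + j];

  transpose(kernel);

  for (std::size_t i = 0; i < N; ++i) {
    for (std::size_t j = 0; j < PacketSize; ++j) {
      const std::size_t k = i * PacketSize + j;  // flat output index
      // Output element k sits at row k / N, column k % N of the
      // PacketSize x N result, so it must equal input(k % N, k / N).
      if (kernel.packet[i][j] != data1[(k % N) * PacketSize + k / N])
        return false;
    }
  }
  return true;
}

// Usage: the square-only code passes the 4x4 check and fails the 8x4 one.
inline void run_transpose_checks() {
  assert((check_transpose<4, 4>(reference_transpose<4, 4>)));
  assert((check_transpose<8, 4>(reference_transpose<8, 4>)));
  assert((check_transpose<4, 4>(square_only_transpose<4, 4>)));
  assert((!check_transpose<8, 4>(square_only_transpose<8, 4>)));
}
```

In the real packetmath test, the same idea would load 4 packets into a PacketBlock with 4 entries and compare the result of ptranspose against scalar reference data, in addition to the existing square check.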