Optimize predux when the architecture is aarch64
What does this implement/fix?
This PR optimizes predux, predux_min and predux_max.
On aarch64, NEON provides the across-vector intrinsics v(add|min|max)v,
which perform the whole reduction in one call. Chaining many pairwise
vp(add|min|max) operations to get the same result hurts performance.
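
To illustrate the difference, here is a minimal sketch rather than the actual patch. The helper names (`reduce_add4`, `reduce_min4`, `reduce_max4`, `reduce_add4_pairwise`) are hypothetical and are not Eigen's `predux` functions. The sketch assumes a four-lane `float` packet and falls back to scalar code when not compiled for aarch64.

```cpp
#include <algorithm>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helpers for illustration only; not Eigen's internal API.

// Sum of four floats using the across-vector add available on aarch64.
inline float reduce_add4(const float* p) {
#if defined(__aarch64__)
  return vaddvq_f32(vld1q_f32(p));  // one across-vector reduction
#else
  return (p[0] + p[1]) + (p[2] + p[3]);  // scalar fallback
#endif
}

// Minimum of four floats using the across-vector min.
inline float reduce_min4(const float* p) {
#if defined(__aarch64__)
  return vminvq_f32(vld1q_f32(p));
#else
  return std::min(std::min(p[0], p[1]), std::min(p[2], p[3]));
#endif
}

// Maximum of four floats using the across-vector max.
inline float reduce_max4(const float* p) {
#if defined(__aarch64__)
  return vmaxvq_f32(vld1q_f32(p));
#else
  return std::max(std::max(p[0], p[1]), std::max(p[2], p[3]));
#endif
}

// The same sum written with pairwise operations, for comparison.
inline float reduce_add4_pairwise(const float* p) {
#if defined(__aarch64__)
  float32x4_t v = vld1q_f32(p);
  // Fold the high half onto the low half, then add the remaining pair.
  float32x2_t s = vadd_f32(vget_low_f32(v), vget_high_f32(v));
  s = vpadd_f32(s, s);
  return vget_lane_f32(s, 0);
#else
  return (p[0] + p[2]) + (p[1] + p[3]);  // same pairing order, scalar
#endif
}
```

The pairwise version needs several dependent instructions plus a lane extract, while the across-vector version is a single intrinsic call. How much faster it is depends on the core and the packet type, so any gain should be confirmed by benchmarking.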