Enable NVIDIA PTX backend for SYCL and optimize nbnxmKernel performance
Enabled the NVIDIA PTX backend for DPC++ and suggested a couple of performance optimizations for nbnxmKernel (mostly for the NVIDIA backend). Performance improvements:
- RTX2070, benchMEM: kernel 3.8x, whole benchmark 10%;
- RTX2070, water/0048: kernel 3.8x, whole benchmark 80%;
- A100, benchMEM: kernel 2.1x, whole benchmark 2%;
- A100, water/0048: kernel 2x, whole benchmark 19%;
- GEN9: only minor improvements (I plan to work on it later).