Intel oneAPI and LevelZero: exception thrown at queue destruction
Summary
When DeviceStream
gets destroyed, the sycl::queue
destructor sometimes throws an exception:
terminate called after throwing an instance of 'cl::sycl::runtime_error'
what(): Native API failed. Native API returns: -999 (Unknown OpenCL error code) -999 (Unknown OpenCL error code)
It happens at queue destruction only, so should not affect simulations. It affects tests (where queues are created and destroyed often), and the very end of the mdrun
.
- Happens with oneAPI 2022.0.1 and open-source IntelLLVM 02b730102d7911ea466a2df6d8837971140d430f. Does not happen with oneAPI 2021.4.0
- Happens with SYCL_DEVICE_FILTER=level_zero:gpu, does not happen with OpenCL (the default).
- Setting
SYCL_PI_LEVEL_ZERO_BATCH_SIZE=1
seems to resolve the issue. Setting higher values (e.g., 8) reduces the probability of it happening. - Oversubscribing GPU with high
-ntmpi
value increases the likelihood of the problem. - Happens on multiple devices (Gen9 and Gen11 iGPU, XeMax dGPU).
- Sometimes, the program deadlocks instead of throwing an exception, with stacktrace deep in
libze_intel_gpu.so
). Rarely, it happens even if the L0 device was not actively used, being poked during GPU enumeration is enough (which always happens during the program start).- Setting
SYCL_DEVICE_FILTER=opencl:gpu
hides the L0 backend, and resolves the issue.
- Setting
- Sometimes, the program fails with
Abort was called at 178 line in file: ../neo/opencl/source/os_interface/linux/drm_command_stream.inl
, from withinpiQueueRelease
, instead. - Adding
stream_.wait()
toDeviceStream::~DeviceStream
leads either to a deadlock (onpiQueueRelease
), or to the same exception being thrown fromcl::sycl::detail::queue_impl::wait
instead of the destructor. - Happens with libze_intel_gpu.so.1.2.22192, libze_intel_gpu.so.1.2.21786, libze_intel_gpu.so.1.1.20609.
Exact steps to reproduce
- Configure GROMACS with oneAPI 2022.0.1:
cmake ../.. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGMX_GPU=SYCL -DGMX_GPU_NB_CLUSTER_SIZE=4 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGMX_FFT_LIBRARY=mkl -DREGRESSIONTEST_DOWNLOAD=ON
- Run
SYCL_CACHE_PERSISTENT=1 SYCL_DEVICE_FILTER=level_zero:gpu make check
-
SYCL_CACHE_PERSISTENT
is not necessary for this bug to reproduce, it's here to speed-up the tests, see #4218.
-
Possible fixes
We can force setting SYCL_PI_LEVEL_ZERO_BATCH_SIZE=1
, or at least warn the user?