[New CUPTI] Main
Created by: Jokeren
Test
Pass -ck HPCRUN_CUDA_NEW_CUPTI=TRUE to hpcrun to enable fast unwinding and range profiling.
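As a rough illustration of how the control knobs combine on one command line, the sketch below assembles an hpcrun invocation from an event string and a set of -ck settings. The helper name hpcrun_command and the application path ./app are hypothetical; only the -e and -ck flags and the knob names come from this page.

```python
import shlex


def hpcrun_command(events, knobs, app_argv):
    """Build an hpcrun argv list: events, then one -ck per knob, then the app."""
    cmd = ["hpcrun", "-e", events]
    for name, value in knobs.items():
        cmd += ["-ck", f"{name}={value}"]
    return cmd + list(app_argv)


cmd = hpcrun_command(
    "gpu=nvidia,pc",
    {"HPCRUN_CUDA_NEW_CUPTI": "TRUE", "HPCRUN_CUDA_RANGE_MODE": "TRIE"},
    ["./app"],  # hypothetical application binary, not taken from this page
)
print(shlex.join(cmd))
```

Keeping HPCRUN_CUDA_NEW_CUPTI first in the knob dictionary makes it harder to forget, since the other knobs shown on this page depend on it.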
Regression
PeleC/PMF stop_time=100, max_time=200, 6.3s
- hpcrun -e gpu=nvidia: 9.0s
- hpcrun -e gpu=nvidia,pc: 104s
New CUPTI
PeleC
The following commands should generate databases of the same quality as those produced without HPCRUN_CUDA_NEW_CUPTI=TRUE, except that coarse-grained metrics (e.g., GXCOPY) are not available in the fine-grained profiling mode (i.e., gpu=nvidia,pc).
- hpcrun -e gpu=nvidia -ck HPCRUN_CUDA_NEW_CUPTI=TRUE: 8.5s
- hpcrun -e gpu=nvidia,pc -ck HPCRUN_CUDA_NEW_CUPTI=TRUE: 86s
The following commands use range profiling or fast unwinding to reduce overhead.
- hpcrun -e gpu=nvidia -ck HPCRUN_CUDA_NEW_CUPTI=TRUE -ck HPCRUN_CUDA_FAST_UNWIND=TRUE: 7.5s
- hpcrun -e gpu=nvidia,pc -ck HPCRUN_CUDA_NEW_CUPTI=TRUE -ck HPCRUN_CUDA_RANGE_MODE=CONTEXT_SENSITIVE: 44s
- hpcrun -e gpu=nvidia,pc -ck HPCRUN_CUDA_NEW_CUPTI=TRUE -ck HPCRUN_CUDA_RANGE_MODE=TRIE: 44s
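To compare the configurations, the timings can be turned into slowdown factors. The sketch below assumes that the 6.3s figure listed for PeleC/PMF is the run time without hpcrun; the page does not say so explicitly. The labels are shorthand for the commands above, and the helper name slowdown is illustrative.

```python
# Wall-clock times in seconds, copied from the measurements on this page.
BASELINE = 6.3  # assumption: PeleC/PMF run time without hpcrun

timings = {
    "gpu=nvidia": 9.0,
    "gpu=nvidia,pc": 104.0,
    "gpu=nvidia + NEW_CUPTI": 8.5,
    "gpu=nvidia,pc + NEW_CUPTI": 86.0,
    "gpu=nvidia + NEW_CUPTI + FAST_UNWIND": 7.5,
    "gpu=nvidia,pc + NEW_CUPTI + RANGE_MODE=CONTEXT_SENSITIVE": 44.0,
    "gpu=nvidia,pc + NEW_CUPTI + RANGE_MODE=TRIE": 44.0,
}


def slowdown(seconds, baseline=BASELINE):
    """Return the run time as a multiple of the baseline run time."""
    return seconds / baseline


for label, seconds in timings.items():
    print(f"{label}: {seconds:.1f}s, {slowdown(seconds):.2f}x")
```

Under that assumption, range profiling brings the fine-grained run from roughly 16.5x down to roughly 7.0x of the baseline, and fast unwinding brings the coarse-grained run from about 1.4x to about 1.2x.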
For unknown reasons, -ck CUDA_SYNC_YIELD=TRUE does not have any effect on PeleC. We had to modify the PeleC source code to change the synchronization policy.
Limitations
- Coarse-grained and fine-grained metrics are profiled separately. Because kernel metrics such as register usage are no longer available in the fine-grained profiling mode, we cannot derive SM efficiency and occupancy metrics.
- Profiling an application that launches no GPU kernels hangs in the range profiling mode.
- [CUPTI Bug] CUPTI PC Sampling API slows down cudaDeviceSynchronize().
- [CUPTI Bug] CUPTI PC Sampling records overflow after profiling a range with many big kernels.
- [CUPTI Bug] CUPTI incurs high profiling overhead (i.e., 1.5x) for Laghos.
- [CUPTI Limitation] CUPTI PC Sampling API does not support profiling multiple contexts using the continuous (i.e., range) mode.
Edited by Jonathon Anderson