GPU memory overconsumption due to the tiny_buffer structure
- Description of the problem:
The use of the tiny_buffer structure leads to significant GPU memory overconsumption in at least a few PW test cases that could otherwise run on fewer GPUs. I do not know how to fix the problem properly, but I thought it would be worth starting a discussion with the elements I have found.
For example, in addusdens_gpu.f90, replacing dev_buf%lock_buffer(aux2_d, ...) with a regular allocate(aux2_d) makes the Ta2O5 test case consume 10 GB less GPU memory per k-point, with no visible performance overhead from the allocation.
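For concreteness, here is a stripped-down sketch of the kind of change I mean (the array rank, the extents ngm/nij, and the surrounding subroutine are placeholders I made up, not the actual code in addusdens_gpu.f90):

```fortran
! Hypothetical sketch, not the real addusdens_gpu code.
SUBROUTINE addusdens_sketch(ngm, nij)
  USE cudafor
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.d0)   ! stands in for QE's kinds module
  INTEGER, INTENT(IN) :: ngm, nij
  COMPLEX(DP), ALLOCATABLE, DEVICE :: aux2_d(:,:)

  ! Before: the array was carved out of the deviceXlib buffer pool, which
  ! stays grown to its high-water mark after the routine returns:
  !   CALL dev_buf%lock_buffer(aux2_d, [ngm, nij], ierr)
  !   ... work on aux2_d ...
  !   CALL dev_buf%release_buffer(aux2_d, ierr)

  ! After: a plain device allocation, returned to the driver on exit.
  ALLOCATE(aux2_d(ngm, nij))
  ! ... work on aux2_d ...
  DEALLOCATE(aux2_d)
END SUBROUTINE addusdens_sketch
```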
My understanding is that properly fixing this issue would involve changes in deviceXlib to handle fragmentation, but as a temporary workaround I suggest considering replacing lock_buffer just for this specific array, since it seems to always be very large and is deallocated immediately after the routine.
- Reproduction steps:
With release qe-gpu-6.5a1, I ran the CsI test case on a V100 16 GB and monitored the device memory used, with and without the code change. The test case requires 4 GPUs to run, but with just this fix it fits on only 3 GPUs.
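If anyone wants to reproduce the measurement from inside the code rather than with an external monitor, one way is CUDA Fortran's cudaMemGetInfo; the helper below is purely illustrative and not part of QE:

```fortran
! Illustrative helper (not part of QE): print current device memory usage.
SUBROUTINE report_gpu_mem(tag)
  USE cudafor
  IMPLICIT NONE
  CHARACTER(LEN=*), INTENT(IN) :: tag
  INTEGER(KIND=cuda_count_kind) :: free_b, total_b
  INTEGER :: istat
  istat = cudaMemGetInfo(free_b, total_b)
  IF (istat == cudaSuccess) THEN
     PRINT '(A,A,F7.2,A,F7.2,A)', tag, ': ', &
        REAL(total_b - free_b)/1024.0**3, ' GB used of ', &
        REAL(total_b)/1024.0**3, ' GB'
  END IF
END SUBROUTINE report_gpu_mem
```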