GPU memory overconsumption due to the tiny_buffer structure
- Description of the problem:
The use of the tiny_buffer structure leads to significant GPU memory overconsumption in at least a few PW test cases that could otherwise run on fewer GPUs. I do not know how to fix the problem properly, but I thought it would be worth starting a discussion with the elements I have found.
For example, in addusdens_gpu.f90, replacing dev_buf%lock_buffer(aux2_d, ...) with a regular allocate(aux2_d) makes the Ta2O5 test case consume 10 GB less GPU memory per k-point, with no visible performance overhead from the allocation.
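For concreteness, here is a stripped-down sketch of the kind of change I mean (the array rank, the extents ngm/nij, and the surrounding subroutine are placeholders I made up, not the actual code in addusdens_gpu.f90):

```fortran
! Hypothetical sketch, not the real addusdens_gpu code.
SUBROUTINE addusdens_sketch(ngm, nij)
  USE cudafor
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.d0)   ! stands in for QE's kinds module
  INTEGER, INTENT(IN) :: ngm, nij
  COMPLEX(DP), ALLOCATABLE, DEVICE :: aux2_d(:,:)

  ! Before: the array was carved out of the deviceXlib buffer pool, which
  ! stays grown to its high-water mark after the routine returns:
  !   CALL dev_buf%lock_buffer(aux2_d, [ngm, nij], ierr)
  !   ... work on aux2_d ...
  !   CALL dev_buf%release_buffer(aux2_d, ierr)

  ! After: a plain device allocation, returned to the driver on exit.
  ALLOCATE(aux2_d(ngm, nij))
  ! ... work on aux2_d ...
  DEALLOCATE(aux2_d)
END SUBROUTINE addusdens_sketch
```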
My understanding is that properly fixing this issue would involve changes in deviceXlib to handle fragmentation, but as a temporary workaround I suggest considering replacing lock_buffer just for this specific array, since it seems to always be very large and is deallocated immediately after the routine.
- Reproduction steps:
With release qe-gpu-6.5a1, I ran the CsI test case on a V100 16 GB and monitored the device memory used, with and without the code change. The test case requires 4 GPUs to run, but with just this fix it fits on only 3 GPUs.
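If anyone wants to reproduce the measurement from inside the code rather than with an external monitor, one way is CUDA Fortran's cudaMemGetInfo; the helper below is purely illustrative and not part of QE:

```fortran
! Illustrative helper (not part of QE): print current device memory usage.
SUBROUTINE report_gpu_mem(tag)
  USE cudafor
  IMPLICIT NONE
  CHARACTER(LEN=*), INTENT(IN) :: tag
  INTEGER(KIND=cuda_count_kind) :: free_b, total_b
  INTEGER :: istat
  istat = cudaMemGetInfo(free_b, total_b)
  IF (istat == cudaSuccess) THEN
     PRINT '(A,A,F7.2,A,F7.2,A)', tag, ': ', &
        REAL(total_b - free_b)/1024.0**3, ' GB used of ', &
        REAL(total_b)/1024.0**3, ' GB'
  END IF
END SUBROUTINE report_gpu_mem
```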