HiPress performance atop PyTorch is poor
I tried running HiPress atop PyTorch to measure the performance improvement on VGG16, but the results are not as expected.
The hardware setup is V100 GPUs with NVLink; each machine has eight GPUs, and the inter-node network bandwidth is 25 Gbps. The test script is run_vgg19.sh, with vgg19 replaced by vgg16.
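For clarity, the only change I made is the model-name substitution, along these lines (the command inside the stand-in script below is a placeholder, not the real contents of run_vgg19.sh):

```shell
# Create a stand-in copy of the launch script (placeholder contents),
# then substitute the model name -- the only edit I made.
echo 'python pytorch_imagenet.py vgg19' > run_vgg19_standin.sh
sed 's/vgg19/vgg16/g' run_vgg19_standin.sh > run_vgg16.sh
cat run_vgg16.sh
```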
The training performance is as follows.
| GPUs | FP32 | graddrop | terngrad | tbq |
|---|---|---|---|---|
| 1 | 230 img/s | | | |
| 8 | 1560 img/s | 1568 img/s | 1141 img/s | 1149 img/s |
| 16 | 1149 img/s | 1450 img/s | 804 img/s | 1240 img/s |
There are two issues here:
- With 8 GPUs, since HiPress aggregates the raw gradients locally before any inter-node communication, all compression algorithms should match FP32 performance. However, terngrad and tbq train much more slowly.
- With 16 GPUs, terngrad performs even worse than FP32.
I followed the steps in the README to install HiPress and torch-hipress-extension, and I wonder whether I missed any important step for the optimization. One possibly missing step is Step4: Generate compressing plan with SeCoPa, but pytorch_imagenet.py has no comprplan argument, and the instructions say nothing about how to run the examples atop PyTorch.
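To double-check, I searched the example script for any plan-related option and found nothing. A sketch of that check (the file below is a stand-in with hypothetical argparse contents, since I cannot paste the whole example here):

```shell
# Stand-in for pytorch_imagenet.py's argparse section (hypothetical
# contents); the real file likewise shows no compression-plan option.
cat > example_args.py <<'EOF'
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--batch-size', type=int, default=32)
parser.add_argument('--epochs', type=int, default=90)
EOF
grep -i 'comprplan' example_args.py || echo 'no comprplan option found'
```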
Looking forward to your reply!