HiPress performance atop PyTorch is poor
I tried running HiPress atop PyTorch to measure the performance improvement on VGG16, but the results are not as expected.
The hardware setup is V100 GPUs with NVLink; each machine has eight GPUs, and the inter-node network bandwidth is 25 Gbps. The test script is run_vgg19.sh, with vgg19 replaced by vgg16.
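For clarity, the only change I made is the model-name substitution, along these lines (the command inside the stand-in script below is a placeholder, not the real contents of run_vgg19.sh):

```shell
# Create a stand-in copy of the launch script (placeholder contents),
# then substitute the model name -- the only edit I made.
echo 'python pytorch_imagenet.py vgg19' > run_vgg19_standin.sh
sed 's/vgg19/vgg16/g' run_vgg19_standin.sh > run_vgg16.sh
cat run_vgg16.sh
```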
The training performance is as follows.
| GPUs | FP32 | graddrop | terngrad | tbq |
|---|---|---|---|---|
| 1 | 230 img/s | | | |
| 8 | 1560 img/s | 1568 img/s | 1141 img/s | 1149 img/s |
| 16 | 1149 img/s | 1450 img/s | 804 img/s | 1240 img/s |
There are two issues here:
- With 8 GPUs, since HiPress aggregates the raw gradients locally before any inter-node communication, all compression algorithms should match FP32 performance. However, terngrad and tbq train much more slowly.
- With 16 GPUs, terngrad performs even worse than FP32.
I followed the steps in the README to install HiPress and torch-hipress-extension, and I wonder whether I missed any important step for the optimization. One possibly missing step is Step4: Generate compressing plan with SeCoPa, but pytorch_imagenet.py has no comprplan argument, and the instructions say nothing about how to run the examples atop PyTorch.
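To double-check, I searched the example script for any plan-related option and found nothing. A sketch of that check (the file below is a stand-in with hypothetical argparse contents, since I cannot paste the whole example here):

```shell
# Stand-in for pytorch_imagenet.py's argparse section (hypothetical
# contents); the real file likewise shows no compression-plan option.
cat > example_args.py <<'EOF'
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--batch-size', type=int, default=32)
parser.add_argument('--epochs', type=int, default=90)
EOF
grep -i 'comprplan' example_args.py || echo 'no comprplan option found'
```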
Looking forward to your reply!