Training gets stuck in DistributedDataParallel mode, but DataParallel mode works

Created by: wuqi930907

Hi, I built a Docker image from the Dockerfile, but my training gets stuck when launched with: "python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 32 --data test.yaml --weights pretrained_model/yolov5l.pt". Although the terminal log stops updating, the training processes still exist. (My log and the process list were attached as screenshots.)

Then I changed the command to "python train.py --batch-size 32 --data test.yaml --weights pretrained_model/yolov5m.pt --device 0,1", and everything runs normally.
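For DDP hangs like this, a common first diagnostic step is to enable NCCL's debug logging before relaunching; `NCCL_DEBUG` is a standard NCCL environment variable, and the launch command below is just the one from this report (this is a diagnostic sketch, not a fix):

```shell
# NCCL_DEBUG=INFO prints communicator setup and transport details,
# which often reveals why multi-GPU DDP training stalls
# (e.g. processes waiting on each other during NCCL init).
NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node 2 \
    train.py --batch-size 32 --data test.yaml \
    --weights pretrained_model/yolov5l.pt
```

If the log shows the processes hanging during NCCL initialization, it usually points to an inter-GPU communication problem (e.g. peer-to-peer access inside the container) rather than a bug in the training script itself.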