Fixing "message truncated : receiving 1048576 bytes instead of 524288" during DDP training

mowen 2024-01-30 1226

transport/net_socket.cc:483 NCCL WARN NET/Socket : peer 192.168.5.168<60166> message truncated : receiving 1048576 bytes instead of 524288. If you believe your socket network is in healthy state,           there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks
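The warning itself names environment settings such as NCCL_PROTO and NCCL_ALGO as things that must match between ranks. A minimal sketch for comparing them, using only the Python standard library: `nccl_env_snapshot` and `diff_snapshots` are hypothetical helper names, and how the per-node snapshots are gathered (ssh, a shared file, etc.) is left open.

```python
import os


def nccl_env_snapshot(environ=None):
    """Return every NCCL_* variable from an environment mapping, sorted by name.

    Defaults to the current process environment (os.environ).
    """
    env = os.environ if environ is None else environ
    return {name: env[name] for name in sorted(env) if name.startswith("NCCL_")}


def diff_snapshots(snapshots):
    """Compare snapshots from several nodes.

    snapshots: {node_name: {var_name: value}}
    Returns {var_name: {node_name: value_or_None}} for every variable whose
    value is not identical on all nodes (None means "not set on that node").
    """
    all_vars = set()
    for snap in snapshots.values():
        all_vars.update(snap)

    mismatched = {}
    for var in sorted(all_vars):
        values = {node: snap.get(var) for node, snap in snapshots.items()}
        if len(set(values.values())) > 1:
            mismatched[var] = values
    return mismatched
```

Run `nccl_env_snapshot()` on each node, collect the results into one dict keyed by node name, and pass that dict to `diff_snapshots`. An empty result means the NCCL_* variables agree across nodes, so the mismatch has to be somewhere else.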

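The cause was that the host environments of the DDP training nodes were not consistent with each other.
The fix is to reinstall NCCL on every node so that all of them end up with the same installation.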

https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2193/install-guide/index.html
cuda-cccl-11-8/unknown,now 11.8.89-1 amd64 [installed,automatic]
cuda-command-line-tools-11-8/unknown,now 11.8.0-1 amd64 [installed]
cuda-compat-11-8/unknown,now 520.61.05-1 amd64 [installed]
cuda-compiler-11-8/unknown,now 11.8.0-1 amd64 [installed,automatic]
cuda-cudart-11-8/unknown,now 11.8.89-1 amd64 [installed]
cuda-cudart-dev-11-8/unknown,now 11.8.89-1 amd64 [installed]
cuda-cuobjdump-11-8/unknown,now 11.8.86-1 amd64 [installed,automatic]
cuda-cupti-11-8/unknown,now 11.8.87-1 amd64 [installed,automatic]
cuda-cupti-dev-11-8/unknown,now 11.8.87-1 amd64 [installed,automatic]
cuda-cuxxfilt-11-8/unknown,now 11.8.86-1 amd64 [installed,automatic]
cuda-driver-dev-11-8/unknown,now 11.8.89-1 amd64 [installed,automatic]
cuda-gdb-11-8/unknown,now 11.8.86-1 amd64 [installed,automatic]
cuda-keyring/unknown,now 1.0-1 all [installed,upgradable to: 1.1-1]
cuda-libraries-11-8/unknown,now 11.8.0-1 amd64 [installed]
cuda-libraries-dev-11-8/unknown,now 11.8.0-1 amd64 [installed]
cuda-memcheck-11-8/unknown,now 11.8.86-1 amd64 [installed,automatic]
cuda-minimal-build-11-8/unknown,now 11.8.0-1 amd64 [installed]
cuda-nsight-compute-11-8/unknown,now 11.8.0-1 amd64 [installed]
cuda-nvcc-11-8/unknown,now 11.8.89-1 amd64 [installed,automatic]
cuda-nvdisasm-11-8/unknown,now 11.8.86-1 amd64 [installed,automatic]
cuda-nvml-dev-11-8/unknown,now 11.8.86-1 amd64 [installed]
cuda-nvprof-11-8/unknown,now 11.8.87-1 amd64 [installed]
cuda-nvprune-11-8/unknown,now 11.8.86-1 amd64 [installed,automatic]
cuda-nvrtc-11-8/unknown,now 11.8.89-1 amd64 [installed,automatic]
cuda-nvrtc-dev-11-8/unknown,now 11.8.89-1 amd64 [installed,automatic]
cuda-nvtx-11-8/unknown,now 11.8.86-1 amd64 [installed]
cuda-profiler-api-11-8/unknown,now 11.8.86-1 amd64 [installed,automatic]
cuda-sanitizer-11-8/unknown,now 11.8.86-1 amd64 [installed,automatic]
cuda-toolkit-11-8-config-common/unknown,now 11.8.89-1 all [installed,automatic]
cuda-toolkit-11-config-common/unknown,now 11.8.89-1 all [installed,automatic]
cuda-toolkit-config-common/unknown,now 12.3.52-1 all [installed,upgradable to: 12.3.101-1]
libnccl-dev/unknown,now 2.15.5-1+cuda11.8 amd64 [installed,upgradable to: 2.19.3-1+cuda12.3]
libnccl2/unknown,now 2.15.5-1+cuda11.8 amd64 [installed,upgradable to: 2.19.3-1+cuda12.3]
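To confirm that every node ends up with the same NCCL packages, listings like the one above can be compared programmatically. This is a rough sketch that assumes the listing comes from `apt list --installed` (the line format matches) and that each node's output has already been collected as text. `parse_apt_list` and `nccl_mismatches` are hypothetical helper names.

```python
def parse_apt_list(text):
    """Map package name -> installed version from `apt list --installed` style lines.

    Expected line shape: name/source,now VERSION ARCH [status]
    Lines that do not match this shape are skipped.
    """
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if "/" not in line:
            continue
        name, rest = line.split("/", 1)
        fields = rest.split()
        if len(fields) >= 2:
            versions[name] = fields[1]
    return versions


def nccl_mismatches(per_node_text, packages=("libnccl2", "libnccl-dev")):
    """Return {package: {node: version_or_None}} for packages that differ across nodes.

    per_node_text: {node_name: apt listing text captured on that node}
    """
    parsed = {node: parse_apt_list(text) for node, text in per_node_text.items()}
    result = {}
    for pkg in packages:
        versions = {node: pkgs.get(pkg) for node, pkgs in parsed.items()}
        if len(set(versions.values())) > 1:
            result[pkg] = versions
    return result


# One line taken from the listing above:
sample = "libnccl2/unknown,now 2.15.5-1+cuda11.8 amd64 [installed,upgradable to: 2.19.3-1+cuda12.3]"
print(parse_apt_list(sample))  # {'libnccl2': '2.15.5-1+cuda11.8'}
```

An empty dict from `nccl_mismatches` means the system-level `libnccl2` and `libnccl-dev` versions agree on all nodes that were compared.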

