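Master node A runs Windows 11 + WSL2; assume its address is 192.168.1.5 and the port is 8888.
Worker node B runs Ubuntu 22.04 + Docker; assume its address is 192.168.1.6 and the port is 8888.
On master node A, call init_process_group(backend="nccl",init_method="tcp://192.168.1.5:8888",rank=0,world_size=2)
On worker node B, run the Docker container with host networking and call init_process_group(backend="nccl",init_method="tcp://192.168.1.5:8888",rank=1,world_size=2)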
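The two calls differ only in rank, so they can be driven from one helper. The following is a minimal sketch, assuming one process per node as in the setup here; build_init_kwargs and init_distributed are hypothetical helper names, not part of PyTorch. torch is imported lazily so the argument-building part can be checked on a machine without a GPU.

```python
def build_init_kwargs(rank, world_size, master_addr="192.168.1.5", master_port=8888):
    """Build the keyword arguments passed to torch.distributed.init_process_group.

    Hypothetical helper: the default address and port are the ones assumed
    for master node A in this post.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} must be in [0, {world_size})")
    return {
        "backend": "nccl",
        # Every node points at the master node's address, never at its own.
        "init_method": f"tcp://{master_addr}:{master_port}",
        "rank": rank,
        "world_size": world_size,
    }


def init_distributed(rank, world_size):
    """Join the process group; blocks until all world_size processes connect."""
    import torch.distributed as dist  # lazy import: needs PyTorch with NCCL support

    dist.init_process_group(**build_init_kwargs(rank, world_size))
```

Node A would call init_distributed(0, 2) and node B would call init_distributed(1, 2).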
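If the Docker container on node B uses bridge networking, the only interface ifname can point to is the container's eth0 (172.17.x.x). Master node A cannot initiate a connection to 172.17.x.x, so DDP's init_process_group() times out and fails!
The correct approach is therefore to run node B's Docker container with host networking and set ifname to the physical machine's real network interface, that is, the one on the network that node A can reach.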
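The interface choice can be sketched in code. This assumes that "ifname" here means the NCCL_SOCKET_IFNAME environment variable and that both machines sit on a /24 network; pick_interface, export_ifname and the interface name enp3s0 are hypothetical and used only for illustration.

```python
import ipaddress
import os

# Docker's default bridge network; addresses here are not reachable from other hosts.
DOCKER_DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def pick_interface(interfaces, master_ip, prefix_len=24):
    """Return the name of the first interface on the same subnet as the master.

    interfaces maps interface name -> IPv4 address string.
    prefix_len=24 is an assumption about the LAN, adjust it to the real netmask.
    """
    master_net = ipaddress.ip_network(f"{master_ip}/{prefix_len}", strict=False)
    for name, addr in interfaces.items():
        ip = ipaddress.ip_address(addr)
        if ip in DOCKER_DEFAULT_BRIDGE:
            continue  # bridge-mode container address, node A cannot connect to it
        if ip in master_net:
            return name
    return None


def export_ifname(name):
    """Tell NCCL which interface to use; call this before init_process_group."""
    os.environ["NCCL_SOCKET_IFNAME"] = name
```

With bridge networking the container only sees {"eth0": "172.17.0.2"} and pick_interface returns None; with host networking it also sees the host's real interface, which is the one to export.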
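local_rank: the index of the GPU a process uses on its own machine, unique within that machine. On a single machine with four GPUs, that machine runs 4 processes, with local_rank 0, 1, 2 and 3.
rank of init_process_group: the globally unique index of a process across all nodes.
world_size of init_process_group: the total number of processes across all nodes, normally one per GPU. With 4 machines of 8 GPUs each, world_size = 4 x 8 = 32.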
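The relationship between these three values can be written out as arithmetic. This is a small sketch assuming one process per GPU and the same number of GPUs on every node; the layout rank = node_rank * gpus_per_node + local_rank is a common convention rather than something this post prescribes, and node_rank is a name introduced here for the index of the machine.

```python
def world_size(num_nodes, gpus_per_node):
    """Total number of processes, assuming one process per GPU."""
    return num_nodes * gpus_per_node


def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank of the process driving GPU local_rank on machine node_rank."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank
```

For 4 machines with 8 GPUs each, world_size(4, 8) gives 32 and the ranks run from global_rank(0, 0, 8) = 0 to global_rank(3, 7, 8) = 31.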
To install NCCL on WSL2, first register NVIDIA's CUDA repository keyring:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt install libnccl2 libnccl-dev
Check that NCCL was installed successfully:
find /usr -name "libnccl.so*"
/usr/lib/x86_64-linux-gnu/libnccl.so.2
/usr/lib/x86_64-linux-gnu/libnccl.so.2.19.3
/usr/lib/x86_64-linux-gnu/libnccl.so
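The same check can be done from Python, for example at the top of a training script. This is a minimal sketch using only the standard library; find_nccl_libs is a hypothetical helper that mirrors the find command and does not verify that the library actually loads.

```python
import fnmatch
import os


def find_nccl_libs(root="/usr"):
    """Return the sorted paths under root whose file name matches libnccl.so*."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in fnmatch.filter(filenames, "libnccl.so*"):
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)
```

An empty list from find_nccl_libs() suggests that libnccl2 is not installed under /usr.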
Download and build nccl-tests to test NCCL:
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
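Test NCCL
Single machine, single GPU:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
Single machine, multiple GPUs (replace <number of GPUs> with the GPU count):
./build/all_reduce_perf -b 8 -e 128M -f 2 -g <number of GPUs>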
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 146864 on ws device  0 [0x01] NVIDIA GeForce RTX 3060
#
#                                                    out-of-place                       in-place
#       size       count    type redop  root     time   algbw  busbw #wrong    time       algbw  busbw #wrong
#        (B)  (elements)                         (us)  (GB/s) (GB/s)           (us)      (GB/s) (GB/s)
           8           2   float   sum    -1    66.94    0.00   0.00      0    0.11        0.07   0.00      0
          16           4   float   sum    -1    66.76    0.00   0.00      0    0.16        0.10   0.00      0
          32           8   float   sum    -1    66.68    0.00   0.00      0    0.14        0.23   0.00      0
          64          16   float   sum    -1    65.82    0.00   0.00      0    0.10        0.61   0.00      0
         128          32   float   sum    -1    67.35    0.00   0.00      0    0.11        1.19   0.00      0
         256          64   float   sum    -1    67.83    0.00   0.00      0    0.11        2.37   0.00      0
         512         128   float   sum    -1    68.40    0.01   0.00      0    0.11        4.58   0.00      0
        1024         256   float   sum    -1    59.60    0.02   0.00      0    0.11        9.39   0.00      0
        2048         512   float   sum    -1    64.88    0.03   0.00      0    0.11       18.09   0.00      0
        4096        1024   float   sum    -1    64.29    0.06   0.00      0    0.22       18.92   0.00      0
        8192        2048   float   sum    -1    64.31    0.13   0.00      0    0.18       46.09   0.00      0
       16384        4096   float   sum    -1    71.41    0.23   0.00      0    0.21       79.59   0.00      0
       32768        8192   float   sum    -1    64.16    0.51   0.00      0    0.12      275.36   0.00      0
       65536       16384   float   sum    -1    76.67    0.85   0.00      0    0.14      462.99   0.00      0
      131072       32768   float   sum    -1    92.70    1.41   0.00      0    0.21      622.23   0.00      0
      262144       65536   float   sum    -1    65.35    4.01   0.00      0    0.11     2356.35   0.00      0
      524288      131072   float   sum    -1    76.23    6.88   0.00      0    0.11     4566.97   0.00      0
     1048576      262144   float   sum    -1    70.77   14.82   0.00      0    0.12     8727.22   0.00      0
     2097152      524288   float   sum    -1    44.88   46.72   0.00      0    0.12    17490.84   0.00      0
     4194304     1048576   float   sum    -1    90.52   46.34   0.00      0    0.11    37349.10   0.00      0
     8388608     2097152   float   sum    -1    147.4   56.92   0.00      0    0.11    75915.00   0.00      0
    16777216     4194304   float   sum    -1    261.8   64.09   0.00      0    0.12   143640.55   0.00      0
    33554432     8388608   float   sum    -1    528.4   63.50   0.00      0    0.11   297600.28   0.00      0
    67108864    16777216   float   sum    -1   1046.7   64.12   0.00      0    0.11   619657.10   0.00      0
   134217728    33554432   float   sum    -1   2333.1   57.53   0.00      0    0.13  1009155.85   0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0
#
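The busbw column and the average bus bandwidth are 0 in this run because only one GPU takes part. A short sketch of the arithmetic behind the columns, as I understand the nccl-tests performance notes: algbw is the message size divided by the time, and for all_reduce busbw is algbw scaled by 2(n-1)/n, which is 0 when n = 1. The function names below are hypothetical.

```python
def algbw_gbs(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes) from size and time."""
    # bytes per microsecond equals MB/s, so divide by 1e3 to get GB/s
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbs(algbw, n_ranks):
    """Bus bandwidth for all_reduce: algbw * 2 * (n - 1) / n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks
```

For the last out-of-place row, algbw_gbs(134217728, 2333.1) reproduces the reported algbw of about 57.53 GB/s, and allreduce_busbw_gbs(57.53, 1) is 0, matching the busbw column. A non-zero bus bandwidth needs a run with at least two GPUs.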