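【Failed to initialize NVML: Driver/library version mismatch】
Check whether the installed driver versions are out of sync (if all three checks below report the same version, the driver is fine).
# Show the version of the NVIDIA kernel module currently loaded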
cat /proc/driver/nvidia/version
# Check the dpkg log for driver package installs and upgrades
grep nvidia-compute-utils /var/log/dpkg.log
# List the installed NVIDIA packages
dpkg -l | grep nvidia
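The comparison above can be partly automated. The following is a minimal sketch, not a definitive check: it assumes the kernel module is named nvidia and that the module on disk (as reported by modinfo) was installed together with the user-space libraries, so a difference between the loaded and on-disk versions suggests the mismatch. The helper names nvrm_version and compare_versions are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first dotted version number
# (e.g. 535.129.03) found on the "NVRM version" line read from stdin.
nvrm_version() {
    grep -m1 'NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1
}

# Hypothetical helper: compare the loaded and the installed version strings.
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: loaded=$1 installed=$2"
    fi
}

# Run the live check only when the driver's proc file is readable.
# Assumption: the on-disk module is named "nvidia".
if [ -r /proc/driver/nvidia/version ] && command -v modinfo >/dev/null 2>&1; then
    loaded=$(nvrm_version < /proc/driver/nvidia/version)
    installed=$(modinfo -F version nvidia 2>/dev/null || true)
    compare_versions "$loaded" "$installed"
fi
```

If the script reports a mismatch, the running module is probably older than the installed packages; rebooting, or reinstalling the driver as described below, should bring them back in line.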
Uninstall the NVIDIA driver
# Quote the patterns so the shell passes them to apt unexpanded
apt purge 'nvidia-*'
apt purge 'libnvidia-*'
apt autoremove
Install the driver
https://developer.nvidia.com/cuda-toolkit-archive
【Failed to initialize NVML: Unknown Error】
On three servers, run:
docker info | grep -i cgroup
ServerA:
Cgroup Driver: cgroupfs
Cgroup Version: 2
cgroupns
ServerB, ServerC:
Cgroup Driver: systemd
Cgroup Version: 2
cgroupns
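ServerA has never lost its GPUs. On ServerB and ServerC, running systemctl daemon-reload on the host makes the containers lose their GPUs immediately, with the container showing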
No devices were found
or
Failed to initialize NVML: Unknown Error
Solution:
Check whether Docker's Cgroup Driver is cgroupfs:
docker info | grep -i cgroup
Cgroup Driver: systemd
If it is cgroupfs, the problem does not occur. If it is systemd, do the following:
vim /etc/docker/daemon.json
{
"exec-opts": [
"native.cgroupdriver=cgroupfs"
],
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
}
systemctl restart docker
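After the restart, it is worth confirming that the change took effect. The following is a minimal sketch that parses the same docker info output used above; the helper names cgroup_driver and needs_fix are hypothetical, and the script only reports the driver in use without changing anything.

```shell
#!/bin/sh
# Hypothetical helper: print the "Cgroup Driver" value from
# `docker info` text read on stdin.
cgroup_driver() {
    sed -n 's/^[[:space:]]*Cgroup Driver:[[:space:]]*//p' | head -n1
}

# Hypothetical helper: say whether the daemon.json change is still needed.
needs_fix() {
    case "$1" in
        cgroupfs) echo "ok: cgroupfs in use" ;;
        systemd)  echo "affected: systemd cgroup driver" ;;
        *)        echo "unknown: $1" ;;
    esac
}

# Run the live check only when the docker CLI is available.
if command -v docker >/dev/null 2>&1; then
    driver=$(docker info 2>/dev/null | cgroup_driver)
    needs_fix "$driver"
fi
```

If the output is still "affected", daemon.json was probably not picked up; check that the file is valid JSON and that Docker was actually restarted.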