NVIDIA Common Problems and Solutions
Handling GPU ECC Errors
1. Run nvidia-smi -q -d PAGE_RETIREMENT to find the GPU reporting errors and check its pending status. If it is Yes, there are ECC error addresses waiting to be retired; reboot the system or reset the GPU (nvidia-smi -r -i <GPU index, 0-7>) to bring it back to No.
2. If the application is still affected by ECC errors after the addresses are retired, and the GPU retired-pages count meets NVIDIA's RMA criteria, we run the fieldiag test; if it FAILs, contact the supplier for a GPU replacement.
3. GPU ECC error counts under the Volatile and Aggregate entries can be cleared with nvidia-smi -p 0 (volatile) or nvidia-smi -p 1 (aggregate).
4. If a GPU keeps reporting ECC errors, disable ECC on that card with nvidia-smi -e 0.
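The pending-retirement check in step 1 can be sketched as a small script. This is a minimal sketch, not an official tool: the here-string below is illustrative sample output (the field names follow the driver's text format); on a real node substitute the live `nvidia-smi -q -d PAGE_RETIREMENT` command.

```shell
#!/bin/sh
# Sketch: decide whether a GPU reset is needed based on the
# "Pending Page Blacklist" field of `nvidia-smi -q -d PAGE_RETIREMENT`.
# The sample text below is illustrative; on a real node use:
#   nvidia-smi -q -d PAGE_RETIREMENT
sample='
    Retired Pages
        Single Bit ECC              : 2
        Double Bit ECC              : 0
        Pending Page Blacklist      : Yes
'

# Extract the value after "Pending Page Blacklist :"
pending=$(printf '%s\n' "$sample" | awk -F': ' '/Pending Page Blacklist/ {print $2}')

if [ "$pending" = "Yes" ]; then
    echo "Pending=Yes: reset the GPU (nvidia-smi -r -i <idx>) or reboot the system"
else
    echo "Pending=No: no reset required"
fi
```

In a multi-GPU node, run the same check per GPU index so only the affected card is reset.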
CUDA P2P test fails / NVSwitch error / NCCL test performance is low
problem:
The CUDA P2P test fails, NVSwitch reports errors, or NCCL test performance is low.
solutions:
1. Reset the GPU
nvidia-smi -r
First stop the GPU-related services; otherwise the GPU cannot be reset:
systemctl stop docker.dcgm-exporter.service
systemctl stop docker.node-exporter.service
systemctl stop nvidia-fabricmanager.service
systemctl stop nvidia-dcgm.service
systemctl stop nvidia-persistenced.service
Then reset the GPU:
nvidia-smi -r
Start the GPU-related services again
systemctl start docker.dcgm-exporter.service
systemctl start docker.node-exporter.service
systemctl start nvidia-fabricmanager.service
systemctl start nvidia-dcgm.service
systemctl start nvidia-persistenced.service
2. If the above does not solve the problem, run the FD (fieldiag) hardware test. If the test passes, the hardware is good; if it fails, we consider replacing the GPU module.
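The stop → reset → start sequence above can be wrapped in a small script. This is a sketch, not an official tool: the service names are the ones listed in this runbook, and the DRY_RUN switch is an assumption added here for safety so the sequence can be reviewed before it touches a production node.

```shell
#!/bin/sh
# Sketch of the stop -> reset -> restart sequence for resetting GPUs.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"          # preview mode: show the command only
    else
        "$@"                 # live mode: actually execute it
    fi
}

# Service names as listed in this runbook; adjust to your node.
services="docker.dcgm-exporter.service docker.node-exporter.service \
nvidia-fabricmanager.service nvidia-dcgm.service nvidia-persistenced.service"

# Stop services that hold the GPU open, otherwise the reset fails
for s in $services; do run systemctl stop "$s"; done

# Reset all GPUs (add -i <idx> to reset a single one)
run nvidia-smi -r

# Bring the services back up
for s in $services; do run systemctl start "$s"; done
```

Run as root with `DRY_RUN=0` on the affected node to execute the sequence for real.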
XID 79
problem:
GPU has fallen off the bus
This error is typically caused by GPU driver or hardware issues; users may notice GPU instances disconnecting.
solutions:
1. In most cases, restarting the system resolves the issue.
2. Observe the output of nvidia-smi. Sometimes a program does not exit the GPU cleanly, leaving the GPU occupied by a process, GPU memory usage high, or the process exited without releasing GPU memory. Stop the GPU-related services, then start them again:
systemctl stop docker.dcgm-exporter.service
systemctl stop docker.node-exporter.service
systemctl stop nvidia-fabricmanager.service
systemctl stop nvidia-dcgm.service
systemctl stop nvidia-persistenced.service
systemctl start docker.dcgm-exporter.service
systemctl start docker.node-exporter.service
systemctl start nvidia-fabricmanager.service
systemctl start nvidia-dcgm.service
systemctl start nvidia-persistenced.service
If restarting the GPU-related services does not solve the problem, restart the system.
3. Boot the server from the flash drive into the PE system and run an FD hardware test. If it fails, disassemble and check the GPU module and its fastening screws (to rule out poor hardware contact); if the test still fails after that, consider replacing the GPU module. If it passes, the hardware is OK.
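One way to spot processes that exited without releasing GPU memory (step 2 above) is to query the compute-apps list. A minimal sketch: the CSV sample below stands in for live output of `nvidia-smi --query-compute-apps=pid,used_gpu_memory --format=csv,noheader`, and the PIDs in it are illustrative.

```shell
#!/bin/sh
# Sketch: list processes still holding GPU memory after a job "exits".
# The sample CSV stands in for the live command:
#   nvidia-smi --query-compute-apps=pid,used_gpu_memory --format=csv,noheader
sample='12345, 40250 MiB
23456, 1024 MiB'

# One line per process: pid, memory, unit
printf '%s\n' "$sample" | while IFS=', ' read -r pid mem unit; do
    echo "PID $pid still holds $mem $unit of GPU memory; stop it before resetting the GPU"
done
```

If such processes remain after the owning job is gone, kill them (or stop their parent service) before attempting `nvidia-smi -r`.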
XID 119
problem:
GSP RPC Timeout
This error is usually caused by a GPU system processor (GSP) bug triggered by the GPU driver.
solutions:
Disable GSP:
echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
update-initramfs -u
Then restart the system.
Verify that GSP was disabled successfully: the value should be 0.
grep EnableGpuFirmware /proc/driver/nvidia/params
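The verification step can be scripted. A minimal sketch, using an illustrative params line in place of the real /proc/driver/nvidia/params file (which also contains other EnableGpuFirmware* entries, hence the anchored match):

```shell
#!/bin/sh
# Sketch: confirm GSP firmware is disabled after the reboot.
# The sample line mimics /proc/driver/nvidia/params; on a real node use:
#   grep EnableGpuFirmware /proc/driver/nvidia/params
sample='EnableGpuFirmware: 0'

# Anchored match so EnableGpuFirmwareLogs (if present) is not picked up
value=$(printf '%s\n' "$sample" | awk -F': ' '/^EnableGpuFirmware:/ {print $2}')

if [ "$value" = "0" ]; then
    echo "GSP disabled"
else
    echo "GSP still enabled (value=$value); re-check the modprobe option and initramfs"
fi
```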
Other XID errors
For a complete reference, see https://docs.nvidia.com/deploy/xid-errors/index.html
XID 74: Nvlink ERROR
This error indicates that the GPU has detected a problem with an NVLink connection from this GPU to another GPU or to an NVSwitch; the fault may lie in this GPU itself or in the interconnected GPU card.
XID 92: High single-bit ECC error rate
This error indicates a high single-bit ECC error rate, possibly caused by a hardware or driver failure.
XID 48: Double Bit ECC Error
The Xid 48 event is reported when an uncorrectable error occurs on the GPU. The error is also fed back to the user’s application. It is usually necessary to reset the GPU or restart the CVM instance to clear this error.
XID 95: Uncontained ECC error
This error indicates that the GPU encountered an uncontained ECC error; applications using the GPU are stopped.
Automatic system restart
If the BMC detects that the CPU is repeatedly faulting or has been stuck for a long period, it resets the system and raises an alarm before the restart. Restarting the machine restores it to normal.
Check and repair the IB network adapter down status
Check the port status:
ibstat
Typical port states in the output:
- Physical state: LinkUp (the physical connection is OK)
- State: Down (the logical link has timed out and is not established)
- State: Polling (waiting for the SM to synchronize)
Solutions (3 methods)
1. Reset the port (the port restarts): ibportstate -C mlx5_x -P 1 reset
2. Reset the NIC firmware (the NIC restarts): mlxfwreset -d mlx5_x reset
3. Reseat the hardware (physically remove and reinsert the NIC)
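The diagnosis above can be sketched as a script that classifies a port from ibstat output. The sample block is trimmed, illustrative ibstat text; on a real node feed in the output of `ibstat mlx5_0` instead.

```shell
#!/bin/sh
# Sketch: classify an IB port from `ibstat` output.
# The sample stands in for `ibstat mlx5_0` on a real node.
sample="CA 'mlx5_0'
        Port 1:
                State: Down
                Physical state: LinkUp"

# Logical state (State:) vs physical state (Physical state:)
state=$(printf '%s\n' "$sample" | awk -F': ' '/^[[:space:]]*State:/ {print $2}')
phys=$(printf '%s\n' "$sample" | awk -F': ' '/Physical state:/ {print $2}')

if [ "$phys" = "LinkUp" ] && [ "$state" != "Active" ]; then
    echo "Physical link is up but logical state is $state: try ibportstate reset, then mlxfwreset, then reseat the NIC"
else
    echo "Port state: $state / $phys"
fi
```

A LinkUp physical state with a Down or Polling logical state matches the symptom this section addresses: the cable is fine, so the escalation order is port reset, NIC firmware reset, then physical reseat.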