NVIDIA Common Problems and Solutions
Handling GPU ECC Errors
1. Run nvidia-smi -q -d PAGE_RETIREMENT to find the GPU reporting errors and check its pending status. If it is Yes, there are ECC error addresses waiting to be retired; reboot the system or reset the GPU (nvidia-smi -r -i <GPU index, 0-7>) to bring it back to No.
2. If the application is still affected by ECC errors after the addresses are retired, and the GPU retired-pages count meets NVIDIA's RMA criteria, we run the fieldiag test; if it FAILs, contact the supplier for a GPU replacement.
3. GPU ECC error counts under the Volatile and Aggregate entries can be cleared with nvidia-smi -p 0 (volatile) or nvidia-smi -p 1 (aggregate).
4. If a GPU keeps reporting ECC errors, disable ECC on that card with nvidia-smi -e 0.
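The pending-retirement check in step 1 can be sketched as a small script. This is a minimal sketch, not an official tool: the here-string below is illustrative sample output (the field names follow the driver's text format); on a real node substitute the live `nvidia-smi -q -d PAGE_RETIREMENT` command.

```shell
#!/bin/sh
# Sketch: decide whether a GPU reset is needed based on the
# "Pending Page Blacklist" field of `nvidia-smi -q -d PAGE_RETIREMENT`.
# The sample text below is illustrative; on a real node use:
#   nvidia-smi -q -d PAGE_RETIREMENT
sample='
    Retired Pages
        Single Bit ECC              : 2
        Double Bit ECC              : 0
        Pending Page Blacklist      : Yes
'

# Extract the value after "Pending Page Blacklist :"
pending=$(printf '%s\n' "$sample" | awk -F': ' '/Pending Page Blacklist/ {print $2}')

if [ "$pending" = "Yes" ]; then
    echo "Pending=Yes: reset the GPU (nvidia-smi -r -i <idx>) or reboot the system"
else
    echo "Pending=No: no reset required"
fi
```

In a multi-GPU node, run the same check per GPU index so only the affected card is reset.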
CUDA P2P test fails / NVSwitch error / NCCL test performance is low
problem:
The CUDA P2P test fails, NVSwitch reports errors, or NCCL test performance is low.
solutions:
1. Reset the GPU
nvidia-smi -r
First stop the GPU-related services; otherwise the GPU cannot be reset:
systemctl stop docker.dcgm-exporter.service
systemctl stop docker.node-exporter.service
systemctl stop nvidia-fabricmanager.service
systemctl stop nvidia-dcgm.service
systemctl stop nvidia-persistenced.service
Then reset the GPU:
nvidia-smi -r
Start the GPU-related services again
systemctl start docker.dcgm-exporter.service
systemctl start docker.node-exporter.service
systemctl start nvidia-fabricmanager.service
systemctl start nvidia-dcgm.service
systemctl start nvidia-persistenced.service
2. If the above does not solve the problem, run the FD (fieldiag) hardware test. If the test passes, the hardware is good; if it fails, we consider replacing the GPU module.
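The stop → reset → start sequence above can be wrapped in a small script. This is a sketch, not an official tool: the service names are the ones listed in this runbook, and the DRY_RUN switch is an assumption added here for safety so the sequence can be reviewed before it touches a production node.

```shell
#!/bin/sh
# Sketch of the stop -> reset -> restart sequence for resetting GPUs.
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"          # preview mode: show the command only
    else
        "$@"                 # live mode: actually execute it
    fi
}

# Service names as listed in this runbook; adjust to your node.
services="docker.dcgm-exporter.service docker.node-exporter.service \
nvidia-fabricmanager.service nvidia-dcgm.service nvidia-persistenced.service"

# Stop services that hold the GPU open, otherwise the reset fails
for s in $services; do run systemctl stop "$s"; done

# Reset all GPUs (add -i <idx> to reset a single one)
run nvidia-smi -r

# Bring the services back up
for s in $services; do run systemctl start "$s"; done
```

Run as root with `DRY_RUN=0` on the affected node to execute the sequence for real.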
XID 79
problem:
GPU has fallen off the bus
This error is typically caused by GPU driver or hardware issues; users may notice GPU instances disconnecting.
solutions:
1. In most cases, restarting the system resolves the issue.
2. Observe the output of nvidia-smi. Sometimes a program does not exit the GPU cleanly, leaving the GPU occupied by a process, GPU memory usage high, or the process exited without releasing GPU memory. Stop the GPU-related services, then start them again:
systemctl stop docker.dcgm-exporter.service
systemctl stop docker.node-exporter.service
systemctl stop nvidia-fabricmanager.service
systemctl stop nvidia-dcgm.service
systemctl stop nvidia-persistenced.service
systemctl start docker.dcgm-exporter.service
systemctl start docker.node-exporter.service
systemctl start nvidia-fabricmanager.service
systemctl start nvidia-dcgm.service
systemctl start nvidia-persistenced.service
If restarting the GPU-related services does not solve the problem, restart the system.
3. Boot the server from the flash drive into the PE system and run an FD hardware test. If it fails, disassemble and check the GPU module and its fastening screws (to rule out poor hardware contact); if the test still fails after that, consider replacing the GPU module. If it passes, the hardware is OK.
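One way to spot processes that exited without releasing GPU memory (step 2 above) is to query the compute-apps list. A minimal sketch: the CSV sample below stands in for live output of `nvidia-smi --query-compute-apps=pid,used_gpu_memory --format=csv,noheader`, and the PIDs in it are illustrative.

```shell
#!/bin/sh
# Sketch: list processes still holding GPU memory after a job "exits".
# The sample CSV stands in for the live command:
#   nvidia-smi --query-compute-apps=pid,used_gpu_memory --format=csv,noheader
sample='12345, 40250 MiB
23456, 1024 MiB'

# One line per process: pid, memory, unit
printf '%s\n' "$sample" | while IFS=', ' read -r pid mem unit; do
    echo "PID $pid still holds $mem $unit of GPU memory; stop it before resetting the GPU"
done
```

If such processes remain after the owning job is gone, kill them (or stop their parent service) before attempting `nvidia-smi -r`.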
XID 119
problem:
GSP RPC Timeout
This error is usually caused by a GPU system processor (GSP) bug triggered by the GPU driver.
solutions:
Disable GSP:
echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
update-initramfs -u
Then restart the system.
Verify that GSP was disabled successfully: the value should be 0.
grep EnableGpuFirmware /proc/driver/nvidia/params
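The verification step can be scripted. A minimal sketch, using an illustrative params line in place of the real /proc/driver/nvidia/params file (which also contains other EnableGpuFirmware* entries, hence the anchored match):

```shell
#!/bin/sh
# Sketch: confirm GSP firmware is disabled after the reboot.
# The sample line mimics /proc/driver/nvidia/params; on a real node use:
#   grep EnableGpuFirmware /proc/driver/nvidia/params
sample='EnableGpuFirmware: 0'

# Anchored match so EnableGpuFirmwareLogs (if present) is not picked up
value=$(printf '%s\n' "$sample" | awk -F': ' '/^EnableGpuFirmware:/ {print $2}')

if [ "$value" = "0" ]; then
    echo "GSP disabled"
else
    echo "GSP still enabled (value=$value); re-check the modprobe option and initramfs"
fi
```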
Other XID errors
For a complete reference, see https://docs.nvidia.com/deploy/xid-errors/index.html
XID 74: Nvlink ERROR
This error indicates that the GPU has detected a problem with an NVLink connection from this GPU to another GPU or to an NVSwitch; the fault may lie in this GPU itself or in the interconnected GPU card.
XID 92: High single-bit ECC error rate
This error indicates a high single-bit ECC error rate, possibly caused by a hardware or driver failure.
XID 48: Double Bit ECC Error
The Xid 48 event is reported when an uncorrectable error occurs on the GPU. The error is also fed back to the user’s application. It is usually necessary to reset the GPU or restart the CVM instance to clear this error.
XID 95: Uncontained ECC error
This error indicates that the GPU encountered an uncontained ECC error; applications using the GPU are stopped.
Automatic system restart
If the BMC detects that the CPU is repeatedly faulting or has been stuck for a long period, it resets the system and raises an alarm before the restart. Restarting the machine restores it to normal.
Check and repair the IB network adapter down status
Check the port status:
ibstat
Typical port states in the output:
- Physical state: LinkUp (the physical connection is OK)
- State: Down (the logical link has timed out and is not established)
- State: Polling (waiting for the SM to synchronize)
Solutions (3 methods)
1. Reset the port (the port restarts): ibportstate -C mlx5_x -P 1 reset
2. Reset the NIC firmware (the NIC restarts): mlxfwreset -d mlx5_x reset
3. Reseat the hardware (physically remove and reinsert the NIC)
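The diagnosis above can be sketched as a script that classifies a port from ibstat output. The sample block is trimmed, illustrative ibstat text; on a real node feed in the output of `ibstat mlx5_0` instead.

```shell
#!/bin/sh
# Sketch: classify an IB port from `ibstat` output.
# The sample stands in for `ibstat mlx5_0` on a real node.
sample="CA 'mlx5_0'
        Port 1:
                State: Down
                Physical state: LinkUp"

# Logical state (State:) vs physical state (Physical state:)
state=$(printf '%s\n' "$sample" | awk -F': ' '/^[[:space:]]*State:/ {print $2}')
phys=$(printf '%s\n' "$sample" | awk -F': ' '/Physical state:/ {print $2}')

if [ "$phys" = "LinkUp" ] && [ "$state" != "Active" ]; then
    echo "Physical link is up but logical state is $state: try ibportstate reset, then mlxfwreset, then reseat the NIC"
else
    echo "Port state: $state / $phys"
fi
```

A LinkUp physical state with a Down or Polling logical state matches the symptom this section addresses: the cable is fine, so the escalation order is port reset, NIC firmware reset, then physical reseat.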