When the official vGPU driver is installed, CUDA and the other userspace components are not included by default, so enabling vGPU on the host leaves CUDA unusable there. In some special scenarios, however, you may want the host to use the card's CUDA capability and vGPU at the same time.
The telltale symptom is that the CUDA Version field in nvidia-smi reads N/A.
Let's install the ever-popular PyTorch and try calling CUDA. To keep PyTorch from polluting the host environment, venv is used here for isolation.
First create the venv environment:
python3 -m venv venv
Activate it and install PyTorch (the same wheel is used again later in the LXC section):
source ./venv/bin/activate
pip install pandas torch==2.3.0+cu121 -f https://mirror.sjtu.edu.cn/pytorch-wheels/torch_stable.html --no-cache-dir
Once installation finishes, verify with a simple script:
import inspect
from collections import defaultdict

import pandas as pd
import torch
from torch.utils import benchmark

pd.options.display.precision = 3


def var_dict(*args):
    # map each tensor back to its variable name in the caller's frame,
    # so the benchmark stmt can reference it by name
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return dict([(name, val) for name, val in callers_local_vars if val is arg][0]
                for arg in args)


def walltime(stmt, arg_dict, duration=3):
    # median wall time of stmt, measured with torch's benchmark Timer
    return benchmark.Timer(stmt=stmt, globals=arg_dict).blocked_autorange(
        min_run_time=duration).median


print(torch.cuda.get_device_name(0))

matmul_tflops = defaultdict(lambda: {})
for n in [128, 512]:
    for dtype in (torch.float32, torch.float16):
        a = torch.randn(n, n, dtype=dtype).cuda()
        b = torch.randn(n, n, dtype=dtype).cuda()
        t = walltime('a @ b', var_dict(a, b))
        matmul_tflops[f'n={n}'][dtype] = 2 * n ** 3 / t / 1e12
        del a, b

print(pd.DataFrame(matmul_tflops))
The script does two things: it prints the GPU's name, then benchmarks matrix multiplication in fp32 and fp16 (i.e. half precision) and reports the throughput in TFLOPS.
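The TFLOPS figure comes from the standard matmul operation count: multiplying two n×n matrices costs 2n³ floating-point operations, so dividing by the measured time and by 1e12 gives TFLOPS. A quick sanity check of the arithmetic (the 44 µs timing is made up for illustration, chosen to land near the fp32 n=512 result measured later in this post):

```python
def matmul_tflops(n: int, seconds: float) -> float:
    """An n x n by n x n matmul costs 2*n^3 FLOPs; convert to TFLOPS."""
    return 2 * n ** 3 / seconds / 1e12

# a 512x512 matmul finishing in about 44 microseconds works out
# to roughly 6.1 TFLOPS
print(round(matmul_tflops(512, 44e-6), 1))  # prints 6.1
```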
(venv) root@pve:/mnt/nvme0n1p5/pytorch# python3 test.py
Traceback (most recent call last):
File "/mnt/nvme0n1p5/pytorch/test.py", line 20, in <module>
print(torch.cuda.get_device_name(0))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/nvme0n1p5/pytorch/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 414, in get_device_name
return get_device_properties(device).name
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/nvme0n1p5/pytorch/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
_lazy_init() # will define _get_device_properties
^^^^^^^^^^^^
File "/mnt/nvme0n1p5/pytorch/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
(venv) root@pve:/mnt/nvme0n1p5/pytorch# nvidia-smi
Sat Jun 1 23:48:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.10 Driver Version: 550.54.10 CUDA Version: N/A |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA CMP 40HX On | 00000000:01:00.0 Off | N/A |
| 44% 55C P8 21W / 184W | 61MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
As shown above, with an NVIDIA card present and the NVIDIA vGPU driver installed, PyTorch cannot reach anything CUDA-related.
Installing the merged driver is no different from installing the vGPU driver; that procedure is covered in an older article on this site (the system there is PVE 7, but the steps have not changed). With the patched package in hand, run the command below to install the customized merged driver:
./NVIDIA-Linux-x86_64-550.54.14-merged-vgpu-kvm-patched-kernel6.8-OA5500.run -m kernel
Do not forget the -m kernel flag; accidentally installing the kernel-open variant will ruin your day.
The telltale sign of success is that the CUDA Version field no longer reads N/A but instead shows the highest supported version, 12.4.
After the driver is installed, check the device nodes with:
ls /dev/nvidia* -l
root@pve:~# ls /dev/nvidia* -l
crw-rw-rw- 1 root root 195, 0 Jun 1 23:09 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jun 1 23:09 /dev/nvidiactl
crw-rw-rw- 1 root root 507, 0 Jun 1 23:53 /dev/nvidia-uvm
crw-rw-rw- 1 root root 507, 1 Jun 1 23:53 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 508, 1 Jun 2 00:03 /dev/nvidia-vgpu1
/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 Jun 1 23:28 nvidia-cap1
cr--r--r-- 1 root root 511, 2 Jun 1 23:28 nvidia-cap2
If the output differs noticeably from the above, run:
modprobe nvidia-uvm && /usr/bin/nvidia-modprobe -c0 -u
The command prints nothing; once it finishes, run the check again. At a minimum nvidia-uvm must now appear, otherwise CUDA will most likely still misbehave.
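The presence check can be scripted. Below is a hypothetical helper (not part of the driver) that reports which of the nodes from the listing above are still missing; the per-vGPU nvidia-vgpu* nodes are left out since they come and go with VMs:

```python
import os

# Device nodes that should exist on the host before CUDA is usable,
# taken from the ls output above
REQUIRED = ["/dev/nvidia0", "/dev/nvidiactl",
            "/dev/nvidia-uvm", "/dev/nvidia-uvm-tools"]

def missing_nodes(nodes):
    """Return the device nodes that do not exist yet."""
    return [n for n in nodes if not os.path.exists(n)]

if __name__ == "__main__":
    gone = missing_nodes(REQUIRED)
    if gone:
        # same remedy as in the text: load nvidia-uvm and create its nodes
        print("missing:", *gone)
        print("try: modprobe nvidia-uvm && /usr/bin/nvidia-modprobe -c0 -u")
    else:
        print("all CUDA device nodes present")
```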
Next, verify that PyTorch can use CUDA both on the host and inside an LXC container.
(venv) root@pve:/mnt/nvme0n1p5/pytorch# nvidia-smi vgpu
Sun Jun 2 00:06:07 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA CMP 40HX | 00000000:01:00.0 | 1% |
| 3251634216 GRID RTX600... | c269... windows10,debug-... | 0% |
+---------------------------------+------------------------------+------------+
(venv) root@pve:/mnt/nvme0n1p5/pytorch# python test.py
NVIDIA CMP 40HX
n=128 n=512
torch.float32 0.392 6.123
torch.float16 0.355 0.943
(venv) root@pve:/mnt/nvme0n1p5/pytorch# uname -a
Linux pve 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux
As shown above, PyTorch now works normally with vGPU enabled.
Running workloads directly on the host is mostly just for show, though; real workloads belong in a VM or an LXC container. VMs are exactly what a vGPU driver is for and everyone knows that drill, so it will not be repeated here. Let's try LXC instead.
This walkthrough uses debian-11-standard_11.7-1_amd64.tar.zst as the template. Create an ordinary LXC container as usual.
In this example the container's ID is 121, so the config is edited with:
vim /etc/pve/lxc/121.conf
Then, as shown in the figure, add the following:
lxc.cgroup2.devices.allow: c *:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-caps dev/nvidia-caps none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
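The mount entries above all follow one pattern, so if your host exposes a different set of /dev/nvidia* nodes you can generate the lines rather than type them. A small sketch (the helper name is made up, not a PVE tool):

```python
def mount_entry(dev: str) -> str:
    """Build one PVE lxc.mount.entry line for a device node.

    The container-side path is the host path minus the leading slash,
    matching the config block above."""
    return (f"lxc.mount.entry: {dev} {dev.lstrip('/')} "
            f"none bind,optional,create=file")

# reproduce the config block for the node set used in this post
print("lxc.cgroup2.devices.allow: c *:* rwm")
for dev in ["/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-caps",
            "/dev/nvidia-uvm", "/dev/nvidia-uvm-tools"]:
    print(mount_entry(dev))
```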
Note that running Docker inside the LXC requires additional config entries. This post only covers the CUDA part, so they are omitted; add the Docker-related entries if you need Docker.
Start the container and run ls /dev/nvidia* -l first to make sure everything that should be passed through actually made it in.
Next, push the merged driver in from the host:
pct push 121 ./NVIDIA-Linux-x86_64-550.54.14-merged-vgpu-kvm-patched-kernel6.8-OA5500.run /root/NVIDIA-Linux-x86_64-550.54.14-merged-vgpu-kvm-patched-kernel6.8-OA5500.run
The command prints nothing. Enter the LXC, check that the file arrived, and run the installer; note that the flag here is --no-kernel-module:
./NVIDIA-Linux-x86_64-550.54.14-merged-vgpu-kvm-patched-kernel6.8-OA5500.run --no-kernel-module
Then just click through the installer, and confirm afterwards that the merged driver installed successfully.
To test PyTorch again, first switch to a faster apt mirror:
sed -i 's|^deb http://ftp.debian.org|deb https://mirrors.ustc.edu.cn|g' /etc/apt/sources.list
sed -i 's|^deb http://security.debian.org|deb https://mirrors.ustc.edu.cn/debian-security|g' /etc/apt/sources.list
sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
apt update
Then install venv support (which also pulls in pip):
apt install python3.11-venv
Create a venv; the LXC is isolated already, but old habits die hard:
python3 -m venv venv
source ./venv/bin/activate
Then install PyTorch:
pip config set global.index-url https://mirror.sjtu.edu.cn/pypi/web/simple
pip install pandas torch==2.3.0+cu121 -f https://mirror.sjtu.edu.cn/pytorch-wheels/torch_stable.html --no-cache-dir
Push the test script in, the same way as before:
pct push 121 /mnt/nvme0n1p5/pytorch/test.py /root/test.py
Adjust the path and the vmid to match your own setup.
(venv) root@CT121:~# ls -l
total 447568
-rwxr-xr-x 1 root root 458301317 Jun 1 16:25 NVIDIA-Linux-x86_64-550.54.14-merged-vgpu-kvm-patched-kernel6.8-OA5500.run
-rw-r--r-- 1 root root 960 Jun 1 16:42 test.py
drwxr-xr-x 6 root root 4096 Jun 1 16:36 venv
Then try running the test script (judging by the file size and the extra column, the container's copy was tweaked to also test n=2048):
(venv) root@CT121:~# ls -l
total 447568
-rwxr-xr-x 1 root root 458301317 Jun 1 16:25 NVIDIA-Linux-x86_64-550.54.14-merged-vgpu-kvm-patched-kernel6.8-OA5500.run
-rw-r--r-- 1 root root 966 Jun 1 16:46 test.py
drwxr-xr-x 6 root root 4096 Jun 1 16:36 venv
(venv) root@CT121:~# python test.py
NVIDIA CMP 40HX
n=128 n=512 n=2048
torch.float32 0.404 5.891 6.457
torch.float16 0.349 0.904 0.917
(venv) root@CT121:~# nvidia-smi vgpu
Sat Jun 1 16:48:55 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 |
|---------------------------------+------------------------------+------------+
| GPU Name | Bus-Id | GPU-Util |
| vGPU ID Name | VM ID VM Name | vGPU-Util |
|=================================+==============================+============|
| 0 NVIDIA CMP 40HX | 00000000:01:00.0 | 3% |
| 3251634326 GRID RTX600... | c269... windows10,debug-... | 3% |
+---------------------------------+------------------------------+------------+
(venv) root@CT121:~# uname -a
Linux CT121 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux
Above: the test running alongside a VM that is using vGPU.
With the merged driver, the host keeps its CUDA capability while vGPU remains enabled.
Be warned, though: any CUDA workload on the host consumes VRAM, which can leave a VM unable to start its vGPU at all. It is therefore safer to start the VMs first and only then run CUDA workloads on the host.
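One way to play it safe is to check the card's free VRAM before launching anything CUDA-heavy on the host. A sketch around nvidia-smi's query mode (the helper names are made up; any threshold you enforce on the headroom is up to you):

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_mem(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line in MiB, e.g. '61, 8192' -> (61, 8192)."""
    used, total = (int(x) for x in line.split(","))
    return used, total

def vram_headroom_mib() -> int:
    """Free VRAM (MiB) on GPU 0, as reported by nvidia-smi."""
    out = subprocess.run(QUERY, capture_output=True, text=True,
                         check=True).stdout
    used, total = parse_mem(out.splitlines()[0])
    return total - used

if __name__ == "__main__" and shutil.which("nvidia-smi"):
    # e.g. decide here whether a host-side CUDA job would starve a vGPU VM
    print("headroom:", vram_headroom_mib(), "MiB")
```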
Honestly, pulling all these tricks on the host is not really recommended; a VM remains the better way to meet your CUDA needs.