Install the NVIDIA driver
Omitted here.
Once the driver is installed, run nvidia-smi to check the driver version and the CUDA version:
$ nvidia-smi
Thu Oct 12 11:29:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10-4Q       On   | 00000000:00:07.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    128MiB /  3932MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
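The CUDA version in the nvidia-smi header is the highest CUDA version the installed driver supports, and it is the number that matters later when picking a CUDA sample image. As a small sketch, the helper below pulls that field out of the header text. The function name cuda_version_from_smi is made up for this example, and it assumes the header format shown above.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the value
# of the "CUDA Version" header field (first match only).
cuda_version_from_smi() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage on a GPU host (not executed here):
#   nvidia-smi | cuda_version_from_smi
```

With the output above, this would print 11.4.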
Install nvidia-container-runtime
- Add the Apt repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& \
sudo apt-get update
- Install nvidia-container-runtime
sudo apt-get install -y nvidia-container-runtime
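Install K3s
If K3s is already installed, you only need to restart it so that it picks up the new runtime:
systemctl restart k3s
Otherwise, install K3s:
curl -sfL https://get.k3s.io | sh -
For more installation options, see https://docs.k3s.io/zh/installation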
Confirm that K3s has detected nvidia-container-runtime
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
The output should look like this:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
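For scripted setups, the same check can be turned into a pass/fail test. This is only a sketch: the function name has_nvidia_runtime is hypothetical, and it assumes the containerd section naming shown above, which may differ in other K3s releases.

```shell
# Hypothetical helper: succeed if the containerd config generated by K3s
# contains a runtime entry named "nvidia". The default path is the one
# used in the grep command above.
has_nvidia_runtime() {
  config="${1:-/var/lib/rancher/k3s/agent/etc/containerd/config.toml}"
  grep -qF 'runtimes."nvidia"' "$config"
}

# Usage on the K3s node (not executed here):
#   has_nvidia_runtime && echo "nvidia runtime detected" \
#     || echo "not detected: check the driver and runtime, then restart k3s"
```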
Install nvidia-device-plugin
nvidia-device-plugin exposes GPUs as a schedulable Kubernetes resource.
Create the RuntimeClass
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
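One way to create the RuntimeClass above, sketched here under the assumption that kubectl already points at the K3s cluster (K3s also ships it as k3s kubectl), is to pipe the manifest straight into kubectl apply. The function name runtimeclass_manifest is made up for this example.

```shell
# Hypothetical helper: print the RuntimeClass manifest from this section.
runtimeclass_manifest() {
  cat <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
}

# Usage against a running cluster (not executed here):
#   runtimeclass_manifest | kubectl apply -f -
#   kubectl get runtimeclass nvidia
```

The handler value must match the runtime name that appears in the containerd config, which is nvidia in the output shown earlier.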
Add the Helm repository
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
Install
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --set runtimeClassName=nvidia \
  --create-namespace \
  --version 0.14.1
--set runtimeClassName=nvidia is required because K3s auto-detects nvidia-container-runtime but does not configure it as the default runtime.
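Once the plugin Pod is running, each GPU node should advertise an nvidia.com/gpu resource. The sketch below filters a two-column node listing down to the nodes that report at least one GPU. The function name nodes_with_gpu is hypothetical, and the kubectl command in the usage comment is one possible way to produce its input.

```shell
# Hypothetical helper: read "NAME GPU" rows on stdin (no header) and print
# the names of nodes whose GPU column is a positive integer. Nodes without
# the resource show up as "<none>" in kubectl output and are skipped.
nodes_with_gpu() {
  awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage against a running cluster (not executed here):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_with_gpu
```

If no node name is printed, checking the plugin Pod's logs in the nvidia-device-plugin namespace is a reasonable next step.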
Verify
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.2.1
      args: ["nbody", "-gpu", "-benchmark"]
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
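For the Pod to start correctly, change the image tag to match your actual CUDA version. All available tags of the cuda-sample image are listed in the NVIDIA NGC catalog.
Use kubectl logs to view the Pod's output: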
$ kubectl logs -f nbody-gpu-benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6
> Compute 8.6 CUDA device: [NVIDIA A10-4Q]
73728 bodies, total time for 10 iterations: 478.497 ms
= 113.602 billion interactions per second
= 2272.040 single-precision GFLOP/s at 20 flops per interaction
Success!
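As a sanity check on the benchmark log, the throughput figures can be recomputed from the body count, the iteration count, and the total time. The sketch below assumes the usual all-pairs accounting (bodies squared interactions per iteration), which reproduces the logged values to within rounding. The function name nbody_throughput is made up for this example.

```shell
# Hypothetical helper: recompute n-body throughput from the log values.
#   $1 = bodies, $2 = iterations, $3 = total time in ms, $4 = flops per interaction
nbody_throughput() {
  awk -v n="$1" -v it="$2" -v ms="$3" -v f="$4" 'BEGIN {
    ips = n * n * it / (ms / 1000)   # interactions per second
    printf "%.1f billion interactions/s, %.1f GFLOP/s\n", ips / 1e9, ips * f / 1e9
  }'
}

# Values taken from the log above.
nbody_throughput 73728 10 478.497 20
# prints: 113.6 billion interactions/s, 2272.0 GFLOP/s
```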