Install the NVIDIA Driver
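
The driver installation itself depends on your distribution and GPU; as a minimal sketch, assuming a standard Ubuntu host (the driver branch 535 below is only an example, pick the one recommended for your GPU):

# Hypothetical Ubuntu example; package names and driver branches vary
sudo apt update
sudo ubuntu-drivers devices            # list the recommended driver packages for this GPU
sudo apt install -y nvidia-driver-535  # example branch, adjust as needed
sudo reboot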

After the driver is installed successfully, run nvidia-smi to check the driver version and CUDA version:

$ nvidia-smi
Thu Oct 12 11:29:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10-4Q       On   | 00000000:00:07.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    128MiB /  3932MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Install nvidia-container-runtime

  1. Add the Apt repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update
  2. Install nvidia-container-runtime
apt install -y nvidia-container-runtime
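
As an optional sanity check, you can confirm that the runtime components are installed and can see the GPU:

which nvidia-container-runtime
nvidia-container-cli info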

Install K3s

If K3s is already installed, you only need to restart it so that it picks up the newly installed runtime: systemctl restart k3s

Otherwise, install K3s:

curl -ksL get.k3s.io | sh -

For more installation documentation, see https://docs.k3s.io/zh/installation
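
Before moving on, it may be worth confirming that K3s itself is healthy (kubectl is bundled with K3s):

sudo systemctl status k3s
sudo k3s kubectl get nodes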

Confirm that K3s has found nvidia-container-runtime

grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

The output should look like this:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
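
If the grep prints nothing, make sure the runtime binary is actually on the PATH and restart K3s so that it regenerates config.toml:

which nvidia-container-runtime
sudo systemctl restart k3s
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml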

Install nvidia-device-plugin

nvidia-device-plugin exposes GPUs as a Kubernetes resource (nvidia.com/gpu) that Pods can request.

  1. Create a RuntimeClass

    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: nvidia
    handler: nvidia
    
  2. Add the Helm repository

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    
  3. Install the chart

    helm upgrade -i nvdp nvdp/nvidia-device-plugin \
      --namespace nvidia-device-plugin \
      --set runtimeClassName=nvidia \
      --create-namespace \
      --version 0.14.1
    

    --set runtimeClassName=nvidia is required: K3s auto-detects nvidia-container-runtime but does not configure it as the default runtime, so the device plugin Pods must explicitly use the nvidia RuntimeClass.
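
Once the chart is installed, the device plugin DaemonSet should be running on the GPU node and the node should start advertising the nvidia.com/gpu resource. A quick way to check:

kubectl -n nvidia-device-plugin get pods
kubectl describe node | grep -i 'nvidia.com/gpu'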

Verification

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.2.1
      args: ["nbody", "-gpu", "-benchmark"]
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all

You need to change the image tag to match your actual CUDA version for the Pod to start correctly; all available tags of the cuda-sample image are listed in NVIDIA's NGC container catalog.
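
Assuming the manifest above is saved as nbody-gpu-benchmark.yaml (the filename is arbitrary), apply it with:

kubectl apply -f nbody-gpu-benchmark.yaml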

Use kubectl logs to view the Pod's output:

$ kubectl logs -f nbody-gpu-benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA A10-4Q]
73728 bodies, total time for 10 iterations: 478.497 ms
= 113.602 billion interactions per second
= 2272.040 single-precision GFLOP/s at 20 flops per interaction

Success!
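
To clean up afterwards, the benchmark Pod can simply be deleted (keep the nvidia RuntimeClass for future GPU workloads):

kubectl delete pod nbody-gpu-benchmark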