
Installing Kubernetes on DGX using Kubeadm

Created : 09/03/2023 | on Linux dgx 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Updated : 18/05/2023 | on Linux dgx 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Status: Draft

Writeup

If you had a previous installation

drain all nodes 
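
for example (run once per node; the node name is a placeholder and the flags are the usual ones for kubeadm-managed clusters):

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data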

reset kubeadm

sudo kubeadm reset --cri-socket=unix:///var/run/cri-dockerd.sock

reset changes to networking

sudo rm -rf /etc/cni/net.d
rm -rf $HOME/.kube

sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -F
sudo iptables -X
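
if kube-proxy ran in IPVS mode, the kubeadm reset docs also suggest clearing the IPVS tables (requires ipvsadm to be installed):

sudo ipvsadm --clear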

Fresh install

run the command below on the master node (control plane); note the --pod-network-cidr must match the Calico IP pool CIDR used later

sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --cri-socket=unix:///var/run/cri-dockerd.sock

set up kubectl access for your user

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
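
alternatively, if you are root, kubeadm init's output suggests:

export KUBECONFIG=/etc/kubernetes/admin.conf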

set up networking

install calico

install Calico first and then install calicoctl

note: my values.yaml

imagePullSecrets: {}

installation:
  enabled: true
  kubernetesProvider:
  cni:
    type: Calico

apiServer:
  enabled: true

certs:
  node:
    key:
    cert:
    commonName:
  typha:
    key:
    cert:
    commonName:
    caBundle:

resources: {}

tolerations:
- effect: NoExecute
  operator: Exists
- effect: NoSchedule
  operator: Exists

nodeSelector:
  kubernetes.io/os: linux

podAnnotations: {}

podLabels: {}

tigeraOperator:
  image: tigera/operator
  version: v1.29.3
  registry: quay.io
calicoctl:
  image: docker.io/calico/ctl
  tag: v3.25.1

calicoNetwork:
  bgp: Enabled
  ipPools:
  - cidr: 192.168.0.0/16
    encapsulation: VXLAN
    natOutgoing: Enabled
    nodeSelector: all()
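
with this values file, the install itself can be done with Helm; a sketch based on the Calico docs (the repo URL, chart, and namespace are Calico's documented defaults, not taken from this writeup), plus dropping calicoctl onto the PATH as a kubectl plugin:

helm repo add projectcalico https://docs.tigera.io/calico/charts
helm install calico projectcalico/tigera-operator --version v3.25.1 \
    -f values.yaml --namespace tigera-operator --create-namespace

curl -L https://github.com/projectcalico/calico/releases/download/v3.25.1/calicoctl-linux-amd64 -o kubectl-calico
chmod +x kubectl-calico && sudo mv kubectl-calico /usr/local/bin/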

confirm all the Calico pods are up and running

watch kubectl get pods -n calico-system

check the IP pool

kubectl calico ipam show

or if the cluster and calicoctl versions do not match

kubectl-calico ipam show --allow-version-mismatch

you will get something like

+----------+----------------+-----------+------------+--------------+
| GROUPING |      CIDR      | IPS TOTAL | IPS IN USE |   IPS FREE   |
+----------+----------------+-----------+------------+--------------+
| IP Pool  | 192.168.0.0/16 |     65536 | 7 (0%)     | 65529 (100%) |
+----------+----------------+-----------+------------+--------------+

get the calicoctl version with kubectl-calico version or calicoctl version

The aim is to make sure the cluster and the client have the same version. I get something like:

kubectl-calico version
Client Version:    v3.25.1
Git commit:        82dadbce1
Cluster Version:   v3.25.1
Cluster Type:      typha,kdd,k8s,operator,bgp,kubeadm

on the worker node

if it was previously used, ssh into it and run

sudo kubeadm reset

to reset the node, then clean up networking configs

sudo rm -rf /etc/cni/net.d

once it is all clean, run kubeadm reset once more

sudo kubeadm reset

if that fails

sudo systemctl stop kubelet
sudo systemctl stop <container-runtime-service>   # containerd or docker, whichever the node uses

sudo rm -rf /etc/kubernetes
sudo rm -rf /var/lib/kubelet
sudo rm -rf /var/lib/etcd
sudo rm -rf /var/lib/cni
sudo rm -rf /etc/cni/net.d
sudo rm -rf /var/run/kubernetes
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd

then

sudo reboot

and try kubeadm reset again

Then get a join token from the master node and run the join command on the worker node

kubeadm token create --print-join-command
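
this prints something like the following (the address, token, and hash here are placeholders); run it on the worker, adding --cri-socket if the worker has more than one runtime installed:

kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>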

to be able to use kubectl from the worker node, copy $HOME/.kube/config from the master to the worker (optional)
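
for example (the user name is a placeholder; dgx is the worker from this setup):

ssh <user>@dgx 'mkdir -p ~/.kube'
scp $HOME/.kube/config <user>@dgx:~/.kube/config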

I modified /etc/containerd/config.toml as below to be able to use the gpu operator; the key parts are the nvidia runtime entry and default_runtime_name = "nvidia"

disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  format = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  tcp_address = ""
  tcp_tls_ca = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.gc.v1.scheduler"]
    deletion_threshold = 0
    mutation_threshold = 100
    pause_threshold = 0.02
    schedule_delay = "0s"
    startup_delay = "100ms"

  [plugins."io.containerd.grpc.v1.cri"]
    device_ownership_from_security_context = false
    disable_apparmor = false
    disable_cgroup = false
    disable_hugetlb_controller = true
    disable_proc_mount = false
    disable_tcp_service = true
    enable_selinux = false
    enable_tls_streaming = false
    enable_unprivileged_icmp = false
    enable_unprivileged_ports = false
    ignore_image_defined_volumes = false
    max_concurrent_downloads = 3
    max_container_log_line_size = 16384
    netns_mounts_under_state_dir = false
    restrict_oom_score_adj = false
    sandbox_image = "registry.k8s.io/pause:3.6"
    selinux_category_range = 1024
    stats_collect_period = 10
    stream_idle_timeout = "4h0m0s"
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    systemd_cgroup = false
    tolerate_missing_hugetlb_controller = true
    unset_seccomp_profile = ""

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      conf_template = ""
      ip_pref = ""
      max_conf_num = 1

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
    runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
    SystemdCgroup = true
      

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = ""
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = ""
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = ""

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""

  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"

  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"

  [plugins."io.containerd.internal.v1.tracing"]
    sampling_ratio = 1.0
    service_name = "containerd"

  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"

  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false

  [plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "runc"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false

  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
    sched_core = false

  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]

  [plugins."io.containerd.service.v1.tasks-service"]
    rdt_config_file = ""

  [plugins."io.containerd.snapshotter.v1.aufs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.btrfs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.devmapper"]
    async_remove = false
    base_image_size = ""
    discard_blocks = false
    fs_options = ""
    fs_type = ""
    pool_name = ""
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.native"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.overlayfs"]
    root_path = ""
    upperdir_label = false

  [plugins."io.containerd.snapshotter.v1.zfs"]
    root_path = ""

  [plugins."io.containerd.tracing.processor.v1.otlp"]
    endpoint = ""
    insecure = false
    protocol = ""

[proxy_plugins]

[stream_processors]

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar"

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar+gzip"

[timeouts]
  "io.containerd.timeout.bolt.open" = "0s"
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[ttrpc]
  address = ""
  gid = 0
  uid = 0
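
after editing the file, restart containerd so the new config takes effect:

sudo systemctl restart containerd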


then add the dgx node with the join command

install the gpu operator with these instructions

Note: I’ve used the config file above as my /etc/containerd/config.toml
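
for reference, the documented route is via NVIDIA's Helm repo; a sketch (the repo URL and chart name are the ones NVIDIA documents, not from this writeup):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator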

when working, the gpu node looks like this

kubectl describe node dgx
Name:               dgx
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.CLZERO=true
                    feature.node.kubernetes.io/cpu-cpuid.CPBOOST=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSFFV=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK=true
                    feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD=true
                    feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW=true
                    feature.node.kubernetes.io/cpu-cpuid.MCOMMIT=true
                    feature.node.kubernetes.io/cpu-cpuid.MSRIRC=true
                    feature.node.kubernetes.io/cpu-cpuid.RDPRU=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-cpuid.SUCCOR=true
                    feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/custom-rdma.available=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.4.0-146-generic
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=4
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/network-sriov.capable=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10de.sriov.capable=true
                    feature.node.kubernetes.io/pci-1a03.present=true
                    feature.node.kubernetes.io/pci-8086.present=true
                    feature.node.kubernetes.io/pci-8086.sriov.capable=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=dgx
                    kubernetes.io/os=linux
                    nvidia.com/cuda.driver.major=525
                    nvidia.com/cuda.driver.minor=85
                    nvidia.com/cuda.driver.rev=12
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=0
                    nvidia.com/gfd.timestamp=1680714586
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=4
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=DGX-Station-A100-920-23487-2530-0R0
                    nvidia.com/gpu.memory=40960
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.strategy=single
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"dgx"}
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CLZERO,cpu-cpuid.CPBOOST,cpu-cpuid.FMA3,cpu-cpuid.IBS,cpu-cpuid.IBSBR...
                    nfd.node.kubernetes.io/worker.version: v0.10.1
                    node.alpha.kubernetes.io/ttl: 0
                    nvidia.com/gpu-driver-upgrade-enabled: true
                    projectcalico.org/IPv4Address: 172.16.3.2/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.251.129
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 05 Apr 2023 17:59:53 +0100
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  dgx
  AcquireTime:     <unset>
  RenewTime:       Wed, 05 Apr 2023 18:10:15 +0100
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 05 Apr 2023 18:00:42 +0100   Wed, 05 Apr 2023 18:00:42 +0100   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 05 Apr 2023 18:10:16 +0100   Wed, 05 Apr 2023 17:59:53 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 05 Apr 2023 18:10:16 +0100   Wed, 05 Apr 2023 17:59:53 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 05 Apr 2023 18:10:16 +0100   Wed, 05 Apr 2023 17:59:53 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 05 Apr 2023 18:10:16 +0100   Wed, 05 Apr 2023 18:00:25 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  172.16.3.2
  Hostname:    dgx
Capacity:
  cpu:                128
  ephemeral-storage:  1843269236Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             528018228Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  1698756925085
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             527915828Ki
  nvidia.com/gpu:     4
  pods:               110
System Info:
  Machine ID:                 fe86cb15be594307a11cc9847b0eb5c2
  System UUID:                21af0608-1dd2-11b2-9c02-f24e4f55ad5c
  Boot ID:                    8856cf0b-f468-4533-a116-69cb5e55cd4e
  Kernel Version:             5.4.0-146-generic
  OS Image:                   Ubuntu 20.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.16
  Kubelet Version:            v1.26.3
  Kube-Proxy Version:         v1.26.3
PodCIDR:                      192.168.1.0/24
PodCIDRs:                     192.168.1.0/24
Non-terminated Pods:          (10 in total)
  Namespace                   Name                                                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                           ------------  ----------  ---------------  -------------  ---
  calico-system               calico-node-xxglb                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         10m
  calico-system               csi-node-driver-85l8g                                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         10m
  gpu-operator                gpu-feature-discovery-66tqt                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m41s
  gpu-operator                gpu-operator-1680714348-node-feature-discovery-worker-65n8z    0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m28s
  gpu-operator                gpu-operator-66874759cd-rfgb6                                  200m (0%)     500m (0%)   100Mi (0%)       350Mi (0%)     4m28s
  gpu-operator                nvidia-dcgm-exporter-cxt85                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m42s
  gpu-operator                nvidia-device-plugin-daemonset-j2wmg                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m42s
  gpu-operator                nvidia-mig-manager-74h6j                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         13s
  gpu-operator                nvidia-operator-validator-246wq                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m42s
  kube-system                 kube-proxy-hjlbj                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         10m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (0%)   500m (0%)
  memory             100Mi (0%)  350Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:
  Type    Reason                   Age                From             Message
  ----    ------                   ----               ----             -------
  Normal  Starting                 10m                kube-proxy       
  Normal  NodeHasSufficientMemory  10m (x8 over 10m)  kubelet          Node dgx status is now: NodeHasSufficientMemory
  Normal  RegisteredNode           10m                node-controller  Node dgx event: Registered Node dgx in Controller

when working, kubectl get pods -A shows

NAMESPACE          NAME                                                              READY   STATUS      RESTARTS   AGE
calico-apiserver   calico-apiserver-86f7d9859f-qd8l6                                 1/1     Running     0          58m
calico-apiserver   calico-apiserver-86f7d9859f-sb6dt                                 1/1     Running     0          58m
calico-system      calico-kube-controllers-6bb86c78b4-hhbgj                          1/1     Running     0          59m
calico-system      calico-node-l89c6                                                 1/1     Running     0          59m
calico-system      calico-node-xxglb                                                 1/1     Running     0          12m
calico-system      calico-typha-6d655d787d-2n9jf                                     1/1     Running     0          59m
calico-system      csi-node-driver-85l8g                                             2/2     Running     0          12m
calico-system      csi-node-driver-nd6d4                                             2/2     Running     0          59m
gpu-operator       gpu-feature-discovery-66tqt                                       1/1     Running     0          5m47s
gpu-operator       gpu-operator-1680714348-node-feature-discovery-master-5cbdmv29s   1/1     Running     0          6m34s
gpu-operator       gpu-operator-1680714348-node-feature-discovery-worker-65n8z       1/1     Running     0          6m34s
gpu-operator       gpu-operator-66874759cd-rfgb6                                     1/1     Running     0          6m34s
gpu-operator       nvidia-cuda-validator-zlrtf                                       0/1     Completed   0          5m33s
gpu-operator       nvidia-dcgm-exporter-cxt85                                        1/1     Running     0          5m48s
gpu-operator       nvidia-device-plugin-daemonset-j2wmg                              1/1     Running     0          5m48s
gpu-operator       nvidia-device-plugin-validator-fwqmw                              0/1     Completed   0          2m30s
gpu-operator       nvidia-mig-manager-74h6j                                          1/1     Running     0          2m19s
gpu-operator       nvidia-operator-validator-246wq                                   1/1     Running     0          5m48s
kube-system        coredns-787d4945fb-2gzjg                                          1/1     Running     0          61m
kube-system        coredns-787d4945fb-vcxs9                                          1/1     Running     0          61m
kube-system        etcd-gsrv                                                         1/1     Running     0          61m
kube-system        kube-apiserver-gsrv                                               1/1     Running     0          61m
kube-system        kube-controller-manager-gsrv                                      1/1     Running     0          61m
kube-system        kube-proxy-hjlbj                                                  1/1     Running     0          12m
kube-system        kube-proxy-zwlrb                                                  1/1     Running     0          61m
kube-system        kube-scheduler-gsrv                                               1/1     Running     0          61m
tigera-operator    tigera-operator-5d6845b496-vk27g                                  1/1     Running     0          59m

Note: when restarting, the “node-feature-discovery-worker” pod might go into an unstable state; I ended up deleting it so the master could allocate a new one (but then it seems to go into a loop: Error -> CrashLoopBackOff -> Running)
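
deleting it looks like this (pod name taken from the listing above):

kubectl delete pod -n gpu-operator gpu-operator-1680714348-node-feature-discovery-worker-65n8z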

then I uninstalled the gpu operator helm chart

helm uninstall --namespace gpu-operator gpu-operator-1680714348

and then reinstalled it

helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
      nvidia/gpu-operator \
      --set driver.enabled=false \
      --set toolkit.enabled=false

but I still kept getting the previous error

PS.

helper pod: with this we can start a pod on a specific node (e.g. gsrv or dgx) and run commands

kubectl run -it --rm --restart=Never --image=busybox --overrides='{"apiVersion": "v1", "spec": {"nodeName": "gsrv"}}' network-test -- sh

you can pull an ubuntu container too

kubectl run -it --rm --restart=Never --image=ubuntu:20.04 --overrides='{"spec": {"nodeName": "dgx"}}' network-test -- /bin/sh

I noticed that name resolution is intermittent

I think this is why the node feature discovery pod is failing (maybe gsrv, my control plane node, is not powerful enough?)

to test this, I ran traceroute gpu-operator-1680775130-node-feature-discovery-master.gpu-operator.svc.cluster.local from one pod on gsrv and another on dgx; it did not work every time, so the issue looked intermittent. The output below is from an ubuntu test pod spun up on the dgx:

# hostname
network-test-on-dgx
# traceroute gpu-operator-1680775130-node-feature-discovery-master.gpu-operator.svc.cluster.local
traceroute: unknown host
# traceroute gpu-operator-1680775130-node-feature-discovery-master.gpu-operator.svc.cluster.local
traceroute to gpu-operator-1680775130-node-feature-discovery-master.gpu-operator.svc.cluster.local (10.106.146.91), 64 hops max
  1   172.16.3.2  0.003ms  0.001ms  0.001ms 
  2   172.16.3.1  0.891ms  0.495ms  0.411ms 
  3   164.39.255.82  7.398ms  7.132ms  6.870ms 
  4   *  *  * 
  5   *  *  * 
  6   *  *  * 
  7   * 

more debugging, with a dnsutils pod pinned to gsrv

apiVersion: v1
kind: Pod
metadata:
  name: dnsutils-gsrv
  namespace: default
spec:
  nodeName: gsrv
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3
    command:
      - sleep
      - "infinity"
    imagePullPolicy: IfNotPresent
  restartPolicy: Never

and then

kubectl exec -it dnsutils-gsrv -- /bin/bash

then

traceroute  gpu-operator-1680775130-node-feature-discovery-master.gpu-operator.svc.cluster.local

or (notice the trailing dot, the root of the DNS hierarchy)

nslookup gpu-operator-1680775130-node-feature-discovery-master.gpu-operator.svc.cluster.local
nslookup gpu-operator-1680775130-node-feature-discovery-master.gpu-operator.svc.cluster.local.
fully qualified names (ending with a dot) resolve faster and nslookup returns quickly; in our case nslookup won't even return anything if we don't put the dot at the end
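
this is down to how pod DNS is configured; a typical pod /etc/resolv.conf under kubeadm/CoreDNS defaults looks like this (10.96.0.10 is the kubeadm default service IP, not read from this cluster):

# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

with ndots:5, any name containing fewer than five dots is tried against each search domain before being tried as-is; the trailing dot skips that walk entirely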

