
Installing Kubernetes on DGX using Kubeadm

Created : 09/03/2023 | on Linux dgx 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Updated : 09/03/2023 | on Linux dgx 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Status: Draft

Writeup

“A Kubernetes cluster is composed of master nodes and worker nodes. The master nodes run the control plane components.” These components include:

  • API server (the front end of the control plane, which the kubectl CLI talks to),
  • etcd (stores the cluster state)
  • Scheduler
  • Controller Manager

Control plane components can have some impact on CPU-intensive tasks and, conversely, CPU- or disk-intensive tasks can have a high impact on your control plane components.

This document is an adaptation of the NVIDIA data-center Kubernetes setup guide.

Control plane components

Ideally, use CPU-only (GPU-free) master nodes to run the control plane components.

kubeadm prerequisites

  • Check that the required network adapters and ports are available (and, if your master and worker nodes are in different network segments, the relevant firewall rules too), e.g. with a simple port check (see the sketch after this list).
  • Disable swap on the nodes so the kubelet can work correctly.
  • Install a container runtime such as Docker, containerd or CRI-O.
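
A minimal sketch of those checks, assuming the default API server port 6443 and a standard Ubuntu host (the master IP is a placeholder):

# check that the API server port is reachable from the worker
nc -zv <master-ip> 6443

# disable swap now and keep it disabled across reboots
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab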

For Ubuntu, follow this guide to install Kubernetes (up to initialising the cluster with kubeadm init).

  • If you are using a DGX, Docker is pre-installed, so I think it makes sense to use it rather than installing a different container runtime. To use Docker’s CRI, head to the cri-dockerd repo (https://github.com/Mirantis/cri-dockerd) and follow the instructions, or download the pre-built binaries (this one is for amd64, released in January 2023); a rough install sketch follows.
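
A rough install sketch for the pre-built package; the .deb filename is illustrative and the systemd unit names are my assumption based on the project's packaging, so verify against the repo's README:

# install the downloaded cri-dockerd package and enable its CRI socket
sudo dpkg -i cri-dockerd_<version>_amd64.deb
sudo systemctl enable --now cri-docker.socket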

Note: when initialising kubeadm, use the correct IP range (e.g. 192.168.0.0/16 or 172.16.1.0/24). I used 172.16.5.0/28, so I can have \(2^4 - 2 = 14\) host addresses (240-254), omitting the network and broadcast addresses.
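
If you want to sanity-check a CIDR before committing to it, something like ipcalc works (it may need installing first):

sudo apt install ipcalc          # if not already present
ipcalc 172.16.5.0/28             # prints the network, broadcast and usable host range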

Refer to kubeadm init page for more info.

  • Initialise kubeadm (I am using cri-dockerd and the network config 172.16.5.0/24).

If an instance is already running, try resetting it with:

sudo kubeadm reset --cri-socket=unix:///var/run/cri-dockerd.sock

Note: make sure any network policy previously set by Calico (if used) has been released; for a hard reset, check this. After resetting the policies, restart Docker.
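
For completeness, that restart is just:

sudo systemctl restart docker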

sudo kubeadm init --pod-network-cidr=172.16.5.0/24 --cri-socket=unix:///var/run/cri-dockerd.sock

You should get something like the output below (this is what I got):

[init] Using Kubernetes version: v1.26.2
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [gsrv kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.16.1.19]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [gsrv localhost] and IPs [172.16.1.19 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [gsrv localhost] and IPs [172.16.1.19 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 22.503248 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node gsrv as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node gsrv as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[bootstrap-token] Using token: ju10gj.8mu5ziozryur4dnc
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 172.16.1.19:6443 --token <a_token> \
	--discovery-token-ca-cert-hash sha256:<a_unique_hash_string> 


  • If you have an issue with multiple CRI endpoints, please refer to this Stack Overflow answer.
  • You can find more info on installing kubectl here.
  • If you want to verify the install, check here (a quick sanity check is shown below).
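
A quick sanity check could look like this (both are standard kubectl commands):

kubectl version --client    # the client binary is installed and on the PATH
kubectl cluster-info        # the control plane endpoints are reachable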

Pulling config images (I don’t fully understand this step yet; I’m just following the prompt and will hopefully update this as I learn more):

kubeadm config images pull

Output:

[config/images] Pulled registry.k8s.io/kube-apiserver:v1.26.2
[config/images] Pulled registry.k8s.io/kube-controller-manager:v1.26.2
[config/images] Pulled registry.k8s.io/kube-scheduler:v1.26.2
[config/images] Pulled registry.k8s.io/kube-proxy:v1.26.2
[config/images] Pulled registry.k8s.io/pause:3.9
[config/images] Pulled registry.k8s.io/etcd:3.5.6-0
[config/images] Pulled registry.k8s.io/coredns/coredns:v1.9.3

Now we try: kubectl get pods -A

output:

NAMESPACE     NAME                           READY   STATUS    RESTARTS   AGE
kube-system   coredns-787d4945fb-46m9v       0/1     Pending   0          137m
kube-system   coredns-787d4945fb-bf82h       0/1     Pending   0          137m
kube-system   etcd-gsrv                      1/1     Running   0          137m
kube-system   kube-apiserver-gsrv            1/1     Running   0          137m
kube-system   kube-controller-manager-gsrv   1/1     Running   0          137m
kube-system   kube-proxy-k95h8               1/1     Running   0          137m
kube-system   kube-scheduler-gsrv            1/1     Running   0          137m

Note: judging by the output of this command, apart from the "coredns-*" pods everything seems to be in the "Running" state; the CoreDNS pods stay Pending until a pod network (CNI plugin) is deployed. I believe at this point I can get things working with numerical hostnames instead of strings.

Because of the following lines in the kubeadm init output from earlier, I think the next step is to get some worker node(s) running.

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 172.16.1.19:6443 --token <a_token> \
	--discovery-token-ca-cert-hash sha256:<a_unique_hash_string> 

HELM

Now we install Helm (a package manager for Kubernetes).

#make install directory (in case it doesn't exist)
sudo mkdir -p /usr/local/bin

You can install Helm in many ways; I used the script method.
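
For reference, the script method is roughly the following; the URL is from memory, so verify it against the Helm docs before running:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh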

Once Helm is in place you can do a quick tutorial to understand how it works.

GPU Nodes

At this point I think we have a working Kubernetes control plane.

Nvidia Device plugin for Kubernetes

Kubernetes provides a device plugin framework that you can use to advertise system hardware resources to the kubelet. With this, hardware vendors such as NVIDIA can implement device plugins that can be installed either manually or as a DaemonSet. The DGX is an example of such a device.
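
As a hedged sketch, the NVIDIA device plugin can be deployed with Helm roughly like this (the repo URL, release name and namespace are taken from the plugin's README as I remember it, so double-check there):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin --create-namespace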

Kubernetes also provides an operator framework to help package, deploy and manage Kubernetes applications. The operator framework is essentially an extension of the support structure that is usually necessary for an application to operate in a working environment (i.e. hardware, OS, drivers, telemetry, health services and management).

NGC catalog link: NVIDIA GPU Operator
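
A hedged sketch of installing the GPU Operator with Helm; this assumes the nvidia Helm repo has already been added (which is done further down in this document), and the flags are from the NVIDIA docs as I remember them:

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator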


I think I should be able to make API calls to the master node and have them relayed to the worker nodes once the worker nodes are initialised (or joined with the token).

Enable the cri plugin on the DGX

To do this, comment out (or remove) the disabled_plugins = ["cri"] line in /etc/containerd/config.toml, as shown below:

/etc/containerd/config.toml 

# disabled_plugins = ["cri"]

#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0

#[grpc]
#  address = "/run/containerd/containerd.sock"
#  uid = 0
#  gid = 0

#[debug]
#  address = "/run/containerd/debug.sock"
#  uid = 0
#  gid = 0
#  level = "info"                
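
After editing the file, restart containerd and check that the cri plugin is no longer disabled; the check below assumes the ctr CLI that ships with containerd:

sudo systemctl restart containerd
sudo ctr plugins ls | grep cri    # the io.containerd.grpc.v1 cri plugin should report ok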

Using the token I created earlier (after sudo systemctl restart containerd on the DGX):

sudo kubeadm join [IP]:6443 --token <token> --discovery-token-ca-cert-hash <hash> 

output:

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Now we can check node visibility from the control plane with:

kubectl get nodes

I ran the command above and got the following output:

NAME   STATUS     ROLES           AGE   VERSION
dgx    NotReady   <none>          60m   v1.26.2
gsrv   NotReady   control-plane   8h    v1.26.2

Here dgx is the DGX station and gsrv is the CPU server.

Then add the NVIDIA Helm repo:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update

I labelled the dgx node with the following (run on the master node):

 kubectl label node dgx node-role.kubernetes.io/worker=worker
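
To confirm the label took effect:

kubectl get nodes --show-labels    # the worker label should appear on the dgx node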

What is next (awaiting support from nvidia)

  1. Do I have to install the NVIDIA Container Toolkit on the DGX (or is it already installed)?
  2. Do I have to install https://github.com/NVIDIA/k8s-device-plugin#deployment-via-helm on the control-plane (master) node or on the DGX worker node?

The main problem seems to be the network.

My end goal is to get this (TAO Toolkit with the REST API via Kubernetes) working.

Note: I tried installing Calico from its web instructions but didn’t see any changes; the nodes still show NotReady (see below).

Note: the OS version that Ansible reports from the CPU server (which acts as the master node) is Ubuntu 22.04.1 LTS,

hence when I run ansible localhost -m setup -a 'filter=ansible_distribution_version' on it I get

localhost | SUCCESS => {
    "ansible_facts": {
        "ansible_distribution_version": "22.04"
    },
    "changed": false
}

This leads to:

TASK [check os version] ***********************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
fatal: [172.16.1.19]: FAILED! => {
    "assertion": "ansible_distribution_version in ['18.04', '20.04']",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"
}

when I run this instruction.

Meanwhile, the cluster still looks like this:


g@gsrv:~/k8$ kubectl get nodes
NAME   STATUS     ROLES           AGE    VERSION
dgx    NotReady   worker          164m   v1.26.2
gsrv   NotReady   control-plane   10h    v1.26.2
g@gsrv:~/k8$ 


When I get more info with kubectl describe node dgx, I get:

g@gsrv:~/k8$ kubectl describe node dgx
Name:               dgx
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=dgx
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=worker
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 10 Mar 2023 17:27:08 +0000
Taints:             node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  dgx
  AcquireTime:     <unset>
  RenewTime:       Fri, 10 Mar 2023 20:14:16 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 10 Mar 2023 20:11:34 +0000   Fri, 10 Mar 2023 17:27:08 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 10 Mar 2023 20:11:34 +0000   Fri, 10 Mar 2023 17:27:08 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 10 Mar 2023 20:11:34 +0000   Fri, 10 Mar 2023 17:27:08 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Fri, 10 Mar 2023 20:11:34 +0000   Fri, 10 Mar 2023 17:27:08 +0000   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Addresses:
  InternalIP:  172.16.3.2
  Hostname:    dgx
Capacity:
  cpu:                128
  ephemeral-storage:  1843269236Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             528018224Ki
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  1698756925085
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             527915824Ki
  pods:               110
System Info:
  Machine ID:                 fe86cb15be594307a11cc9847b0eb5c2
  System UUID:                21af0608-1dd2-11b2-9c02-f24e4f55ad5c
  Boot ID:                    0f544fd5-41e8-4f48-b326-0c58d2e99fb9
  Kernel Version:             5.4.0-144-generic
  OS Image:                   Ubuntu 20.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.10
  Kubelet Version:            v1.26.2
  Kube-Proxy Version:         v1.26.2
Non-terminated Pods:          (2 in total)
  Namespace                   Name                                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                ------------  ----------  ---------------  -------------  ---
  kube-system                 kube-proxy-gfkkr                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         167m
  tigera-operator             tigera-operator-54b47459dd-qdwtt    0 (0%)        0 (0%)      0 (0%)           0 (0%)         28m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
Events:
  Type    Reason            Age                    From           Message
  ----    ------            ----                   ----           -------
  Normal  CIDRNotAvailable  2m48s (x39 over 167m)  cidrAllocator  Node dgx status is now: CIDRNotAvailable
g@gsrv:~/k8$ 

Resetting DGX

sudo kubeadm reset
W0313 15:23:17.052577   96842 preflight.go:56] [reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W0313 15:23:18.416975   96842 removeetcdmember.go:106] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] Deleted contents of the etcd data directory: /var/lib/etcd
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of directories: [/etc/kubernetes/manifests /var/lib/kubelet /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.

Setting up DGX as both the master and the GPU worker node

After learning that my CPU server running Ubuntu 22.04 is not compatible with the current TAO Kubernetes setup, I had to accept defeat and revert back to the original guide.

  • Disable swap
sudo swapoff -a
  • Initialise as a master (note I’ve used a different network segment)
 sudo kubeadm init --pod-network-cidr=172.16.5.0/24

output:

[init] Using Kubernetes version: v1.26.2
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [dgx kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.16.3.2]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [dgx localhost] and IPs [172.16.3.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [dgx localhost] and IPs [172.16.3.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[apiclient] All control plane components are healthy after 4.501125 seconds
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config" in namespace kube-system with the configuration for the kubelets in the cluster
[upload-certs] Skipping phase. Please see --upload-certs
[mark-control-plane] Marking the node dgx as control-plane by adding the labels: [node-role.kubernetes.io/control-plane node.kubernetes.io/exclude-from-external-load-balancers]
[mark-control-plane] Marking the node dgx as control-plane by adding the taints [node-role.kubernetes.io/control-plane:NoSchedule]
[bootstrap-token] Using token: hj5i5l.b03mf63sxolhn28n
[bootstrap-token] Configuring bootstrap tokens, cluster-info ConfigMap, RBAC Roles
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to get nodes
[bootstrap-token] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstrap-token] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 172.16.3.2:6443 --token <a.token> \
  --discovery-token-ca-cert-hash sha256:<a_hash> 

Note: the tokens will expire after some time; use kubeadm token list to check the currently valid tokens.

Then I ran these commands as requested:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Network stuff

The suggested command below fails

kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

Try this fix suggested by me.
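
For reference (this is not necessarily the fix linked above), the operator-based Calico install, which is what left the tigera-operator pod visible in the describe output earlier, looks roughly like this; the version tag in the URLs is an assumption, so check the Calico docs for the current one:

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.25.0/manifests/tigera-operator.yaml
curl -O https://raw.githubusercontent.com/projectcalico/calico/v3.25.0/manifests/custom-resources.yaml
# edit custom-resources.yaml so its CIDR matches the one passed to kubeadm init, then:
kubectl create -f custom-resources.yaml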

untaint the control plane

This is only so the DGX (now a master node, because we ran kubeadm init on it) can be used to schedule GPU pods.

kubectl taint nodes --all node-role.kubernetes.io/master-

Again, this command also fails,

so try removing the control-plane taint instead (the taint name changed from master to control-plane in recent Kubernetes versions), as suggested by this:

kubectl taint nodes --all node-role.kubernetes.io/control-plane-
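
You can confirm the taint is gone with:

kubectl describe node dgx | grep -i taints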

Now you may be able to list join tokens with

kubeadm token list

Create new tokens with:

kubeadm token create
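
If you want the full join command (token plus CA cert hash) printed in one step:

kubeadm token create --print-join-command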

For more info (e.g. joining new worker nodes to a master node), check this link.

TAO API SETUP

Prep: I created an SSH key pair and added it to the ssh-agent with ssh-add so I can use key authentication for the k8 setup.
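
A rough sketch of that prep; the key type and file path are the defaults and the comment string is illustrative:

ssh-keygen -t ed25519 -C "k8-admin"    # generate a key pair
eval "$(ssh-agent -s)"                 # start the agent if it isn't running
ssh-add ~/.ssh/id_ed25519              # add the private key to the agent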


Check the next topic

