vGPU/SR-IOV GPU

  • Related issues: #1661 vGPU Support

Pre-requisite: Enable PCI devices

  1. Create a Harvester cluster in bare-metal mode. Ensure one of the nodes has a NIC separate from the management NIC
  2. Go to the management interface of the new cluster
  3. Go to Advanced -> PCI Devices
  4. Validate that the PCI devices aren’t enabled
  5. Click the link to enable PCI devices
  6. Enable PCI devices in the linked addon page
  7. Wait for the status to change to Deploy Successful
  8. Navigate to the PCI devices page
  9. Validate that the PCI devices page is populated (or populating) with PCI devices; the same state can also be checked from the CLI, as shown below
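
A minimal CLI sketch for the check above, assuming kubectl access to the cluster (the addon name, namespace, and CRD names match recent Harvester releases and may differ in yours):

# The pcidevices-controller addon should report enabled: true
kubectl get addons.harvesterhci.io -n harvester-system pcidevices-controller -o yaml

# Once the controller is deployed, PCIDevice resources are created for each node
kubectl get pcidevices.devices.harvesterhci.io | head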

Pre-requisite: Enable vGPU

This can only be run on a bare-metal Harvester cluster that has an Nvidia card that supports vGPU. You will also need the Nvidia KVM driver and the Nvidia grid installer. These can be downloaded from Nvidia through your partner portal as outlined here

  1. After the PCI devices are enabled navigate to the nvidia-driver-toolkit addon and enable it
  2. Wait for the status to change to Deploy Successful
  3. Edit the config for the nvidia-driver-toolkit from the addons page and set the driver location for the KVM driver
  4. Navigate to SR-IOV GPU Devices
  5. Wait for the GPU to show up in the list
  6. Enable the GPU
  7. Wait for it to show as enabled and populate with the vGPU devices
  8. Navigate to vGPU devices
  9. Enable one of the vGPU devices and select a profile
  10. Validate that it now shows as enabled
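
The GPU and vGPU state can also be verified from the CLI; a minimal sketch (the CRD names follow the Harvester pcidevices controller and may differ by version):

# SR-IOV capable GPUs discovered on the nodes, with their enabled state
kubectl get sriovgpudevices.devices.harvesterhci.io

# vGPU devices created after the GPU is enabled; enabled entries show the selected profile
kubectl get vgpudevices.devices.harvesterhci.io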

Test Cases

These tests were run on Ubuntu Focal KVM live images

The setup for the VMs was as follows:

# Install the CUDA keyring, the CUDA toolkit, and build tools
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda nvidia-cuda-toolkit build-essential

# Build and run the tensor-core GEMM sample used by the test cases below
git clone https://github.com/nvidia/cuda-samples
cd cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm
make
./cudaTensorCoreGemm
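
Once the grid guest driver has been installed in the test cases below, the vGPU can be sanity-checked from inside the guest before running the sample; a minimal sketch (exact output varies by driver version and profile):

# The vGPU should appear as an NVIDIA PCI device inside the guest
lspci | grep -i nvidia

# nvidia-smi should list the assigned vGPU profile and report VGPU virtualization mode
nvidia-smi
nvidia-smi -q | grep -i "virtualization mode"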

Add one vGPU on VM creation

  1. Create a VM and select the vGPU in vGPU devices
  2. After the VM is created run the pre-reqs as outlined above
  3. Run the grid installer for the vGPU driver on the VM
  4. Run the cudaTensorCoreGemm sample built during VM setup (under cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm)

Expected Results

  1. The VM should create successfully
  2. The code should execute successfully
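
The attachment can also be confirmed from the cluster side; a minimal sketch, assuming a VM named vgpu-test in the default namespace (both names are examples):

# Show the vGPU entries KubeVirt passes to the guest
kubectl get vm vgpu-test -n default \
  -o jsonpath='{.spec.template.spec.domain.devices.gpus}{"\n"}'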

Add one vGPU after VM creation

  1. Create a VM and don’t select the vGPU in vGPU devices
  2. After the VM is created run the pre-reqs as outlined above
  3. Edit the config of the VM, add the vGPU, and choose to restart the VM
  4. Run the grid installer for the vGPU driver on the VM
  5. Run the cudaTensorCoreGemm sample built during VM setup (under cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm)

Expected Results

  1. The VM should create successfully
  2. The code should execute successfully

Remove vGPU from VM

  1. Create a VM and select the vGPU in vGPU devices
  2. After the VM is created run the pre-reqs as outlined above
  3. Run the grid installer for the vGPU driver on the VM
  4. Run the cudaTensorCoreGemm sample built during VM setup (under cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm)
  5. Edit the VM config, remove the vGPU, and choose to restart the VM
  6. Run the cudaTensorCoreGemm sample again

Expected Results

  1. The VM should create successfully
  2. The code should execute successfully
  3. After removal of the vGPU, the code should report that there aren't any CUDA-compatible cards available
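
A quick way to confirm the device is gone from inside the guest (the exact error text varies by driver and CUDA version):

# With the vGPU removed, no NVIDIA device should be visible and nvidia-smi should fail
lspci | grep -i nvidia || echo "no NVIDIA device visible"
nvidia-smi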

Disable vGPU device that is in use

This is to be run on a vGPU device that is currently assigned to a VM

  1. From the vGPU devices page, disable the vGPU device that is in use

Expected Results

  1. You should get an error stating that the vGPU is in use

Add two vGPUs to VM on creation

There are some limitations to this, which are outlined here

  1. Create a VM and select two vGPUs in vGPU devices
  2. Edit the YAML of the VM and add the following to one of the vGPUs (a kubectl sketch of making this edit follows this list)
virtualGPUOptions:
  display:
    ramFB:
      enabled: false
  3. After the VM is created run the pre-reqs as outlined above
  4. Run the grid installer for the vGPU driver on the VM
  5. Run the cudaTensorCoreGemm sample built during VM setup (under cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm)
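
The YAML edit in step 2 can also be made with kubectl; a minimal sketch, assuming a VM named two-vgpu-test in the default namespace (both names are examples) and the standard KubeVirt layout spec.template.spec.domain.devices.gpus:

# Inspect the gpus entries Harvester added for the two vGPUs
kubectl get vm two-vgpu-test -n default -o yaml | grep -A 8 'gpus:'

# Disable the ramFB display on the second vGPU entry (index 1)
kubectl patch vm two-vgpu-test -n default --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/domain/devices/gpus/1/virtualGPUOptions",
   "value": {"display": {"ramFB": {"enabled": false}}}}
]'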

Expected Results

  1. The VM should create successfully
  2. The code should execute successfully

Negative: try to provision a VM with an already-in-use vGPU

vGPUs are allocated from a shared pool of resources, so the easiest way to test this is to have only one vGPU device enabled (a CLI check of the existing allocation is sketched after the expected results)

  1. Create a VM and try to select the vGPU in vGPU devices

Expected Results

  1. No vGPU device should show up in the dropdown
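
Because every VM draws from the same vGPU pool, you can confirm from the CLI that the only enabled vGPU is already claimed; a minimal sketch:

# List the GPU device names requested by every VM; the enabled vGPU should already appear under the first VM
kubectl get vms -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{": "}{.spec.template.spec.domain.devices.gpus[*].deviceName}{"\n"}{end}'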

Negative: try to enable more vGPUs than the card supports

This will vary based on which card you have, since the driver exposes the available vGPU types and their remaining instance counts through the /sys tree (a sketch of checking this follows the expected results)

  1. Attempt to enable a vGPU device after the card is fully provisioned

Expected Results

  1. There should be no profiles available in the dropdown
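
Remaining capacity per profile can be checked directly on the host through the /sys tree; a minimal sketch (the PCI address is an example and the exact layout depends on the card and driver):

# On the Harvester node hosting the GPU, list the vGPU types offered by one
# virtual function and how many instances of each are still available
VF=0000:41:00.4
for t in /sys/bus/pci/devices/$VF/mdev_supported_types/*; do
  echo "$(cat "$t"/name): $(cat "$t"/available_instances) available"
done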

Negative: try to enable vGPU on a card that doesn't support it

This should be run on a server that has an Nvidia GPU that doesn't support vGPU

  1. After the PCI devices are enabled navigate to the nvidia-driver-toolkit addon and enable it
  2. Wait for the status to change to Deploy Successful
  3. Edit the config for the nvidia-driver-toolkit from the addons page and set the driver location for the KVM driver
  4. Navigate to SR-IOV GPU Devices

Expected Results

  1. The GPU should not be listed in SR-IOV GPU Devices