AI/ML with VMware Cloud Director

AI/ML—short for artificial intelligence (AI) and machine learning (ML)—represents an important evolution in computer science and data processing that is quickly transforming a vast array of industries.

Why is AI/ML important?

It’s no secret that data is an increasingly important business asset, with the amount of data generated and stored globally growing at an exponential rate. Of course, collecting data is pointless if you don’t do anything with it, but these enormous floods of data are simply unmanageable without automated systems to help.

Artificial intelligence, machine learning, and deep learning give organizations a way to extract value from the troves of data they collect, delivering business insights, automating tasks, and advancing system capabilities. AI/ML has the potential to transform all aspects of a business by helping it achieve measurable outcomes, including:

  • Increasing customer satisfaction
  • Offering differentiated digital services
  • Optimizing existing business services
  • Automating business operations
  • Increasing revenue
  • Reducing costs

As modern applications proliferate, Cloud Providers need to meet growing customer demand for accelerated computing. These workloads typically run large volumes of simultaneous computations, a pattern that GPUs handle well.

Cloud Providers can now leverage vSphere support for NVIDIA GPUs and NVIDIA AI Enterprise, a cloud-native software suite for the development and deployment of AI that has been optimized and certified for VMware vSphere. This brings vSphere capabilities such as vMotion to GPU workloads managed from Cloud Director, and lets providers deliver multi-tenant GPU services, which are key to maximizing GPU resource utilization. With Cloud Director support for the NVIDIA AI Enterprise software suite, customers now have access to best-in-class, GPU-optimized AI frameworks and tools, and can run compute-intensive workloads, including artificial intelligence (AI) and machine learning (ML) applications, within their datacenters.

This solution takes advantage of NVIDIA MIG (Multi-Instance GPU), which spatially partitions a single physical device between workloads at the hardware level. That isolation matters in multi-tenant environments, where it drives better hardware utilization and higher margins. Cloud Director relies on hosts being pre-configured for GPU services: NVIDIA AI Enterprise includes the vGPU technology used to set up the hosts and their GPU profiles.
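To make the idea of spatial partitioning concrete, here is a minimal sketch of a capacity check for MIG instances on a single NVIDIA A100 40GB. The function name `fits_on_one_gpu` is hypothetical, and the check is deliberately simplified: it only sums compute and memory slices, whereas real MIG placement rules are stricter, so treat a `True` result as necessary rather than sufficient.

```python
# MIG profiles for an NVIDIA A100 40GB: name -> (compute slices, memory slices).
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits_on_one_gpu(requested):
    """Simplified check: do the requested MIG profiles fit on one A100 40GB?

    Assumption: only slice totals are compared against the device limits;
    the hardware's placement constraints are not modelled.
    """
    compute = sum(A100_40GB_PROFILES[name][0] for name in requested)
    memory = sum(A100_40GB_PROFILES[name][1] for name in requested)
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


# Seven small tenants can share one device.
print(fits_on_one_gpu(["1g.5gb"] * 7))
# Two 3g.20gb instances use all memory slices, so a third tenant does not fit.
print(fits_on_one_gpu(["3g.20gb", "3g.20gb", "1g.5gb"]))
```

Each instance gets its own slice of compute and memory, which is why one tenant's workload does not contend with another's on the same device.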

Customers can self-serve, manage, and monitor their GPU-accelerated hosts and virtual machines within Cloud Director. Cloud Providers can monitor NVIDIA vGPU allocation and usage per VDC and per VM (through the vCloud API and a UI dashboard) to optimize utilization. They can also meter NVIDIA vGPU usage, averaged over a unit of time per tenant, through the vCloud API for tenant billing.
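As an illustration of what "usage averaged over a unit of time per tenant" can mean, the sketch below computes a time-weighted average vGPU count per tenant for one billing window. The `VgpuSample` record and the `average_vgpu_usage` function are hypothetical names invented for this example; they are not vCloud API types, and the samples are made-up data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VgpuSample:
    """One hypothetical allocation sample (not a Cloud Director API type)."""
    tenant: str        # organization name
    timestamp: float   # seconds since the start of the billing window
    vgpu_count: int    # vGPUs allocated to the tenant from this time onward


def average_vgpu_usage(samples, window_end):
    """Return {tenant: time-weighted average vGPU count} over [0, window_end]."""
    by_tenant = defaultdict(list)
    for sample in samples:
        by_tenant[sample.tenant].append(sample)

    averages = {}
    for tenant, rows in by_tenant.items():
        rows.sort(key=lambda s: s.timestamp)
        gpu_seconds = 0.0
        # Each sample holds until the next sample, or until the window ends.
        for current, nxt in zip(rows, rows[1:] + [None]):
            end = window_end if nxt is None else nxt.timestamp
            gpu_seconds += current.vgpu_count * (end - current.timestamp)
        averages[tenant] = gpu_seconds / window_end
    return averages


# Made-up data: "acme" holds 2 vGPUs for 30 minutes, then 4 for 30 minutes.
samples = [
    VgpuSample("acme", 0, 2),
    VgpuSample("acme", 1800, 4),
    VgpuSample("globex", 0, 1),
]
print(average_vgpu_usage(samples, window_end=3600))
# → {'acme': 3.0, 'globex': 1.0}
```

Averaging over the window, rather than billing on peak allocation, means a tenant that scales up briefly pays only for the time the extra vGPUs were held.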

Provider Workflow

  • Add GPU devices to the ESXi hosts in vCenter and install the required drivers.
  • Verify that the vGPU profiles are visible in the vCD provider portal under Resources → Infrastructure Resources → vGPU Profiles.
  • (Optional) Edit each vGPU profile to give it a tenant-facing name and any necessary tenant-facing instructions.
  • Create a PVDC backed by one or more vCenter clusters that contain GPU hosts.
  • In the provider portal, go to Cloud Resources → vGPU Policies and create a new vGPU policy by following the wizard’s steps.

Tenant Workflow

A newly created vGPU policy is not visible to tenants. Publishing it to an organization VDC makes it available to them.

Once the policy is published, a tenant can select it when creating a new standalone VM or a VM from a template, editing a VM, adding a VM to a vApp, or creating a vApp from a vApp template. You cannot delete a vGPU policy that is available to tenants.

  • Publish the vGPU policy to one or more tenant VDCs, in the same way as sizing and placement policies.
  • Create a new VM or instantiate a VM from a template. In vGPU-enabled VDCs, tenants can now select a vGPU policy.
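The visibility and deletion rules described in this section can be summarized in a small toy model. This is only a sketch of the behaviour stated in the text: the `VgpuPolicy` class, its methods, and the policy and VDC names are hypothetical and do not correspond to Cloud Director's API.

```python
class VgpuPolicy:
    """Toy model of a vGPU policy's lifecycle (hypothetical, not the vCD API)."""

    def __init__(self, name):
        self.name = name
        self.published_to = set()  # names of organization VDCs

    def publish(self, org_vdc):
        """Make the policy selectable by tenants of the given organization VDC."""
        self.published_to.add(org_vdc)

    def visible_in(self, org_vdc):
        """A policy is visible to tenants only where it has been published."""
        return org_vdc in self.published_to

    def delete(self):
        """Deletion is refused while the policy is available to any tenant."""
        if self.published_to:
            raise RuntimeError(
                f"policy {self.name!r} is published to {sorted(self.published_to)}"
            )
        return True


policy = VgpuPolicy("example-vgpu-policy")
print(policy.visible_in("tenant-vdc-1"))  # not published yet, so not visible
policy.publish("tenant-vdc-1")
print(policy.visible_in("tenant-vdc-1"))  # visible after publishing
try:
    policy.delete()
except RuntimeError as err:
    print(f"delete refused: {err}")
```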

Cloud Director is not limited to VMs: providers can also use Cloud Director’s Container Service Extension to offer GPU-enabled Tanzu Kubernetes clusters.

Step-by-Step Configuration

The video below covers the step-by-step process of configuring the provider and tenant sides, as well as deploying TensorFlow with GPU support into a VM.
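Once TensorFlow is installed in a VM that has a vGPU policy applied, a quick check like the following can confirm that the guest actually sees the GPU. This is a minimal sketch rather than the procedure shown in the video; the helper name `visible_gpus` is hypothetical, and it assumes the NVIDIA guest driver is already installed in the VM.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see, or [] if there are none."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return []
    # list_physical_devices("GPU") returns one entry per GPU visible to TensorFlow.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```

An empty list usually points to a missing guest driver, a TensorFlow build without GPU support, or a VM that was created without a vGPU policy.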

5 thoughts on “AI/ML with VMware Cloud Director”

  1. Great video, thanks. One question: every demo I see has just one host in a cluster. I’ve tried it with 4 hosts where one has a GPU installed and, despite vCloud showing the NVIDIA icon next to the provider VDC, I cannot see any vGPU profiles listed.
    Do you know if it is a requirement to have all hosts in a cluster with GPUs installed?

    • Thanks! It actually worked! I’ve restarted vCloud and it’s all showing now. I had thought about it but didn’t actually do it, as I thought that if it was showing the NVIDIA icon it had probably propagated already.
