Notes from Deploying NAI 2.6 on Bare-Metal NKP with Everpure Storage [OSS Agentic Coding Part 1]

I’ve worked with Nutanix Enterprise AI (NAI) a lot over the last few months. I’ve deployed it across several Nutanix Kubernetes Platform (NKP) architectures: VMs on Nutanix HCI, VMs on Nutanix Cloud Platform with External Storage, and bare-metal Ubuntu backed by Everpure. This post is about the last one, which is the most unusual of the three and the platform for my broadest set of experiments.

This will be the first in a series of posts about my work with this cluster. In this post I will focus on the architecture and initial set-up. Later I will cover post-deployment activities including break/fix troubleshooting, model deployment, tuning, and other tips and tricks. In the final post I will cover building up an open-source gstack-style agentic coding setup driven by OpenCode and leveraging Qwen3.6-27B-FP8, gemma-4-31B-it and gpt-oss-120b running on NAI.

Why did I choose this architecture? I needed to test NAI + NKP + GPU pass-through. The only GPU nodes I had available didn’t have M.2 boot drives, but I did have access to an Everpure array. Since NKP works with bare-metal Linux, and NAI just needs Kubernetes, this seemed like sufficient hardware to make it work.

Architectural diagram showing the configuration of NAI on NKP with storage from Everpure.

The Architecture

This is what I built. NKP 2.17 is deployed as a converged management pod with the control plane nodes running as VMs on another lab cluster and the GPU hosts as workers. NAI 2.6.0 was deployed from the NKP Applications marketplace which was slick. Storage is a single Pure FlashArray, accessed via the Portworx CSI (PX-CSI) driver in two modes: FlashArray Direct Access for RWO block volumes (PostgreSQL, ClickHouse, Prometheus) and FA File Services for the RWX NFS share that holds the models.

During this deployment I learned a few lessons related to the storage configuration. I will cover that part in detail so I remember for next time and so others can leverage those lessons while performing similar deployments.

Choosing a CSI driver

My past NAI deployments all used the Nutanix CSI driver, which works great… if you have Nutanix storage. I wasn’t sure what to use this time. Longhorn CSI was installed to access local storage from the nodes, but research (i.e., testing) showed it isn’t a generic CSI either and is for Longhorn storage specifically. LLMs (Gemini, ChatGPT and Claude Opus) all told me to use Pure Service Orchestrator (PSO), and Google agreed. I tried PSO but it didn’t work, and this didn’t seem like a situation where I should have to spend a lot of time troubleshooting. I asked one more LLM, Grok this time. It had recently come out with a new “society of mind” multi-agent architecture and it finally cleared things up: Pure deprecated PSO in favor of PX-CSI after the Portworx acquisition. This was only a couple of months ago, but all of these LLMs have released 1-2 minor version updates since then, so your results may vary. Then again, the first two organic Google hits for “pure csi” are still PSO, so, maybe not:

Google results for “pure csi” still find PSO

Configuring PX-CSI for NAI

Here is what I had to do to successfully configure PX-CSI for unified block and file (UBF) for NAI:

Install PX-CSI following the instructions
Force PX-CSI to quit trying to create FlashBlade StorageClasses:

PX-CSI’s autodiscovery appears to have a baked-in assumption that if you’re doing NFS, you have a FlashBlade. It auto-creates the storage classes px-fb-direct-access-nfsv3 and px-fb-direct-access-nfsv4, which kept getting set as the default for NFS. To get around this I deleted them and disabled the autodiscovery feature:

kubectl delete sc px-fb-direct-access-nfsv3 px-fb-direct-access-nfsv4
kubectl edit storagecluster -n portworx px-cluster
# In the editor, add under metadata.annotations:
#   portworx.io/disable-storage-class: "true"
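
The same change can probably be made without an interactive edit. The following is a minimal sketch, not something taken from the PX-CSI documentation: the function name is made up for illustration, it assumes the annotation behaves the same when set with kubectl annotate as when added by hand, and setting the annotation before deleting the classes is my own ordering choice so that nothing recreates them in between.

```shell
#!/usr/bin/env bash
# Sketch: disable PX-CSI's automatic StorageClass creation non-interactively.
# The function name is hypothetical. Namespace and cluster name default to the
# values used in this post; pass your own as arguments if they differ.
disable_px_default_scs() {
  local namespace="${1:-portworx}"
  local cluster="${2:-px-cluster}"

  # Set the annotation first (assumption: this stops further auto-creation).
  kubectl annotate storagecluster "$cluster" -n "$namespace" \
    portworx.io/disable-storage-class="true" --overwrite

  # Then remove the FlashBlade classes that already exist.
  kubectl delete sc px-fb-direct-access-nfsv3 px-fb-direct-access-nfsv4 \
    --ignore-not-found
}
```

Calling disable_px_default_scs with no arguments targets the px-cluster StorageCluster in the portworx namespace, as in the commands above.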

Manually create FA StorageClasses (SC)

With autodiscovery disabled, I needed to manually create my two FA SCs:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-fa-direct-access
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: pxd.portworx.com
parameters:
  backend: pure_block
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate

I made px-fa-direct-access the default SC.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-fa-nfs
provisioner: pxd.portworx.com
parameters:
  backend: "pure_fa_file"
  pure_fa_file_system: "lab-01-nfs"
  pure_nfs_policy: "lab-01nfs-export"
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
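
Before pointing NAI at these classes, it may be worth confirming that each one can actually provision a volume. The sketch below is one way to do that, under assumptions: the helper name and the PVC names are invented for illustration, and the 1Gi default size is arbitrary. Because both classes use Immediate binding, a test PVC should reach Bound without any pod consuming it.

```shell
#!/usr/bin/env bash
# Sketch: render a throwaway PVC manifest for smoke-testing a StorageClass.
# render_test_pvc is a hypothetical helper, not part of any Portworx tooling.
render_test_pvc() {
  local name="$1" storage_class="$2" access_mode="$3" size="${4:-1Gi}"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
spec:
  storageClassName: ${storage_class}
  accessModes:
    - ${access_mode}
  resources:
    requests:
      storage: ${size}
EOF
}

# Usage (requires cluster access; PVC names are placeholders):
#   render_test_pvc sc-test-block px-fa-direct-access ReadWriteOnce | kubectl apply -f -
#   render_test_pvc sc-test-nfs   px-fa-nfs           ReadWriteMany | kubectl apply -f -
#   kubectl get pvc sc-test-block sc-test-nfs     # both should show Bound
#   kubectl delete pvc sc-test-block sc-test-nfs
```

The block class is tested with ReadWriteOnce and the NFS class with ReadWriteMany, matching how the two classes are used in this deployment.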

Create SCs (plural) required for NAI

The NAI deployment instructions for NKP are very explicit about the requirement for a storage class named “nai-nfs-storage”:

Screenshot of NAI 2.6.0 documentation showing requirements for nai-nfs-storage

So I created a copy of px-fa-nfs with that name. Less obvious, but to be fair still documented in the “Nutanix Enterprise AI Configuration Parameters for the nai-operators Helm Chart” section of the documentation, is that ClickHouse Keeper and ClickHouse Server both have a storageClass parameter that defaults to “nutanix-volume”. I found this out after I tried to enable NAI, when the ClickHouse PVCs stayed Pending with the error ‘failed to create Directory (400): Msg1: File system does not exist.’ I resolved this by creating a copy of the px-fa-direct-access SC called nutanix-volume. In theory I could instead have updated nai-clickhousekeeper.clickhouseKeeper.storage.storageClass and nai-clickhouseserver.clickhouse.storage.storageClass, but I didn’t want to run into any more surprises. I suspect this SC is created by default when you run NKP on NCP.
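
Kubernetes has no rename operation for a StorageClass, so a "copy" is simply a second object with the same provisioner and parameters under a new name. Here is a rough sketch of generating those copies. The helper name is hypothetical, the parameter values are the ones from the manifests above, and the default-class annotation is deliberately left out so the copies don't compete with px-fa-direct-access as the default.

```shell
#!/usr/bin/env bash
# Sketch: render a PX-CSI StorageClass under an arbitrary name.
# render_px_sc is a hypothetical helper. Arguments after the backend are
# emitted verbatim as extra "key: value" lines under parameters.
render_px_sc() {
  local name="$1" backend="$2"
  shift 2
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${name}
provisioner: pxd.portworx.com
parameters:
  backend: "${backend}"
EOF
  local param
  for param in "$@"; do
    printf '  %s\n' "$param"
  done
  cat <<EOF
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF
}

# Usage (requires cluster access):
#   render_px_sc nai-nfs-storage pure_fa_file \
#     'pure_fa_file_system: "lab-01-nfs"' \
#     'pure_nfs_policy: "lab-01nfs-export"' | kubectl apply -f -
#   render_px_sc nutanix-volume pure_block | kubectl apply -f -
```

Generating both copies from one template keeps them in step with the original classes, which matters because a StorageClass's parameters can't be edited after creation.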

So finally, kubectl get sc showed this:

NAME                            PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nai-nfs-storage                 pxd.portworx.com     Delete          Immediate           true                   71d
nutanix-volume                  pxd.portworx.com     Delete          Immediate           true                   71d
px-fa-direct-access (default)   pxd.portworx.com     Delete          Immediate           true                   71d
px-fa-nfs                       pxd.portworx.com     Delete          Immediate           true                   71d
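
A check like this could be run before enabling NAI rather than after. As a sketch, the function below (a name I made up) reads the output of kubectl get sc -o name on stdin and reports which of the expected class names are missing. The two names to check are the ones NAI expected in this deployment.

```shell
#!/usr/bin/env bash
# Sketch: report required StorageClass names that are absent.
# Reads `kubectl get sc -o name` output on stdin, prints each missing name,
# and returns non-zero if anything is missing. Hypothetical helper.
missing_storage_classes() {
  local existing required missing=0
  existing="$(cat)"
  for required in "$@"; do
    # -x matches the whole line, -F treats the dots literally.
    if ! printf '%s\n' "$existing" |
        grep -qxF "storageclass.storage.k8s.io/${required}"; then
      echo "$required"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (requires cluster access):
#   kubectl get sc -o name | missing_storage_classes nai-nfs-storage nutanix-volume
```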

Then I deleted the pending pods and the install completed:

kubectl delete pod -n nai-system --field-selector=status.phase=Pending --force --grace-period=0

While working on CSI, I noticed the iSCSI interfaces on Everpure were set to use jumbo frames:

Screenshot of Everpure interface setting

I decided to match this by setting the MTU on my worker nodes to 9000, in line with Pure’s performance recommendation.
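
Setting the MTU on the nodes only helps if every hop between node and array also passes jumbo frames. One common way to verify that on Linux is a ping with fragmentation prohibited and a payload sized to fill the MTU exactly. The sketch below only computes that payload size; the interface name and target address in the usage comments are placeholders, not values from this lab.

```shell
#!/usr/bin/env bash
# Sketch: largest ICMP echo payload that fits in one unfragmented IPv4 packet.
# That is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Usage on a worker node (placeholders in angle brackets):
#   cat /sys/class/net/<iface>/mtu
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 9000)" <array-iscsi-ip>
# If a hop in the path has a smaller MTU, the ping fails instead of fragmenting.
```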

Until next time…

Bare-metal NKP + external storage works great for NAI, but the NAI on NKP deployment instructions may assume NKP on NCP. It might be worth your time to double-check the NAI Configuration Parameters before you start, and to pay close attention to your storage classes.

In part 2 of this series, I’ll cover model deployment, endpoint deployment, and post-deployment tasks using both pre-validated models and the “Import Models > From Hugging Face Model Hub” feature.
