[{"content":" NOTE: This post was written by Sam\u0026rsquo;s agentic coding stack, not by Sam. Screenshots of the process are at the end. Introduction My agentic coding stack is set up for actual code projects, not blog writing. Having it write this post was just a fun way to show how it works. I wanted a concrete, end-to-end example of the architecture, the toolchain, and how the agents hand work off to each other.\nThis post documents how the stack is configured. I\u0026rsquo;ll walk through the components, the agent definitions, and the configuration that ties everything together. Then I\u0026rsquo;ll show how we built a reusable callout shortcode (this very box at the top) as a worked example of the workflow.\nBefore this you should be comfortable with Kubernetes, OpenCode\u0026rsquo;s agent model, and Hugo. If you haven\u0026rsquo;t read Notes from Deploying NAI 2.6 on Bare-Metal NKP or Adventures in Model Deployment and Tuning with Nutanix Enterprise AI, you might want to skim those first. They cover the infrastructure and model tuning this post builds on.\nThe Stack Here\u0026rsquo;s the end-to-end chain:\n1 2 3 4 5 6 Agent Definitions (.md) | You (chat UI) --\u0026gt; OpenCode (orchestrator) --\u0026gt; NAI vLLM endpoints --\u0026gt; GPU cluster | v Tools: Git + Hugo + filesystem I run three models across three lineages.\nModel Role Lineage gpt-oss-120b Planner/Reviewer OpenAI Qwen3.6-27B-FP8 Builder/Implementer Alibaba gemma-4-31B-it QA Google The builder (Alibaba, qwen) has its work checked by a reviewer (OpenAI, gpt-oss) and QA (Google, gemma). All three lineages participate; they just don\u0026rsquo;t map one-to-one onto plan/build/review.\nThis separation isn\u0026rsquo;t cosmetic. Different lineages have different failure modes. Having them cross-check each other catches things a single model would gloss over. 
Part 2 of the NAI series goes deep on the model selection, tuning, and the vLLM arguments each one needs.\nOpenCode Configuration OpenCode is the orchestrator that routes to the NAI endpoints. I\u0026rsquo;ve detailed the provider configuration, including interleaved reasoning and vLLM tuning, in Part 2 of the NAI series.\nAgents (opencode) Each agent is defined locally as a markdown file under the Opencode agents directory (~/.config/opencode/agents/ on Linux/macOS, %USERPROFILE%\\.config\\opencode\\agents\\ on Windows). They define who the agent is, what tools it has access to, and how it should behave. The ones I actually used for this post:\noffice-hours – runs first. Asks clarifying questions, builds a product spec, captures decisions. planner – takes the spec and produces a step-by-step implementation plan. builder – implements the plan: writes code, creates files, runs commands. reviewer – adversarial review of the diff before anything is declared done. qa – build verification and sanity checks. Agents are opinionated. They enforce structure. Without them the models tend to skip ahead to implementation and produce messy diffs.\nExample: planner.md An agent file is straightforward markdown with a YAML front-matter header that tells OpenCode how to route it. Here\u0026rsquo;s the planner:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 --- description: Plans features and architectures before code is written. Use first for any non-trivial change. 
mode: primary model: nai-demo/gpt-oss-120b temperature: 0.2 reasoning_effort: high permission: edit: deny write: deny bash: \u0026#34;*\u0026#34;: deny \u0026#34;ls *\u0026#34;: allow \u0026#34;cat *\u0026#34;: allow \u0026#34;head *\u0026#34;: allow \u0026#34;tail *\u0026#34;: allow \u0026#34;grep *\u0026#34;: allow \u0026#34;rg *\u0026#34;: allow \u0026#34;find *\u0026#34;: allow \u0026#34;fd *\u0026#34;: allow \u0026#34;tree *\u0026#34;: allow \u0026#34;wc *\u0026#34;: allow \u0026#34;git log*\u0026#34;: allow \u0026#34;git diff*\u0026#34;: allow \u0026#34;git status*\u0026#34;: allow \u0026#34;git show*\u0026#34;: allow \u0026#34;git blame*\u0026#34;: allow --- You are the lead planner. You design before code is written. ## First action, every session Read AGENTS.md if it exists. It defines project conventions, stack, and key files. Do not skip it. ## Your job Given a feature request, defect, or refactor: 1. Survey the relevant code (read-only — you cannot edit). 2. Identify the smallest cohesive change that solves it. 3. Produce a written plan with: goal, affected files, step-by-step approach, risks, test strategy. 4. Hand off to `@builder` to implement. ## Rules - Plan first, code never. You have no write/edit access by design. - Be skeptical of the request as stated. Ask \u0026#34;is the right thing being requested?\u0026#34; before \u0026#34;how do we build it?\u0026#34; - Prefer the smallest correct change. Greenfield rewrites are almost always wrong. - Identify when a request needs decomposition into multiple plans. - If the user supplies a plan, critique it before approving — do not rubber-stamp. ## Output format ## Goal \u0026lt;one sentence\u0026gt; ## Approach \u0026lt;2-5 sentences\u0026gt; ## Files to change - path/to/file.ext — what changes - ... ## Steps 1. ... 2. ... ## Tests \u0026lt;what proves this works\u0026gt; ## Risks \u0026lt;what could go wrong, what to watch\u0026gt; ## Hand-off @builder, implement the above. Read AGENTS.md first. 
## Skills available - `gstack-autoplan`, initial breakdown of large features into a plan tree - `gstack-plan-eng-review`, sanity-check your own plan before handing off to @builder - `gstack-plan-tune`, when an existing plan needs revision rather than replacement AGENTS.md Each project has an AGENTS.md that the agents read first. It defines:\nStack and toolchain (Hugo, PaperMod, etc.) Commands to run File naming and frontmatter conventions Voice or coding style rules Off-limits paths Here is the template I use:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 # AGENTS.md ## Stack \u0026lt;!-- language, framework, package manager, runtime version --\u0026gt; ## Commands - build: `\u0026lt;cmd\u0026gt;` - test: `\u0026lt;cmd\u0026gt;` - lint: `\u0026lt;cmd\u0026gt;` - typecheck: `\u0026lt;cmd\u0026gt;` - dev server: `\u0026lt;cmd\u0026gt;` ## Conventions \u0026lt;!-- naming, file layout, import style, anything non-obvious --\u0026gt; \u0026lt;!-- e.g. \u0026#34;all API routes live in src/routes/, named \u0026lt;resource\u0026gt;.ts\u0026#34; --\u0026gt; ## Key files \u0026lt;!-- path — what it is / why agents should know it --\u0026gt; \u0026lt;!-- e.g. src/config.ts — central config, read before touching env vars --\u0026gt; ## Off-limits \u0026lt;!-- paths or patterns agents must not modify without explicit approval --\u0026gt; \u0026lt;!-- e.g. migrations/ — never edit existing migration files --\u0026gt; This file is the project-level guardrail. It replaces the \u0026ldquo;system prompt\u0026rdquo; in traditional AI setups. The content is specific to the repo, not generic. For this site that means: first person, technical depth assumed, short paragraphs, no marketing language.\nBuilding the Callout Shortcode Now for the hands-on part. I needed a callout box at the top of the post to let readers know the agents actually wrote it, so we built a reusable Hugo shortcode as a concrete example of the workflow. 
The requirements:\nReusable across posts Accept a configurable header label (default: \u0026ldquo;NOTE:\u0026rdquo;) Work in both light and dark theme modes No modifications to the PaperMod submodule Shortcode Template We created layouts/shortcodes/callout.html:\n1 2 3 4 5 6 7 {{- $header := .Get \u0026#34;header\u0026#34; | default \u0026#34;NOTE:\u0026#34; -}} \u0026lt;aside class=\u0026#34;callout-box\u0026#34; role=\u0026#34;note\u0026#34; aria-label=\u0026#34;{{ $header }}\u0026#34;\u0026gt; \u0026lt;strong class=\u0026#34;callout-label\u0026#34;\u0026gt;{{ $header }}\u0026lt;/strong\u0026gt; \u0026lt;div class=\u0026#34;callout-body\u0026#34;\u0026gt; {{ .Inner | markdownify }} \u0026lt;/div\u0026gt; \u0026lt;/aside\u0026gt; The .Get \u0026quot;header\u0026quot; call makes the label user-configurable. I intentionally skipped safeHTML (I want Hugo to escape the header by default so a malicious value injected from a post can\u0026rsquo;t run script). The {{ .Inner | markdownify }} block captures whatever markdown sits between the shortcode tags and renders it explicitly.\nStyling CSS lives in assets/css/extended/custom.css, the designated override path. 
We used CSS custom properties so the colors adapt to light and dark mode automatically:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 /* Light mode defaults */ .callout-box { border: 1px solid var(--callout-border, #c8a960); background: var(--callout-bg, #faf3e0); padding: 1.25rem 1.5rem; margin: 1.5rem 0; border-radius: 6px; position: relative; } .callout-label { display: block; font-weight: 700; font-size: 0.85rem; text-transform: uppercase; letter-spacing: 0.04em; color: var(--callout-label, #6a4a00); margin-bottom: 0.5rem; } .callout-body { color: var(--callout-body, var(--content)); line-height: 1.6; } .callout-body p:last-child { margin-bottom: 0; } /* Dark mode overrides — scoped to PaperMod\u0026#39;s data-theme attribute */ :root[data-theme=\u0026#34;dark\u0026#34;] { --callout-border: #c8a040; --callout-bg: #2a2418; --callout-label: #f0d060; --callout-body: #d4be8e; } The amber palette was chosen because it has enough contrast in both modes without competing with the site\u0026rsquo;s Da Vinci color scheme.\nUsing It Drop the shortcode anywhere in a markdown post:\n\u0026#123;\u0026#123;\u003c callout header=\"NOTE:\" \u003e\u0026#125;\u0026#125; This post was written by Sam's agentic coding stack, not by Sam. \u0026#123;\u0026#123;\u003c /callout \u003e\u0026#125;\u0026#125; Or with a different label:\n\u0026#123;\u0026#123;\u003c callout header=\"WARNING:\" \u003e\u0026#125;\u0026#125; This is deprecated: use the new API instead. \u0026#123;\u0026#123;\u003c /callout \u003e\u0026#125;\u0026#125; The Workflow Here\u0026rsquo;s what happened in this session. I described the goal, and office-hours kicked in first, it asked about post scope, callout design preferences, color choices, implementation approach, and whether to capture screenshots of the process. 
Planner took the clarifications and produced a structured plan covering shortcode location, CSS variable strategy, post skeleton, and dark-mode adaptation. Builder read the plan, calibrated voice against existing posts, then created the shortcode, added CSS, and wrote the markdown.\nReviewer found two major issues: an XSS risk in the shortcode and accessibility/contrast problems in the CSS, plus a few medium ones like stale code snippets and empty front-matter fields. Builder fixed each finding one at a time, re-building after each change. QA ran a final build check and confirmed everything was clean.\nThe agents worked through a sequential handoff chain (office-hours, then planner, then builder, then reviewer, then qa), with some handoffs requiring manual re-framing. The callout and draft post were produced in minutes.\nConclusion The stack is a set of agents with different roles and a project-level guardrail file that keeps them aligned. I had them write this post as an exercise; the real value is the plan-build-review-qa loop for actual code projects. That structure catches mistakes that a single model would miss.\nNOTE: Sam is back and wrote this next section This was a fun experiment. Overall it worked quickly and smoothly. Not all of the agent hand-offs worked as intended so I had to intervene more often than I wanted, but that gives me more to work on!\nI don\u0026rsquo;t intend to offload writing in the future. Writing is thinking. Thinking and learning are the points of this exercise. I will definitely continue to use office hours. That genuinely helped me clarify my thoughts.\nScreenshots 1. Initial prompt to office-hours The starting prompt, I described the goal, the callout box feature, and asked for the session to be documented.\n2. Office-hours Q\u0026amp;A Office-hours surfaced clarifying questions about scope, design preferences, and implementation approach before committing to a plan.\n3. 
Problem statement The structured output from office-hours, a clean problem statement and open question for the planner.\n4. Planner handoff Planner took the problem statement, resolved implementation details, and produced a step-by-step execution plan with a handoff to builder.\n5. Failed reviewer handoff One of the handoffs that didn\u0026rsquo;t work, delegating to the @@reviewer sub-agent required a specific format. The first attempt stalled, so I had to re-frame the prompt.\n6. Reviewer output After the handoff worked, reviewer flagged two major issues (XSS risk, accessibility/contrast) plus a few medium findings.\n7. QA in action The QA sub-agent running build verification and sanity checks against the updated code.\n8. Final review + QA findings QA\u0026rsquo;s output, confirmed the fixes, surfaced a couple of remaining nits, and cleared the final build.\n","permalink":"https://sl-notes.dev/posts/pure-nai-3/","summary":"\u003caside class=\"callout-box\" role=\"note\" aria-label=\"NOTE:\"\u003e\n  \u003cstrong class=\"callout-label\"\u003eNOTE:\u003c/strong\u003e\n  \u003cdiv class=\"callout-body\"\u003e\n    This post was written by Sam\u0026rsquo;s agentic coding stack, not by Sam. Screenshots of the process are at the end.\n  \u003c/div\u003e\n\u003c/aside\u003e\n\n\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eMy agentic coding stack is set up for actual code projects, not blog writing. Having it write this post was just a fun way to show how it works. I wanted a concrete, end-to-end example of the architecture, the toolchain, and how the agents hand work off to each other.\u003c/p\u003e","title":"How To: Agentic Coding Stack with Nutanix Enterprise AI and Opencode [OSS Agentic Coding Part 3]"},{"content":"Introduction This is part 2 in an ongoing series about building an open-source agentic coding platform. 
In part one I covered deploying Nutanix Enterprise AI 2.6 on bare-metal NKP with Pure FlashArray storage via Portworx CSI. In this post I\u0026rsquo;ll review model selection, deployment, tuning and integration with Opencode. Future posts will cover more on the agentic coding setup.\nThis one is denser than part 1 but by the end you\u0026rsquo;ll have three production-tuned models, an Opencode integration, and a small arsenal of kubectl patch tricks to streamline endpoint management.\nModel Selection I wanted three models for my agentic coding setup: one for planning, one for coding, and one for adversarial review and QA. This would allow me to pick models based on their suitability for a particular role. It would also let me enforce a \u0026ldquo;separation of duties\u0026rdquo; across model lineages, with different lineages taking on planning, building, and reviewing.\nWhile I was working on this in April, two powerful new models dropped that looked like perfect candidates for my builder/reviewer roles: gemma-4-31B and Qwen3.6-27B. Both excel at agentic coding, multimodal reasoning, long-context handling, tool use, and efficient multi-step inference. Most importantly, they were small enough to run in my lab with GPU capacity left for a reasoning model to handle planning.\nFor reasoning I landed on gpt-oss-120b:\ngpt-oss was built with multi-step reasoning in mind. Its reasoning parser preserves its thinking, which allows Opencode to pass the reasoning along with the output to the builder agent for additional context (\u0026ldquo;here is what to do AND why\u0026rdquo;); this is enabled with --reasoning-parser=openai_gptoss. It gives me a third lineage to combine with Google and Alibaba. 
The plan:\nModel Role Capabilities Lineage gpt-oss-120b Planning Multi-step reasoning OpenAI Qwen3.6-27B Building Reasoning, context, coding Alibaba gemma4-31B Reviewing Reasoning, context, math/logic Google Model Deployment NAI 2.6 comes with a catalog of 74 pre-validated models: 43 from Hugging Face Model Hub and 31 from the Nvidia NGC Catalog. There are also options to \u0026ldquo;Import Model using Hugging Face Model URL\u0026rdquo; and import models manually. Because Gemma 4 and Qwen 3.6 were newer than NAI 2.6, I\u0026rsquo;d have to use the Import Model from Hugging Face option to deploy those. Since gpt-oss-120b is a validated model, I deployed it first.\nDeploying a Validated Model NAI makes importing a validated model simple. From the Models page: Import Models → From Hugging Face Model Hub, search for gpt-oss-120b, pick the validated entry, give it an instance name, click Import, wait a bit for the download to complete.\nOnce the model status flipped to 🟢Ready, I was ready to create my first endpoint. NAI makes this easy too, especially with pre-validated models. Inference → Local Endpoints → Create Endpoint. Because gpt-oss-120b is on the validated list, NAI selects the inference engine, image, and most arguments automatically. I picked GPU Passthrough, NVIDIA L40S, 4 GPUs and 1 instance. The endpoint quickly deployed and showed as 🟢 Active.\nDeploying a Custom Model Deploying an endpoint using a custom model requires a few more choices than with a validated one, but NAI still makes it relatively straightforward. Since I was importing both from Hugging Face, the process was the same for Gemma and Qwen: Models → Import Models → From Hugging Face Model Hub → Import using Model URL → Enter info:\nModel Model URL Model Instance Name Qwen3.6-27B Qwen/Qwen3.6-27B-FP8 Qwen3.6-27B-FP8 gemma4-31B google/gemma-4-31B-it gemma-4-31b-it When both models showed 🟢Ready, I moved on to Endpoint Creation. 
There are two differences between creating an endpoint from a validated model vs. a custom one that you can, and should, take advantage of:\nSpecify your version of vLLM. Specify custom arguments.\nYou\u0026rsquo;ll find both of these options on the second screen of the endpoint creation wizard: Engine Source → Import from community vLLM registry → specify Engine Tag\nEngine tag for Gemma: gemma4; engine tag for Qwen: v0.19.0.\nHere are two of the custom args I used. More on this later under model tuning. In the endpoint creation wizard it is important to enter these exactly as you see in the table; don\u0026rsquo;t add quotes or prepend dashes, because NAI handles that for you.\nModel Key Value Explanation Both max-model-len 262144 Set context length Both max-num-seqs 8 Limit parallel sequences to maximize context + KV cache with limited VRAM\nThese endpoint deployments took slightly longer due to the custom engine downloads, but still flipped to 🟢 Active within a few minutes.\nFinal endpoint configuration:\nOpencode Integration + Tuning Once I had three endpoints running my three chosen models and successfully responding to NAI\u0026rsquo;s built-in test prompts, I moved on to configuring the models for use with Opencode. I used the add custom provider feature (Settings → Providers → Custom provider → + Connect) in Opencode to connect it to my NAI endpoints.\nFor Base URL it\u0026rsquo;s important to use just https://\u0026lt;URL\u0026gt;/enterpriseai/v1, as Opencode appends /chat/completions to whatever you enter. You can configure multiple models when you add the provider:\nIt\u0026rsquo;s also important that the model-id for each model you configure matches the endpoint\u0026rsquo;s --served-model-name from the server (Endpoint Name in the NAI GUI).\nTuning / Troubleshooting Downloading models, deploying endpoints and connecting to them from Opencode felt almost too easy. 
Everything just worked, including my initial \u0026ldquo;hello world\u0026rdquo; type prompts from Opencode. When I started running some more complicated prompts things finally got a bit interesting. Each model needed at least one model-specific argument before it worked properly with Opencode, and gpt-oss-120b needed the most tuning.\nQwen3.6-27B-FP8 The most visible issue with Qwen was \u0026lt;think\u0026gt;...\u0026lt;/think\u0026gt; blocks showing up in the main response. Setting --reasoning-parser=qwen3 separates Qwen\u0026rsquo;s chain-of-thought into a reasoning_content field instead of letting it bleed into the main response.\nAnother issue I ran into with all three models was that tool calling didn\u0026rsquo;t work out of the box. I\u0026rsquo;ll cover the pattern here once since it applied to all three. Opencode sends tool_choice: \u0026quot;auto\u0026quot; to let the model decide when to invoke a tool, and vLLM rejects that unless --enable-auto-tool-choice is set with a matching --tool-call-parser. For Qwen the correct parser value is qwen3_coder.\ngemma-4-31b-it gemma-4-31b-it had the same reasoning-leak problem as Qwen. It also had a tool-calling problem where Opencode would ask Gemma to call a tool and get back a response that looked like a tool call but wasn\u0026rsquo;t structured as one, so the client couldn\u0026rsquo;t execute it. Gemma uses its own format for both reasoning and tool calls that none of the generic parsers handle, so it needed two model-specific arguments: --reasoning-parser=gemma4 and --tool-call-parser=gemma4 plus the same --enable-auto-tool-choice flag Qwen needed.\ngpt-oss-120b gpt-oss-120b was the most interesting with multiple rounds of issues. First, I had to apply the same tool calling and reasoning leak fixes as with Gemma and Qwen. 
This time that looked like:\n--enable-auto-tool-choice --tool-call-parser=openai --reasoning-parser=openai_gptoss\nThis fixed single-turn tool calls, but multi-turn tool calls, which are important for my agentic coding, were hanging. Anything that required the model to call a tool, see the result, and then call another tool would just stall. Opencode would send the follow-up request, vLLM would receive it, and then nothing: the client would just time out. At first I thought this was an Opencode issue, but I was able to replicate it with curl. Digging into the vLLM GitHub issues, I found the bug: the version bundled with NAI\u0026rsquo;s validated gpt-oss image (nutanix/nai-vllm:v0.13.0-gpu) was dropping reasoning_content from streaming responses on assistant turns that had previously contained it. The fix was to upgrade the endpoint to the upstream vllm/vllm-openai:v0.20.0 image, which had the streaming fix. The image swap can\u0026rsquo;t be done from the NAI UI either; it\u0026rsquo;s another kubectl patch, included in the patch reference in the appendix below.\nI also added \u0026quot;interleaved\u0026quot;: true to my Opencode config for this provider so the client preserves reasoning_content across turns on its side too. In opencode.jsonc this lives under the provider\u0026rsquo;s options block:\n\u0026#34;nai-gpt-oss\u0026#34;: { \u0026#34;options\u0026#34;: { \u0026#34;interleaved\u0026#34;: true } }\nThe validated endpoint also ships with --no-enable-prefix-caching (gpt-oss-120b is incompatible with prefix caching and vLLM refuses to start otherwise), --max-num-batched-tokens=10240, and --max-num-seqs=128. 
I left those in place: the 128 sequence ceiling is higher than the 8 I set on Qwen and Gemma because gpt-oss is the planner and tends toward many short concurrent requests rather than a few long-context ones.\nA Note on Endpoint Management with NAI As I mentioned in the deployment section, custom args can only be set at endpoint creation time, and not at all for validated endpoints. The official NAI workflow to change them is to redeploy the endpoint, and for validated models like gpt-oss-120b, that means importing a custom version from Hugging Face first so the custom-args fields become available.\nThis design makes sense for the common case. Updating vLLM or custom args triggers a redeployment anyway, so funneling everything through the creation wizard keeps the workflow consistent. The point of a validated catalog is that you get a tested, known-good config, so exposing those fields by default would undercut that guarantee. The tradeoff is that iterating on args is heavier than I wanted given how much tuning my agentic-coding use case needed, so I switched to patching the underlying ISVC directly with kubectl.\nThat immediately surfaced a new issue specific to that path: KServe defaults to a rolling update strategy where it creates a new ReplicaSet alongside the old one, waits for it to go ready, and then scales the old one back to 0. If you\u0026rsquo;re assigning all of your GPUs to endpoints like I am, the new RS can never go ready because it needs the GPUs the old one is holding onto, and the old one never gives them up because the new one isn\u0026rsquo;t ready. There are a few ways around this but the easiest is to scale the InferenceService to zero before patching:\n1 2 3 4 kubectl scale deployment qwen3-6-27b-predictor -n nai-admin --replicas=0 # wait for pod to terminate kubectl patch isvc ... 
# patch args kubectl scale deployment qwen3-6-27b-predictor -n nai-admin --replicas=1 The full set of patches I\u0026rsquo;m running today is in the appendix at the end of the post.\nConclusion Now I have three models from three lineages all working cleanly with Opencode and I\u0026rsquo;m ready to deploy my coding agents. Key lessons learned:\nEach model has its own preferences and your client will surface them quickly. Across these three I ended up with three different parser configurations and two different vLLM versions before everything worked cleanly with Opencode. The error modes are distinctive enough that once you\u0026rsquo;ve seen the pattern (think tags in responses, unstructured tool calls, hanging multi-turn requests) you\u0026rsquo;ll recognize what\u0026rsquo;s missing on the next model.\nNAI\u0026rsquo;s endpoint creation wizard is optimized to get you through your initial deployment as quickly and easily as possible. If you need to iterate on args and you\u0026rsquo;re comfortable with it, kubectl patch is the faster path.\nKServe\u0026rsquo;s default rolling update strategy doesn\u0026rsquo;t work for GPU-saturated single-replica endpoints. Scale the Deployment to zero before patching, otherwise you\u0026rsquo;ll end up with two ReplicaSets fighting for the same GPUs and you\u0026rsquo;ll be force-deleting pods to recover.\nStay tuned for the next post in this series, where I\u0026rsquo;ll pick up here with the three-model stack and the Opencode agent setup that actually puts the lineage separation to work.\nAppendix: Final Patches The below patches produce the exact config I\u0026rsquo;m running today. 
Remember to apply the scaling workaround above, or you\u0026rsquo;ll have to manually delete pods to complete the update.\nQwen3.6-27B (ISVC name: qwen3-6-27b):\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 kubectl patch isvc qwen3-6-27b -n nai-admin --type=json -p \u0026#39;[ {\u0026#34;op\u0026#34;: \u0026#34;replace\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;/spec/predictor/containers/0/args\u0026#34;, \u0026#34;value\u0026#34;: [ \u0026#34;--served-model-name=qwen3-6-27b\u0026#34;, \u0026#34;--enable-prefix-caching\u0026#34;, \u0026#34;--enable-prompt-tokens-details\u0026#34;, \u0026#34;--enable-auto-tool-choice\u0026#34;, \u0026#34;--tool-call-parser=qwen3_coder\u0026#34;, \u0026#34;--reasoning-parser=qwen3\u0026#34;, \u0026#34;--trust-remote-code\u0026#34;, \u0026#34;--max-num-seqs=8\u0026#34;, \u0026#34;--max-model-len=262144\u0026#34;, \u0026#34;--tensor-parallel-size=4\u0026#34;, \u0026#34;--gpu-memory-utilization=0.90\u0026#34; ]} ]\u0026#39; gemma-4-31b-it:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 kubectl patch isvc gemma-4-31b-it -n nai-admin --type=json -p \u0026#39;[ {\u0026#34;op\u0026#34;: \u0026#34;replace\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;/spec/predictor/containers/0/args\u0026#34;, \u0026#34;value\u0026#34;: [ \u0026#34;--served-model-name=gemma-4-31b-it\u0026#34;, \u0026#34;--dtype=bfloat16\u0026#34;, \u0026#34;--enable-prefix-caching\u0026#34;, \u0026#34;--enable-prompt-tokens-details\u0026#34;, \u0026#34;--enforce-eager\u0026#34;, \u0026#34;--max-num-seqs=8\u0026#34;, \u0026#34;--limit-mm-per-prompt={\\\u0026#34;image\\\u0026#34;: 1}\u0026#34;, \u0026#34;--tensor-parallel-size=4\u0026#34;, \u0026#34;--gpu-memory-utilization=0.90\u0026#34;, \u0026#34;--max-model-len=262144\u0026#34;, \u0026#34;--enable-auto-tool-choice\u0026#34;, \u0026#34;--tool-call-parser=gemma4\u0026#34;, \u0026#34;--reasoning-parser=gemma4\u0026#34; ]} ]\u0026#39; gpt-oss-120b — two patches, image and args:\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
22 # Image swap (fixes the multi-turn streaming bug) kubectl patch isvc gpt-oss-120b -n nai-admin --type=json -p \u0026#39;[ {\u0026#34;op\u0026#34;: \u0026#34;replace\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;/spec/predictor/containers/0/image\u0026#34;, \u0026#34;value\u0026#34;: \u0026#34;vllm/vllm-openai:v0.20.0\u0026#34;} ]\u0026#39; # Args kubectl patch isvc gpt-oss-120b -n nai-admin --type=json -p \u0026#39;[ {\u0026#34;op\u0026#34;: \u0026#34;replace\u0026#34;, \u0026#34;path\u0026#34;: \u0026#34;/spec/predictor/containers/0/args\u0026#34;, \u0026#34;value\u0026#34;: [ \u0026#34;--served-model-name=gpt-oss-120b\u0026#34;, \u0026#34;--dtype=bfloat16\u0026#34;, \u0026#34;--no-enable-prefix-caching\u0026#34;, \u0026#34;--enable-prompt-tokens-details\u0026#34;, \u0026#34;--max-num-batched-tokens=10240\u0026#34;, \u0026#34;--max-num-seqs=128\u0026#34;, \u0026#34;--max-model-len=131072\u0026#34;, \u0026#34;--enable-auto-tool-choice\u0026#34;, \u0026#34;--tool-call-parser=openai\u0026#34;, \u0026#34;--reasoning-parser=openai_gptoss\u0026#34;, \u0026#34;--tensor-parallel-size=4\u0026#34; ]} ]\u0026#39; After applying the gpt-oss image patch, verify the pod came up with the new image:\n1 2 3 kubectl get pods -n nai-admin -l endpoint=gpt-oss-120b \\ -o jsonpath=\u0026#39;{.items[0].spec.containers[0].image}\u0026#39; # Expected: vllm/vllm-openai:v0.20.0 ","permalink":"https://sl-notes.dev/posts/pure-nai-2/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eThis is part 2 in an ongoing series about building an open-source agentic coding platform. \u003ca href=\"https://sl-notes.dev/posts/pure-nai/\"\u003eIn part one\u003c/a\u003e I covered deploying Nutanix Enterprise AI 2.6 on bare-metal NKP with Pure FlashArray storage via Portworx CSI. In this post I\u0026rsquo;ll review model selection, deployment, tuning and integration with Opencode. 
Future posts will cover more on the agentic coding setup.\u003c/p\u003e\n\u003cp\u003eThis one is denser than part 1 but by the end you\u0026rsquo;ll have three production-tuned models, an Opencode integration, and a small arsenal of \u003ccode\u003ekubectl patch\u003c/code\u003e tricks to streamline endpoint management.\u003c/p\u003e","title":"Adventures in Model Deployment and Tuning with Nutanix Enterprise AI [OSS Agentic Coding Part 2]"},{"content":"I\u0026rsquo;ve worked with Nutanix Enterprise AI (NAI) a lot over the last few months. I\u0026rsquo;ve deployed it across several Nutanix Kubernetes Platform (NKP) architectures: VMs on Nutanix HCI, VMs on Nutanix Cloud Platform with External Storage, and bare-metal Ubuntu backed by Everpure. This post is about the last, the most unusual of the three and the platform for my broadest set of experiments.\nThis will be the first in a series of posts about my work with this cluster. In this post I will focus on the architecture and initial set-up. Later I will cover post-deployment activities including break/fix troubleshooting, model deployment, tuning and other tips and tricks. In the final post I will cover building up an open-source gstack-style agentic coding setup driven by OpenCode and leveraging Qwen3.6-27B-FP8, gemma-4-31B-it and gpt-oss-120b running on NAI.\nWhy did I choose this architecture? I needed to test NAI + NKP + GPU pass-through. The only GPU nodes I had available didn\u0026rsquo;t have M.2 boot drives, but I did have access to an Everpure array. Since NKP works with bare-metal Linux, and NAI just needs Kubernetes, this seemed like sufficient hardware to make it work.\nThe Architecture This is what I built. NKP 2.17 is deployed as a converged management pod with the control plane nodes running as VMs on another lab cluster and the GPU hosts as workers. NAI 2.6.0 was deployed from the NKP Applications marketplace, which was slick. 
Storage is a single Pure FlashArray, accessed via the Portworx CSI (PX-CSI) driver in two modes: FlashArray Direct Access for RWO block volumes (PostgreSQL, ClickHouse, Prometheus) and FA File Services for the RWX NFS share that holds the models.\nDuring this deployment I learned a few lessons related to the storage configuration. I will cover that part in detail so I remember for next time and so others can leverage those lessons while performing similar deployments.\nChoosing a CSI driver My past NAI deployments all used the Nutanix CSI driver, which works great\u0026hellip; if you have Nutanix storage. I wasn\u0026rsquo;t sure what to use this time. Longhorn CSI was installed to access local storage from the nodes, but research (i.e., testing) showed it isn\u0026rsquo;t a generic CSI either and is for Longhorn storage specifically. LLMs (Gemini, ChatGPT and Claude Opus) all told me to use Pure Service Orchestrator (PSO), and Google agreed. I tried PSO but it didn\u0026rsquo;t work, and this didn\u0026rsquo;t seem like a situation where I should have to spend a lot of time troubleshooting. I asked one more LLM, Grok this time. It\u0026rsquo;d recently come out with a new \u0026ldquo;society of mind\u0026rdquo; multi-agent architecture and it finally cleared things up: Pure deprecated PSO in favor of PX-CSI after the Portworx acquisition. This was only a couple of months ago, but all of these LLMs have released one or two minor version updates since then, so your results may vary. 
Then again, the first two organic Google hits for \u0026ldquo;pure csi\u0026rdquo; are still PSO, so maybe not:\nConfiguring PX-CSI for NAI Here is what I had to do to successfully configure PX-CSI for unified block and file (UBF) for NAI:\nInstall PX-CSI following the instructions Force PX-CSI to quit trying to create FlashBlade StorageClasses: PX-CSI\u0026rsquo;s autodiscovery appears to have a baked-in assumption that if you\u0026rsquo;re doing NFS, you have a FlashBlade, and auto-creates storage classes for px-fb-direct-access-nfsv3 and px-fb-direct-access-nfsv4, which kept getting set as default for NFS. To get around this I disabled the autodiscovery feature:\nkubectl delete sc px-fb-direct-access-nfsv3 px-fb-direct-access-nfsv4 kubectl edit storagecluster -n portworx px-cluster # add under metadata.annotations: portworx.io/disable-storage-class: \u0026#34;true\u0026#34; Manually create FA StorageClasses (SC) Now that autodiscovery was disabled I needed to manually create my two FA SCs:\napiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: px-fa-direct-access annotations: storageclass.kubernetes.io/is-default-class: \u0026#34;true\u0026#34; provisioner: pxd.portworx.com parameters: backend: pure_block allowVolumeExpansion: true reclaimPolicy: Delete volumeBindingMode: Immediate I made px-fa-direct-access the default SC.\napiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: px-fa-nfs provisioner: pxd.portworx.com parameters: backend: \u0026#34;pure_fa_file\u0026#34; pure_fa_file_system: \u0026#34;lab-01-nfs\u0026#34; pure_nfs_policy: \u0026#34;lab-01nfs-export\u0026#34; allowVolumeExpansion: true reclaimPolicy: Delete volumeBindingMode: Immediate Create SCs (plural) required for NAI The NAI deployment instructions for NKP are very explicit about the requirement for a storage class named \u0026ldquo;nai-nfs-storage\u0026rdquo;, so I created a copy of px-fa-nfs with that name. 
Less obvious, but to be fair still documented in the \u0026ldquo;Nutanix Enterprise AI Configuration Parameters for the nai-operators Helm Chart\u0026rdquo; section of the documentation, is that ClickHouse Keeper and ClickHouse Server both have a storageClass parameter that defaults to \u0026ldquo;nutanix-volume\u0026rdquo;. I found this out when ClickHouse PVCs stayed Pending with error: \u0026lsquo;failed to create Directory (400): Msg1: File system does not exist.\u0026rsquo; after I tried to enable NAI. I resolved this by creating a copy of the px-fa-direct-access SC called nutanix-volume. In theory I could instead have updated nai-clickhousekeeper.clickhouseKeeper.storage.storageClass and nai-clickhouseserver.clickhouse.storage.storageClass, but I didn\u0026rsquo;t want to run into any more surprises. I suspect this SC is created by default when you run NKP on NCP.\nSo finally, I had this: kubectl get sc\nNAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE nai-nfs-storage pxd.portworx.com Delete Immediate true 71d nutanix-volume pxd.portworx.com Delete Immediate true 71d px-fa-direct-access (default) pxd.portworx.com Delete Immediate true 71d px-fa-nfs pxd.portworx.com Delete Immediate true 71d Then I deleted the pending pods and the install completed:\nkubectl delete pod -n nai-system --field-selector=status.phase=Pending --force --grace-period=0\nWhile working on CSI, I noticed the iSCSI interfaces on Everpure were set to use Jumbo Frames:\nI decided to match the MTU by updating my worker nodes to 9000 to align with Pure\u0026rsquo;s performance recommendation.\nUntil next time\u0026hellip; Bare-metal NKP + external storage works great for NAI, but the NAI on NKP deployment instructions may assume NKP on NCP, so it might be worth your time to double-check the NAI Configuration Parameters before you start and pay close attention to your storage classes.\nIn a later post in this series, I\u0026rsquo;ll cover model 
deployment, endpoint deployment and post-deployment tasks using both pre-validated models and the \u0026ldquo;Import Models \u0026gt; From Hugging Face Model Hub\u0026rdquo; feature; see part 2 for those details.\n","permalink":"https://sl-notes.dev/posts/pure-nai/","summary":"\u003cp\u003eI\u0026rsquo;ve worked with \u003ca href=\"https://www.nutanix.com/products/nutanix-enterprise-ai\"\u003eNutanix Enterprise AI\u003c/a\u003e (NAI) a lot over the last few months. I\u0026rsquo;ve deployed it across several \u003ca href=\"https://www.nutanix.com/products/kubernetes-management-platform\"\u003eNutanix Kubernetes Platform\u003c/a\u003e (NKP) architectures: VMs on Nutanix HCI, VMs on \u003ca href=\"https://www.nutanix.com/tech-center/blog/nutanix-cloud-platform-with-external-storage\"\u003eNutanix Cloud Platform with External Storage\u003c/a\u003e and bare-metal Ubuntu backed by Everpure. This post is about the last, the most unusual of the three and the platform for my broadest set of experiments.\u003c/p\u003e\n\u003cp\u003eThis will be the first in a series of posts about my work with this cluster. In this post I will focus on the architecture and initial set-up. Later I will cover post-deployment activities including break/fix troubleshooting, model deployment, tuning and other tips and tricks. In the final post I will cover building up an open-source \u003ca href=\"https://github.com/garrytan/gstack\"\u003egstack-style\u003c/a\u003e agentic coding setup driven by OpenCode and leveraging Qwen3.6-27B-FP8, gemma-4-31B-it and gpt-oss-120b running on NAI.\u003c/p\u003e","title":"Notes from Deploying NAI 2.6 on Bare-Metal NKP with Everpure Storage [OSS Agentic Coding Part 1]"},{"content":"Problem When planning Nutanix deployments, the official Sizer (sizer.nutanix.com) is the most accurate tool. 
However, many situations only need a quick napkin-level estimate rather than full precision.\nNutanix Sizer includes a Storage Capacity Calculator, but it didn\u0026rsquo;t cover exactly what was needed for rapid scenarios.\nThe Solution I built a simple web-based Nutanix Storage Calculator that delivers fast, practical storage sizing estimates for Nutanix HCI.\nIt lets you quickly test various scenarios with the level of detail useful for initial planning or team discussions.\nTry it here: Nutanix Storage Calculator\nNote: This is based on practical experience and is not official. Always validate final designs in the full Nutanix Sizer before ordering hardware.\nThe calculator is part of ongoing efforts to simplify common Nutanix planning tasks.\n","permalink":"https://sl-notes.dev/posts/nutanix-storage-calculator/","summary":"\u003ch2 id=\"problem\"\u003eProblem\u003c/h2\u003e\n\u003cp\u003eWhen planning Nutanix deployments, the official Sizer (sizer.nutanix.com) is the most accurate tool. However, many situations only need a quick napkin-level estimate rather than full precision.\u003c/p\u003e\n\u003cp\u003eNutanix Sizer includes a Storage Capacity Calculator, but it didn\u0026rsquo;t cover exactly what was needed for rapid scenarios.\u003c/p\u003e\n\u003ch2 id=\"the-solution\"\u003eThe Solution\u003c/h2\u003e\n\u003cp\u003eI built a simple web-based \u003cstrong\u003eNutanix Storage Calculator\u003c/strong\u003e that delivers fast, practical storage sizing estimates for Nutanix HCI.\u003c/p\u003e\n\u003cp\u003eIt lets you quickly test various scenarios with the level of detail useful for initial planning or team discussions.\u003c/p\u003e","title":"Nutanix Storage Calculator"},{"content":"Sam\u0026rsquo;s Tech Notes shares technical explainers, short industry takes, and notes on products or tools worth paying attention to. 
Content will be fairly technical but accessible where possible.\nAbout the Author I am a Principal Technologist at Expedient where I focus on product and strategy for the \u0026ldquo;Powered by Nutanix\u0026rdquo; cloud services line. I\u0026rsquo;ve worked in the industry for over 20 years and am currently serving as a member of the Nutanix Technology Champion program.\nI also occasionally post Expedient-specific content on the Expedient Blog.\nHow to follow along Bookmark, or subscribe via RSS.\n","permalink":"https://sl-notes.dev/posts/hello-world/","summary":"\u003cp\u003eSam\u0026rsquo;s Tech Notes shares technical explainers, short industry takes, and notes on products or tools worth paying attention to. Content will be fairly technical but accessible where possible.\u003c/p\u003e\n\u003ch2 id=\"about-the-author\"\u003eAbout the Author\u003c/h2\u003e\n\u003cp\u003eI am a Principal Technologist at Expedient where I focus on product and strategy for the \u0026ldquo;Powered by Nutanix\u0026rdquo; cloud services line. I\u0026rsquo;ve worked in the industry for over 20 years and am currently serving as a member of the Nutanix Technology Champion program.\u003c/p\u003e","title":"Sam's Tech Notes"},{"content":"Sam\u0026rsquo;s Tech Notes shares technical explainers, short industry takes, and notes on products or tools worth paying attention to. Content will be fairly technical but accessible where possible.\nAbout the Author I am a Principal Technologist at Expedient where I focus on product and strategy for the \u0026ldquo;Powered by Nutanix\u0026rdquo; cloud services line. I\u0026rsquo;ve worked in the industry for over 20 years and am currently serving as a member of the Nutanix Technology Champion program.\nI also occasionally post Expedient-specific content on the Expedient Blog.\nHow to follow along Bookmark, or subscribe via RSS.\n","permalink":"https://sl-notes.dev/about/","summary":"About Sam Larson","title":"About"}]