NVIDIA’s 550B Open Model Hits 300 Tokens/Second. The Benchmark Numbers Are Almost Irrelevant.

Nemotron 3 Ultra drops with 550B total parameters, 55B live via Mixture-of-Experts, and a 300 tokens/sec throughput claim that rewrites what’s economically viable for production agents.

Key Takeaways

– Nemotron 3 Ultra: 550B total parameters, 55B active at once via a Mixture-of-Experts architecture. OpenMDW license through the Linux Foundation. Free to download, no licensing fees.
– 300 tokens/sec on OpenRouter; 5.9x the throughput of GLM-5.1-754B and 4.8x that of Kimi K2.6 in NVIDIA’s head-to-head benchmarks.
– Full weights, training data, and training recipes released June 4, 2026.
– NVIDIA claims up to 30% cost savings on agentic workloads. Plausible, but measured only on NVIDIA silicon. NemoClaw and OpenShell are where the enterprise business lives.

---

Here’s what everyone’s getting wrong about this launch.

The headlines say: “Top US open model. 48 on the Artificial Analysis Intelligence Index. Sits 6 points behind Kimi K2.6.” And sure, that’s accurate.

But if you’re running production agents right now, benchmark obsession is missing the actual plot.

What Nemotron 3 Ultra Actually Delivers

550 billion parameters total. 55 billion active at any moment.

NVIDIA dropped it June 4 with the whole stack open.

Weights, training data, training recipes. All of it under the Linux Foundation’s OpenMDW license. Download it. Run it. No licensing fees, no gatekeepers.

The benchmarks hold up too.

RULER score of 94.7 at 1M token context.

SWE-Bench Verified at 71.9. It handles English, French, Spanish, German, Japanese, Korean, Hindi, Portuguese, and Chinese, though NVIDIA’s own model card notes that the primary use cases are English and code.

You can go test it on OpenRouter right now. Takes maybe ten minutes.

The Numbers Worth Tracking in Real Pipelines

Here’s what gets me excited. NVIDIA’s benchmarks show 5.9x higher inference throughput than GLM-5.1-754B-A40B. 4.8x higher than Kimi-K2.6-1T-A32B. 1.6x higher than Qwen-3.5-397B-17B. All measured in an 8k-token input / 64k-token output setting. That’s the regime driving agentic pipelines, not quick one-shot prompts.

| Model | Relative Throughput |
|---|---|
| Nemotron 3 Ultra 550B | 1.0x (baseline) |
| GLM-5.1-754B-A40B | 0.17x |
| Kimi-K2.6-1T-A32B | 0.21x |
| Qwen-3.5-397B-17B | 0.63x |
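
The relative-throughput column is just the reciprocal of the speedups NVIDIA quotes. A minimal sketch of that arithmetic, using only the multipliers cited above (the function and variable names are illustrative):

```python
# Speedup multipliers quoted from NVIDIA's benchmarks
# (Nemotron 3 Ultra throughput divided by each model's throughput).
speedups = {
    "GLM-5.1-754B-A40B": 5.9,
    "Kimi-K2.6-1T-A32B": 4.8,
    "Qwen-3.5-397B-17B": 1.6,
}


def relative_throughput(speedup: float) -> float:
    """Throughput of the other model relative to Nemotron 3 Ultra (baseline 1.0)."""
    return 1.0 / speedup


relative = {name: relative_throughput(s) for name, s in speedups.items()}

for name, value in relative.items():
    # Printed to three decimals; the table rounds these to two.
    print(f"{name}: {value:.3f}x")
```

These are ratios from a single 8k-input / 64k-output setting, so they say nothing about absolute tokens per second on your own hardware.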

300 tokens per second on OpenRouter.

Right now. That’s roughly five to six times the throughput of GLM-5.1 and Kimi K2.6 in NVIDIA’s head-to-heads.

Here’s the contrarian angle nobody’s leading with: for small operators running automated pipelines, speed beats smarts every time. I’ve watched this play out on client work over and over. You’re not building a chatbot. You’re running document processing, CRM automation, multi-step research. Latency compounds. Throughput compounds. Getting a result in 4 seconds instead of 20 doesn’t just feel better.

It changes what you can charge for the task and how many calls you can squeeze inside a rate limit window.
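
To see how that compounds, here is a back-of-the-envelope sketch. The workload (4 sequential steps of 300 output tokens each) and the 60 tokens/sec comparison model are assumptions picked to reproduce the 4-second versus 20-second example; only the 300 tokens/sec figure comes from the launch numbers. It ignores prompt processing, network latency, and tool-call time.

```python
def generation_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Seconds spent generating output for a sequential multi-step agent run."""
    return steps * tokens_per_step / tokens_per_second


def jobs_per_window(window_seconds: float, job_seconds: float) -> int:
    """How many back-to-back jobs one worker can finish inside a time window."""
    return int(window_seconds // job_seconds)


# Hypothetical workload: 4 sequential steps, 300 output tokens each.
fast = generation_seconds(steps=4, tokens_per_step=300, tokens_per_second=300.0)
slow = generation_seconds(steps=4, tokens_per_step=300, tokens_per_second=60.0)  # assumed slower model

print(f"fast: {fast:.0f}s per job, {jobs_per_window(60, fast)} jobs per minute")
print(f"slow: {slow:.0f}s per job, {jobs_per_window(60, slow)} jobs per minute")
```

Under these assumptions one worker finishes 15 jobs a minute instead of 3. Swap in your own step counts and token budgets before drawing conclusions.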

NVIDIA claims up to 30% lower cost for complex agentic workloads.

That’s their promotional material, so yeah, grain of salt. But the direction tracks: a faster model handling the same agentic tasks is cheaper per completed job. For a solo operator running hundreds of agent calls daily, that’s not nothing.

Why Throughput Beats Benchmark Scores in Production

The benchmark story (48 on the Artificial Analysis Intelligence Index, 6 points behind Kimi K2.6) is the wrong frame for production decisions.

Palantir and Siemens aren’t touching this for leaderboard positioning.

They’re running it in workflows that generate revenue by completing tasks faster. The throughput advantage is the product. When your agent pipeline handles customer service queries, supply chain decisions, or IT security triage, every millisecond of latency is money.

Compound that across hundreds of thousands of daily calls and the intelligence gap becomes secondary to the speed gap.

For small businesses and indie operators, this is the part worth sitting with.

Running Llama 70B and hitting throughput walls on agentic tasks like multi-step reasoning, tool use, and long-horizon planning? The jump to 300 tokens per second changes what’s economically viable to automate. Currently paying for GPT-4.5-class access and watching token bills climb? A locally hostable 550B model with this throughput profile is worth a serious cost analysis.

One caveat: if you’re running on AMD or Intel GPUs, don’t assume this is a free win. More on that in a second.

Where the Actual Business Lives

NVIDIA gave the model away. That’s the headline. The actual business is in the enterprise Agent Toolkit: NemoClaw and OpenShell.

The model is free under an open license. NemoClaw and OpenShell are not. They’re built for NVIDIA’s stack. The companies already announced as customers (Cadence, CrowdStrike, Palantir, Siemens, Foxconn) are enterprise shops with NVIDIA infrastructure already in place. For them, the math is simple: download the weights, tune on proprietary data, run on hardware they already own, pay nothing in licensing.

For operators not already invested in NVIDIA infrastructure, the picture gets more complicated.

The speed and cost advantages NVIDIA cites are measured on their hardware.

Running Nemotron 3 Ultra on other GPU architectures is technically possible, but expect different performance characteristics. The open weights are real. The open-source freedom is partially real. The enterprise toolkit is not cross-platform by design.

That’s not a knock on the model. Just the actual trade-off. Download it, run it, benchmark it on your own hardware. Then decide.

What Small Operators Should Do Next

Running agentic pipelines today and watching API costs climb? Test Nemotron 3 Ultra on OpenRouter this week. Free.

No excuse not to.

Run it through your actual workload. Measure tokens per second and cost per completed task against whatever you’re using now.
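
A minimal sketch of that measurement, assuming you can wrap one end-to-end task in a function that reports how many output tokens it used and whether it succeeded. `run_task`, `fake_task`, and the price per million tokens are all hypothetical placeholders, not part of any real API. The sketch counts output tokens only, so add input-token pricing if your provider bills for it.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RunStats:
    tokens_per_second: float
    cost_per_completed_task: float
    completion_rate: float


def benchmark(
    run_task: Callable[[], Tuple[int, bool]],
    runs: int,
    usd_per_million_output_tokens: float,
) -> RunStats:
    """Time `runs` calls of run_task, which returns (output_tokens, succeeded)."""
    total_tokens = 0
    completed = 0
    start = time.perf_counter()
    for _ in range(runs):
        tokens, ok = run_task()
        total_tokens += tokens
        completed += int(ok)
    elapsed = time.perf_counter() - start
    total_cost = total_tokens / 1_000_000 * usd_per_million_output_tokens
    return RunStats(
        tokens_per_second=total_tokens / elapsed if elapsed > 0 else 0.0,
        # Failed runs still cost money, so divide by completed tasks only.
        cost_per_completed_task=total_cost / completed if completed else float("inf"),
        completion_rate=completed / runs,
    )


def fake_task() -> Tuple[int, bool]:
    """Stand-in for a real agent call: pretends to emit 500 tokens and succeed."""
    time.sleep(0.01)
    return 500, True


# The price is a placeholder; use the rate you actually pay.
stats = benchmark(fake_task, runs=5, usd_per_million_output_tokens=2.0)
print(stats)
```

Run the same harness against your current model and against Nemotron 3 Ultra with identical tasks, and compare cost per completed task as well as raw speed.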

If the numbers work (and for high-throughput agentic tasks, they might), you have a real decision to make about where your agents live. Locally hosted open weights eliminate per-token API fees entirely.

The GPU investment pays back over time against ongoing API spend.
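
How long that payback takes is simple division. A rough sketch, where every dollar figure is a placeholder rather than a quote, and where power, hosting, and maintenance are lumped into one assumed monthly running cost:

```python
import math


def payback_months(hardware_usd: float, monthly_api_usd: float, monthly_running_usd: float) -> float:
    """Months until a one-off hardware purchase is recovered by avoided API spend.

    Returns math.inf when self-hosting costs at least as much per month as the API.
    """
    monthly_saving = monthly_api_usd - monthly_running_usd
    if monthly_saving <= 0:
        return math.inf
    return hardware_usd / monthly_saving


# Placeholder figures: replace with your own hardware quote and bills.
months = payback_months(hardware_usd=120_000, monthly_api_usd=8_000, monthly_running_usd=2_000)
print(f"Break-even after {months:.1f} months")
```

If the result comes back as infinity, or longer than the hardware’s useful life, staying on an API is the cheaper option.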

The lock-in risk is real if you’re considering NemoClaw or OpenShell.

Evaluate those tools on their merits for your stack, not as a bundle deal with the model.

The era of “open model = worse model” is over. Nemotron 3 Ultra is proof. What you do with that is the interesting part.