NVIDIA launched Nemotron 3 Super at GTC on March 11, 2026, and the benchmark that caught the AI research community’s attention wasn’t just a model score—it was a throughput number. At 5x the inference speed of the previous Nemotron Super and 2.2x the throughput of competing 120B-class models, this is the first open-source model that makes running a 120-billion-parameter model feel like a 12-billion-parameter model at inference time. That’s not marketing. That’s the architecture doing its job.
Rating: 8.2/10 ⭐⭐⭐⭐
What Is Nvidia Nemotron 3 Super?
Nemotron 3 Super is an open-source large language model developed by NVIDIA Corporation and released on March 11, 2026. It sits in the middle of the Nemotron 3 family: above Nano (30B total / 3B active) and below the upcoming Ultra (~500B total / ~50B active). The full parameter count is 120B, but through Mixture-of-Experts routing, only 12B parameters activate per token—delivering frontier-class reasoning at mid-tier inference cost.
Its one-line differentiator: a hybrid Mamba-Transformer MoE model with a native 1 million token context window, optimized from the ground up for multi-agent, long-horizon AI systems. Weights, training data (10 trillion tokens), and training recipes are all public. It’s available on Hugging Face and via NVIDIA NIM.
The Story: Why 5x Throughput Changes the Economics of Agentic AI
Multi-agent AI systems have a dirty secret: they generate up to 15x more tokens than standard chat interactions. Every sub-agent call re-sends history, tool outputs, and reasoning chains. That “context explosion” makes deploying large reasoning models at scale prohibitively expensive—NVIDIA calls it the “thinking tax.”
Nemotron 3 Super is engineered specifically to eliminate that tax. Three architectural decisions make it possible:
1. Hybrid Mamba-Transformer Backbone. Mamba-2 state space model layers handle the bulk of sequence processing at linear time complexity—so a 1 million token context doesn’t blow up memory. Transformer attention layers are interleaved at key depths to preserve precise associative recall (the “needle in a haystack” problem). Combined, the architecture achieves 4x better memory and compute efficiency versus pure Transformer at the same parameter count.
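To make the interleaving concrete, here is a minimal sketch of how a hybrid stack's layer schedule might look. The cadence (one attention layer per six Mamba layers) and the layer count are illustrative assumptions for this sketch, not NVIDIA's published configuration.

```python
# Hypothetical layer schedule for a hybrid Mamba-Transformer stack.
# The interleave ratio (one attention layer per 6 Mamba layers) is an
# illustrative assumption, not NVIDIA's actual published configuration.

def build_layer_schedule(n_layers: int, attention_every: int = 6) -> list[str]:
    """Return a per-layer type list: an attention layer at a fixed
    cadence for associative recall, Mamba (SSM) layers everywhere else
    for linear-time sequence processing."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

schedule = build_layer_schedule(48)
print(schedule.count("mamba"), schedule.count("attention"))  # 40 8
```

The design intuition: most layers only need to carry compressed sequence state forward (cheap, linear), while a sparse minority do full attention to pin down exact token-level lookups.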
2. Latent MoE. Before tokens reach expert layers, they’re compressed into a low-rank latent space. Expert computation happens in this smaller dimension, then projects back. The result: Nemotron 3 Super can call 4x as many expert specialists for the same inference cost as a standard MoE. Granular specialization—Python syntax experts, SQL logic experts, cybersecurity triage experts—at zero added cost per inference.
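A toy forward pass shows where the savings come from: expert matmuls run in the latent dimension, not the model dimension. All hyperparameters below (`d_model=64`, `d_latent=16`, 8 experts, top-2 routing) are made up for illustration and are not Nemotron's actual values.

```python
import numpy as np

# Toy Latent MoE forward pass. Dimensions and routing are illustrative
# assumptions, not Nemotron's real hyperparameters.
d_model, d_latent, n_experts, top_k = 64, 16, 8, 2
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.1   # compress to latent
W_up = rng.standard_normal((d_latent, d_model)) * 0.1     # project back
router = rng.standard_normal((d_model, n_experts)) * 0.1
experts = rng.standard_normal((n_experts, d_latent, d_latent)) * 0.1

def latent_moe(x: np.ndarray) -> np.ndarray:
    """x: (tokens, d_model) -> (tokens, d_model)."""
    z = x @ W_down                    # expert math happens in d_latent
    logits = x @ router
    out = np.zeros_like(z)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]      # pick top-k experts
        weights = np.exp(logits[t][top])
        weights /= weights.sum()                  # softmax over chosen experts
        for w, e in zip(weights, top):
            out[t] += w * (z[t] @ experts[e])
    return out @ W_up                 # back to d_model

y = latent_moe(rng.standard_normal((4, d_model)))
print(y.shape)  # (4, 64)
```

Because each expert matmul costs O(d_latent²) instead of O(d_model²), shrinking the expert dimension lets you afford many more experts at the same per-token budget—which is the arithmetic behind the "4x as many specialists" claim.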
3. Multi-Token Prediction (MTP). Instead of predicting one token per forward pass, MTP heads forecast multiple future tokens simultaneously. This provides up to a 3x wall-clock speedup on long sequences and acts as built-in speculative decoding without a separate draft model.
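The speculative-decoding mechanic can be sketched with two stand-in models: MTP heads propose k tokens, the base model verifies them, and the longest agreeing prefix is accepted in one pass. Both "models" here are deterministic toys invented for the sketch.

```python
# Toy illustration of MTP acting as built-in speculative decoding.
# base_model and mtp_heads are deterministic stand-ins, not a real network.

def base_model(context: list[int]) -> int:
    """Stand-in next-token rule: next token = (last + 1) % 10."""
    return (context[-1] + 1) % 10

def mtp_heads(context: list[int], k: int = 4) -> list[int]:
    """Stand-in MTP draft: mostly right, but the last head is noisy."""
    draft = [(context[-1] + i) % 10 for i in range(1, k + 1)]
    draft[-1] = 0  # simulate a wrong speculative token
    return draft

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    draft = mtp_heads(context, k)
    accepted = []
    for tok in draft:
        if tok == base_model(context + accepted):   # verify each draft token
            accepted.append(tok)
        else:
            accepted.append(base_model(context + accepted))  # correct it, stop
            break
    return context + accepted

print(speculative_step([3]))  # [3, 4, 5, 6, 7]
```

One verification pass here emits four tokens instead of one—the source of the wall-clock speedup—without a separate draft model, since the MTP heads ride along with the main network.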
The practical outcome: on PinchBench—a benchmark specifically measuring LLM performance as an autonomous agent brain—Nemotron 3 Super scores 85.6%, ranking as the best open model in its class as of launch.
Benchmark Performance
Here’s how Nemotron 3 Super performs against top open-source competition across standard evaluation suites. Note that NVIDIA’s benchmarks are self-reported; third-party reproduction is ongoing as of March 2026.
| Benchmark | Nemotron 3 Super (120B/12B active) | Qwen 3.5 122B | DeepSeek V3 | Meta Llama 4 |
|---|---|---|---|---|
| MMLU-Pro | 83.73% | 86.70% | 81.6% | ~80% |
| GPQA (no tools) | 79.23% | 86.60% | 73.3% | ~75% |
| GPQA (with tools) | 82.70% | N/A | N/A | N/A |
| SWE-bench Verified | 60.47% | ~66% | 49.2% | ~55% |
| HumanEval | 79.40% | ~82% | 82.6% | ~78% |
| PinchBench (Agentic) | 85.6% | N/A | N/A | N/A |
| Throughput vs 120B-class | 2.2x faster | 1x baseline | ~1.1x | ~1x |
| Context Window | 1M tokens | 128K tokens | 128K tokens | 128K tokens |
Sources: NVIDIA Technical Report, llm-stats.com, Artificial Analysis. Competitor figures are independently reported; some are approximate as of March 2026.
The headline: Nemotron 3 Super isn’t the outright accuracy leader (Qwen 3.5 122B outscores it on MMLU-Pro and GPQA without tools), but when you factor in throughput and the 1M context window, it’s the only open model that can run agentic workloads at frontier-class quality without bankrupting your inference budget.
Pricing
| Access Method | Cost | Notes |
|---|---|---|
| Open Weights (HuggingFace) | Free | Self-host. Min 2x H100-80GB. NVIDIA Nemotron Open License. |
| OpenRouter (free tier) | Free (rate-limited) | nvidia/nemotron-3-super-120b-a12b:free |
| NVIDIA NIM API | TBD | Pricing not announced at launch. Check build.nvidia.com. |
| Google Cloud | Cloud GPU rates | Available now via Vertex AI / Model Garden. |
| Oracle Cloud | Cloud GPU rates | Available at launch. AWS/Azure coming. |
For comparison, here is how API pricing stacks up across the open-weight field:

| Model | Open Weights | API Pricing (Input / 1M tokens) |
|---|---|---|
| Nemotron 3 Super | ✅ Free | TBD |
| Meta Llama 4 | ✅ Free | ~$0.40–$0.60 (via cloud APIs) |
| Mistral Large 2 | ✅ Free | ~$2.00 (Mistral API) |
| DeepSeek V3 | ✅ Free | $0.27 (DeepSeek API) |
Note: NIM pricing opacity is a genuine concern covered in the controversy section below.
Key Features
1. Native 1 Million Token Context Window
This isn’t a “theoretical” context window achieved through chunking tricks. Mamba-2 layers provide linear-time sequence complexity, making 1M tokens computationally tractable. For agentic AI, this means an agent can hold an entire codebase, full conversation history, and retrieved document sets in working memory simultaneously. The limitation: maximum context performance requires sufficient GPU memory—running 1M context on minimum hardware (2x H100-80GB) will constrain batch sizes significantly.
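A quick back-of-envelope shows why linear-time layers matter at 1M tokens: a pure Transformer's KV cache grows with every token, while an SSM layer's state is fixed-size. The model dimensions below are illustrative assumptions for the arithmetic, not Nemotron's published config.

```python
# Why a 1M-token KV cache is intractable for a pure Transformer.
# Dimensions (96 layers, 8 KV heads, head_dim 128) are illustrative
# assumptions, not Nemotron's actual configuration.

def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # 2 tensors (K and V) per layer, per token, at FP16 (2 bytes/value)
    return 2 * tokens * n_layers * n_kv_heads * head_dim * bytes_per_val

gb = kv_cache_bytes(
    tokens=1_000_000, n_layers=96, n_kv_heads=8, head_dim=128
) / 1e9
print(f"{gb:.0f} GB")  # per single sequence
```

Hundreds of gigabytes for a single sequence is why attention-only architectures top out around 128K in practice; a Mamba layer carries a constant-size state regardless of how many tokens it has consumed.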
2. Controllable Reasoning (ON / Low Effort / OFF)
Via the enable_thinking parameter in the chat template, users can dial reasoning intensity to match task complexity. Full reasoning ON for multi-step coding and analysis; Low Effort for moderate complexity; OFF for direct conversational responses. This matters economically—unnecessary chain-of-thought on simple queries burns tokens. The limitation: the Low Effort mode’s quality ceiling compared to full reasoning hasn’t been independently benchmarked at scale yet.
3. NVFP4 Native Pretraining
Nemotron 3 Super is the first model pretrained natively in NVIDIA’s 4-bit floating-point format. On Blackwell B200 GPUs, this delivers 4x faster inference than FP8 on H100—not just quantization, but native precision. The limitation: NVFP4 is exclusive to Blackwell-generation hardware. If you’re running H100 or older, you’re using the FP8 variant at roughly standard speeds. This is the GPU lock-in issue (see Controversy section).
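The memory arithmetic behind the precision tiers is simple, and it explains the 2x H100-80GB floor for the FP8 variant (weights only; KV cache and activations come on top):

```python
# Rough weight-memory arithmetic for a 120B-parameter model at different
# precisions. Ignores per-block scaling factors and runtime overhead,
# which add a few percent in practice.

PARAMS = 120e9

def weights_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{name}: {weights_gb(bits):.0f} GB")
```

At FP8, 120 GB of weights just fits across 2x H100-80GB with headroom for the KV cache; NVFP4 halves that again, which is where the larger batches and throughput gains on Blackwell come from.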
4. Fully Open Training Stack
NVIDIA released not just weights but the complete training recipe: 10 trillion unique curated tokens (public dataset collections on HuggingFace), NeMo RL post-training code, and NeMo Gym configurations for the 21-environment reinforcement learning setup. Organizations can reproduce, fine-tune, or build custom variants from first principles. Limitation: training infrastructure requirements are substantial—reproducing the full training run requires significant Blackwell cluster access.
5. Multi-Language Support
English, French, German, Italian, Japanese, Spanish, and Chinese are natively supported. This expands the viable deployment base beyond English-only enterprise use cases. SWE-bench Multilingual scores (45.78%) are significantly lower than English SWE-bench (60.47%), signaling that non-English coding tasks see a real quality drop.
6. Enterprise Early Adoption
Perplexity, Palantir, Cadence, and Siemens were live on Nemotron 3 Super at launch—not a beta program, production deployments. Perplexity’s use case (high-volume inference at scale) is the strongest validation signal. Palantir’s adoption (government/defense AI) suggests the model passed security and capability reviews most open models never see. Limitation: enterprise customer case studies with measurable outcomes aren’t public yet.
Who Is It For / Who Should Look Elsewhere
Use Nemotron 3 Super if you:
- Run multi-agent AI pipelines at scale where inference cost and throughput are primary constraints
- Need a 1 million token context window for codebase analysis, long-document RAG, or enterprise knowledge retrieval
- Have or are acquiring NVIDIA Blackwell (B200) hardware and want to extract maximum performance
- Are a researcher who needs an open training stack to study, reproduce, or extend cutting-edge agentic AI
- Build enterprise AI applications in cybersecurity, software development automation, or IT operations
Look elsewhere if you:
- Are a casual user or developer without dedicated GPU infrastructure—this model has no consumer product
- Need pure benchmark accuracy above throughput (Qwen 3.5 122B outscores on MMLU-Pro and GPQA)
- Are running on non-NVIDIA hardware (AMD ROCm, Apple Silicon)—NVFP4 optimizations are Blackwell-exclusive
- Need multimodal capabilities—Nemotron 3 Super is text-only
Comparison Table
| Feature | Nemotron 3 Super | Meta Llama 4 | Mistral Large 2 | DeepSeek V3 |
|---|---|---|---|---|
| Total Parameters | 120B (12B active) | ~70B | 123B | 671B (37B active) |
| Architecture | Hybrid Mamba-Transformer + Latent MoE | Dense Transformer | Dense Transformer | MoE Transformer |
| Context Window | 1M tokens | 128K tokens | 128K tokens | 128K tokens |
| Open Weights | ✅ Yes | ✅ Yes (Llama license) | ✅ Yes (MRL) | ✅ Yes (MIT) |
| License | NVIDIA Nemotron Open Model License | Llama Community License | Mistral Research License | MIT |
| API Pricing (per 1M in tokens) | TBD (NIM) | ~$0.40–0.60 | ~$2.00 | $0.27 |
| Best For | Agentic AI, long-context, high-throughput | General purpose, broad ecosystem | European data residency, multilingual | Cost-efficient inference, coding |
| Min Hardware | 2x H100-80GB | 1x A100-80GB (quantized) | 2x A100-80GB | 4x H100-80GB |
| Multimodal | ❌ Text only | ✅ Text + Image | ❌ Text only | ❌ Text only |
| SWE-bench Verified | 60.47% | ~55% | ~44% | 49.2% |
| Controllable Reasoning | ✅ ON / Low / OFF | ❌ | ❌ | ✅ Extended Thinking (V3) |
| Cloud Availability | Google Cloud, Oracle (AWS/Azure soon) | All major clouds | All major clouds | DeepSeek API + major clouds |
| Training Data | 25T tokens (10T public) | Undisclosed | Undisclosed | Undisclosed |
Controversy: What They Don’t Advertise
The GPU Lock-In Problem
NVFP4 is the headline performance claim—4x faster on B200 vs FP8 on H100. But read the fine print: NVFP4 is exclusive to NVIDIA Blackwell architecture. If you’re running a current H100 cluster (the most common enterprise GPU deployment as of 2026), you get the FP8 variant at standard H100 speeds. The “5x throughput” headline improvement is largely an architectural upgrade plus the B200-specific NVFP4 gains compounded together. Organizations on pre-Blackwell hardware get a genuinely better model, but not the full marketing headline. This isn’t disclosed prominently in NVIDIA’s launch materials.
NVIDIA’s Vertical Integration Play
Think about the positioning: NVIDIA releases a state-of-the-art open model that performs best on NVIDIA hardware (Blackwell), distributed via NVIDIA NIM API (with undisclosed pricing), accessible through NVIDIA-preferred cloud partners (Google Cloud, Oracle), using NVIDIA’s NeMo training stack, and benchmarked on a new benchmark (PinchBench) where it scores highest. Every layer of the stack—chips, software, distribution, evaluation—is NVIDIA. The “open” model is genuinely open in terms of weights and training data, but the ecosystem moat is pure NVIDIA. Whether this is a strategic lock-in play or just good vertical integration depends on your tolerance for single-vendor dependency.
Open Weights ≠ Open Source
The NVIDIA Nemotron Open Model License is not an OSI-approved open-source license. Commercial use requires reading the fine print carefully—particularly around distribution, derivative models, and attribution. This contrasts with DeepSeek V3 (MIT license) and is more restrictive than the Llama Community License for many use cases. The marketing calls it “fully open” but the legal reality is a custom NVIDIA license. For enterprise legal teams, this requires review before production deployment.
NIM Pricing Opacity
NVIDIA announced the model but not the NIM API pricing. For developers who want managed inference without the GPU infrastructure overhead, this is a genuine blocker. The pattern of releasing a major model without API pricing is unusual—and it creates friction for the exact enterprise audience NVIDIA is targeting. Perplexity and Palantir presumably have volume contracts; SMB developers don’t know what they’re signing up for.
Pros and Cons
Pros:
- 5x throughput improvement over previous Nemotron Super—genuinely transformative for multi-agent deployments
- 1M native context window—largest in the open-source 120B parameter class by a wide margin
- Fully open training stack—weights, 10T token dataset, training recipes, NeMo Gym configs all public
- NVIDIA enterprise backing—Perplexity, Palantir, Cadence, Siemens live at launch (real-world validation)
- Controllable reasoning modes—ON/Low Effort/OFF lets you optimize token spend per task type
- Best-in-class agentic benchmark—85.6% on PinchBench as the leading open agentic model at launch
- Multilingual support—7 languages natively, unlike many open models that are English-first
Cons:
- No consumer product—purely a model for developers and enterprises with GPU infrastructure
- NVFP4 is Blackwell-exclusive—max performance requires next-gen GPU hardware most orgs don’t have yet
- NIM API pricing undisclosed—you can’t budget for managed inference without knowing costs
- Not pure OSI open source—NVIDIA Nemotron License requires careful legal review for commercial use
- Below Qwen 3.5 on accuracy benchmarks—MMLU-Pro (83.73% vs 86.70%) and GPQA (79.23% vs 86.60%) without tools
- Text-only—no vision, no audio; competitors like Llama 4 are multimodal
Getting Started
Option 1: Self-host via Hugging Face (FP8 variant, 2x H100-80GB minimum)
```python
# Install dependencies first:
#   pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pull model from Hugging Face
model_id = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Inference with reasoning ON
messages = [{"role": "user", "content": "Analyze this codebase for security vulnerabilities."}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,  # Toggle: True / "low_effort" / False
).to(model.device)

output = model.generate(
    inputs,
    max_new_tokens=2048,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=1.0,
    top_p=0.95,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Option 2: Try via OpenRouter (free, rate-limited)
```bash
curl -X POST https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super-120b-a12b:free",
    "messages": [{"role": "user", "content": "Your prompt here"}]
  }'
```
Option 3: NVIDIA NIM API (build.nvidia.com)
- Create an account at build.nvidia.com
- Navigate to Nemotron 3 Super in the model catalog
- Generate an API key from your NGC account
- Use the OpenAI-compatible endpoint: https://integrate.api.nvidia.com/v1 with model nvidia/nemotron-3-super-120b-a12b
- Set temperature=1.0, top_p=0.95 for all tasks (recommended by NVIDIA across all use cases)
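Since the endpoint is OpenAI-compatible, the request body is a standard chat-completions payload. The sketch below only constructs the payload (sending it requires an NGC API key); the endpoint URL and model id are taken from the steps above, while `max_tokens` is an illustrative choice.

```python
import json

# Chat-completions payload for the OpenAI-compatible NIM endpoint.
# We build the body here without sending it; an NGC API key is required
# to actually call https://integrate.api.nvidia.com/v1/chat/completions.
payload = {
    "model": "nvidia/nemotron-3-super-120b-a12b",
    "messages": [{"role": "user", "content": "Summarize this log file."}],
    "temperature": 1.0,   # NVIDIA-recommended sampling settings
    "top_p": 0.95,
    "max_tokens": 1024,   # illustrative cap, tune per task
}
body = json.dumps(payload)
print(body[:40])
```

Any OpenAI-compatible client library should work the same way—point its base URL at https://integrate.api.nvidia.com/v1 and pass the model id above.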
Option 4: Google Cloud Vertex AI
Available now in Model Garden. Requires a Google Cloud project with Vertex AI enabled. Recommended for organizations already on GCP infrastructure.
For more on building agentic AI pipelines, see our roundup of the best AI models for developers and our Promptfoo review for testing your Nemotron 3 Super deployments.
Final Verdict
Nemotron 3 Super is the most important open-source model launch of early 2026 for one specific buyer: enterprise teams building multi-agent AI systems at scale. The combination of 1M native context, 5x throughput improvement, and a fully open training stack is genuinely novel—nothing else in the open-source space offers all three simultaneously in a 120B-class model.
If you’re running agent pipelines on GPU infrastructure and the “thinking tax” is bleeding your inference budget, evaluate Nemotron 3 Super immediately. The PinchBench 85.6% score and production deployments at Perplexity and Palantir are credible validation signals, not just launch week hype.
If you’re a researcher, the open training stack—10T token dataset, NeMo RL recipes, 21-environment RL configs—is rare transparency from a major lab and worth studying regardless of whether you deploy the model.
Where it falls short: pure accuracy benchmarks (Qwen 3.5 wins), GPU hardware lock-in (Blackwell for max performance), NIM pricing opacity, and the absence of any consumer-facing product. It’s also text-only in a multimodal world.
Rating: 8.2/10. Best open-source model for agentic AI infrastructure. Not the right tool if you don’t have GPU infrastructure—or if you’re waiting for NVIDIA to tell you what NIM inference actually costs.
Frequently Asked Questions
What is Nvidia Nemotron 3 Super?
Nemotron 3 Super is an open-source large language model developed by NVIDIA with 120 billion total parameters and 12 billion active parameters. It uses a hybrid Mamba-Transformer Mixture-of-Experts architecture designed for agentic AI workloads, featuring a 1 million token context window and 5x higher throughput than its predecessor.
When was Nemotron 3 Super released?
Nemotron 3 Super was released on March 11, 2026, announced at NVIDIA’s GTC conference. Weights are available on Hugging Face and via NVIDIA NIM.
How much does Nemotron 3 Super cost?
The model weights are free and open-source under the NVIDIA Nemotron Open Model License. NVIDIA NIM API inference pricing has not been officially announced as of March 2026; check build.nvidia.com for current pricing.
What hardware does Nemotron 3 Super require?
The minimum requirement is 2x H100-80GB GPUs for the FP8 variant. For optimal performance using native NVFP4, NVIDIA Blackwell (B200) GPUs are recommended. Consumer hardware is not supported.
How does Nemotron 3 Super compare to Meta Llama 4?
Both are open-source models in the same parameter class. Nemotron 3 Super has a significantly larger 1M token context window versus Llama 4’s 128K, and achieves higher throughput with its MoE architecture. Llama 4 has broader hardware compatibility and stronger community support, plus multimodal capabilities. Nemotron 3 Super leads on SWE-bench (60.47%) and is NVIDIA-optimized.
What is NVFP4 and why does it matter?
NVFP4 is NVIDIA’s 4-bit floating-point format, used natively during Nemotron 3 Super’s pretraining. On NVIDIA B200 (Blackwell) GPUs, it delivers 4x faster inference than FP8 on H100 while maintaining accuracy. It reduces memory requirements significantly, enabling larger batches and higher throughput at scale. The caveat: NVFP4 benefits are exclusive to Blackwell-generation hardware.
What is the Latent MoE architecture in Nemotron 3 Super?
Latent MoE compresses token embeddings into a lower-dimensional space before routing them to expert layers. This lets Nemotron 3 Super consult 4x as many specialist experts for the same computational cost as a standard MoE model, enabling finer-grained task specialization without increased latency.
Is Nemotron 3 Super truly open source?
Nemotron 3 Super is open-weight with public datasets (10T tokens) and training recipes. However, it uses the NVIDIA Nemotron Open Model License, which is not OSI-approved open source—commercial use restrictions apply. Review the license at nvidia.com before building commercial products.
What is controllable reasoning in Nemotron 3 Super?
Nemotron 3 Super supports three reasoning modes: ON (full chain-of-thought for complex problems), Low Effort (reduced reasoning for speed-sensitive tasks), and OFF (direct responses for simple queries). It is toggled via the enable_thinking parameter in the chat template, set to True, "low_effort", or False respectively.
Where can I try Nemotron 3 Super?
You can access Nemotron 3 Super via Hugging Face (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), NVIDIA NIM API at build.nvidia.com, OpenRouter (free tier available at nvidia/nemotron-3-super-120b-a12b:free), and Google Cloud or Oracle Cloud. AWS and Azure support is coming soon.