NVIDIA launched Nemotron 3 Super at GTC on March 11, 2026, and the benchmark that caught the AI research community’s attention wasn’t just a model score—it was a throughput number. At 5x the inference speed of the previous Nemotron Super and 2.2x the throughput of competing 120B-class models, this is the first open-source model that makes running a 120-billion-parameter model feel like a 12-billion-parameter model at inference time. That’s not marketing. That’s the architecture doing its job.
Rating: 8.2/10 ⭐⭐⭐⭐
What Is Nvidia Nemotron 3 Super?
Nemotron 3 Super is an open-source large language model developed by NVIDIA Corporation and released on March 11, 2026. It sits in the middle of the Nemotron 3 family: above Nano (30B total / 3B active) and below the upcoming Ultra (~500B total / ~50B active). The full parameter count is 120B, but through Mixture-of-Experts routing, only 12B parameters activate per token—delivering frontier-class reasoning at mid-tier inference cost.
Its one-line differentiator: a hybrid Mamba-Transformer MoE model with a native 1 million token context window, optimized from the ground up for multi-agent, long-horizon AI systems. Weights, training data (10 trillion tokens), and training recipes are all public. It’s available on Hugging Face and via NVIDIA NIM.
The Story: Why 5x Throughput Changes the Economics of Agentic AI
Multi-agent AI systems have a dirty secret: they generate up to 15x more tokens than standard chat interactions. Every sub-agent call re-sends history, tool outputs, and reasoning chains. That “context explosion” makes deploying large reasoning models at scale prohibitively expensive—NVIDIA calls it the “thinking tax.”
Nemotron 3 Super is engineered specifically to eliminate that tax. Three architectural decisions make it possible:
1. Hybrid Mamba-Transformer Backbone. Mamba-2 state space model layers handle the bulk of sequence processing at linear time complexity—so a 1 million token context doesn’t blow up memory. Transformer attention layers are interleaved at key depths to preserve precise associative recall (the “needle in a haystack” problem). Combined, the architecture achieves 4x better memory and compute efficiency versus pure Transformer at the same parameter count.
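To make the interleaving concrete, here is a minimal sketch of how a hybrid stack's layer schedule might look. The cadence (one attention layer per six Mamba layers) and the layer count are illustrative assumptions for this sketch, not NVIDIA's published configuration.

```python
# Hypothetical layer schedule for a hybrid Mamba-Transformer stack.
# The interleave ratio (one attention layer per 6 Mamba layers) is an
# illustrative assumption, not NVIDIA's actual published configuration.

def build_layer_schedule(n_layers: int, attention_every: int = 6) -> list[str]:
    """Return a per-layer type list: an attention layer at a fixed
    cadence for associative recall, Mamba (SSM) layers everywhere else
    for linear-time sequence processing."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

schedule = build_layer_schedule(48)
print(schedule.count("mamba"), schedule.count("attention"))  # 40 8
```

The design intuition: most layers only need to carry compressed sequence state forward (cheap, linear), while a sparse minority do full attention to pin down exact token-level lookups.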
2. Latent MoE. Before tokens reach expert layers, they’re compressed into a low-rank latent space. Expert computation happens in this smaller dimension, then projects back. The result: Nemotron 3 Super can call 4x as many expert specialists for the same inference cost as a standard MoE. Granular specialization—Python syntax experts, SQL logic experts, cybersecurity triage experts—at zero added cost per inference.
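A toy forward pass shows where the savings come from: expert matmuls run in the latent dimension, not the model dimension. All hyperparameters below (`d_model=64`, `d_latent=16`, 8 experts, top-2 routing) are made up for illustration and are not Nemotron's actual values.

```python
import numpy as np

# Toy Latent MoE forward pass. Dimensions and routing are illustrative
# assumptions, not Nemotron's real hyperparameters.
d_model, d_latent, n_experts, top_k = 64, 16, 8, 2
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) * 0.1   # compress to latent
W_up = rng.standard_normal((d_latent, d_model)) * 0.1     # project back
router = rng.standard_normal((d_model, n_experts)) * 0.1
experts = rng.standard_normal((n_experts, d_latent, d_latent)) * 0.1

def latent_moe(x: np.ndarray) -> np.ndarray:
    """x: (tokens, d_model) -> (tokens, d_model)."""
    z = x @ W_down                    # expert math happens in d_latent
    logits = x @ router
    out = np.zeros_like(z)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]      # pick top-k experts
        weights = np.exp(logits[t][top])
        weights /= weights.sum()                  # softmax over chosen experts
        for w, e in zip(weights, top):
            out[t] += w * (z[t] @ experts[e])
    return out @ W_up                 # back to d_model

y = latent_moe(rng.standard_normal((4, d_model)))
print(y.shape)  # (4, 64)
```

Because each expert matmul costs O(d_latent²) instead of O(d_model²), shrinking the expert dimension lets you afford many more experts at the same per-token budget—which is the arithmetic behind the "4x as many specialists" claim.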
3. Multi-Token Prediction (MTP). Instead of predicting one token per forward pass, MTP heads forecast multiple future tokens simultaneously. This provides up to a 3x wall-clock speedup on long sequences and acts as built-in speculative decoding without a separate draft model.
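The speculative-decoding mechanic can be sketched with two stand-in models: MTP heads propose k tokens, the base model verifies them, and the longest agreeing prefix is accepted in one pass. Both "models" here are deterministic toys invented for the sketch.

```python
# Toy illustration of MTP acting as built-in speculative decoding.
# base_model and mtp_heads are deterministic stand-ins, not a real network.

def base_model(context: list[int]) -> int:
    """Stand-in next-token rule: next token = (last + 1) % 10."""
    return (context[-1] + 1) % 10

def mtp_heads(context: list[int], k: int = 4) -> list[int]:
    """Stand-in MTP draft: mostly right, but the last head is noisy."""
    draft = [(context[-1] + i) % 10 for i in range(1, k + 1)]
    draft[-1] = 0  # simulate a wrong speculative token
    return draft

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    draft = mtp_heads(context, k)
    accepted = []
    for tok in draft:
        if tok == base_model(context + accepted):   # verify each draft token
            accepted.append(tok)
        else:
            accepted.append(base_model(context + accepted))  # correct it, stop
            break
    return context + accepted

print(speculative_step([3]))  # [3, 4, 5, 6, 7]
```

One verification pass here emits four tokens instead of one—the source of the wall-clock speedup—without a separate draft model, since the MTP heads ride along with the main network.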
The practical outcome: on PinchBench—a benchmark specifically measuring LLM performance as an autonomous agent brain—Nemotron 3 Super scores 85.6%, ranking as the best open model in its class as of launch.
Benchmark Performance
Here’s how Nemotron 3 Super performs against top open-source competition across standard evaluation suites. Note that NVIDIA’s benchmarks are self-reported; third-party reproduction is ongoing as of March 2026.
| Benchmark | Nemotron 3 Super (120B/12B active) | Qwen 3.5 122B | DeepSeek V3 | Meta Llama 4 |
|---|---|---|---|---|
| MMLU-Pro | 83.73% | 86.70% | 81.6% | ~80% |
| GPQA (no tools) | 79.23% | 86.60% | 73.3% | ~75% |
| GPQA (with tools) | 82.70% | N/A | N/A | N/A |
| SWE-bench Verified | 60.47% | ~66% | 49.2% | ~55% |
| HumanEval | 79.40% | ~82% | 82.6% | ~78% |
| PinchBench (Agentic) | 85.6% | N/A | N/A | N/A |
| Throughput vs 120B-class | 2.2x faster | 1x baseline | ~1.1x | ~1x |
| Context Window | 1M tokens | 128K tokens | 128K tokens | 128K tokens |
Sources: NVIDIA Technical Report, llm-stats.com, Artificial Analysis. Competitor figures are independently reported; some are approximate as of March 2026.
The headline: Nemotron 3 Super isn’t the outright accuracy leader (Qwen 3.5 122B outscores it on MMLU-Pro and GPQA without tools), but when you factor in throughput and the 1M context window, it’s the only open model that can run agentic workloads at frontier-class quality without bankrupting your inference budget.
Pricing
| Access Method | Cost | Notes |
|---|---|---|
| Open Weights (HuggingFace) | Free | Self-host. Min 2x H100-80GB. NVIDIA Nemotron Open License. |
| OpenRouter (free tier) | Free (rate-limited) | nvidia/nemotron-3-super-120b-a12b:free |
| NVIDIA NIM API | TBD | Pricing not announced at launch. Check build.nvidia.com. |
| Google Cloud | Cloud GPU rates | Available now via Vertex AI / Model Garden. |
| Oracle Cloud | Cloud GPU rates | Available at launch. AWS/Azure coming. |
For comparison, here is how API pricing stacks up across the open-weight field:

| Model | Open Weights | API Pricing (Input / 1M tokens) |
|---|---|---|
| Nemotron 3 Super | ✅ Free | TBD |
| Meta Llama 4 | ✅ Free | ~$0.40–$0.60 (via cloud APIs) |
| Mistral Large 2 | ✅ Free | ~$2.00 (Mistral API) |
| DeepSeek V3 | ✅ Free | $0.27 (DeepSeek API) |
Note: NIM pricing opacity is a genuine concern covered in the controversy section below.
Key Features
1. Native 1 Million Token Context Window
This isn’t a “theoretical” context window achieved through chunking tricks. Mamba-2 layers provide linear-time sequence complexity, making 1M tokens computationally tractable. For agentic AI, this means an agent can hold an entire codebase, full conversation history, and retrieved document sets in working memory simultaneously. The limitation: maximum context performance requires sufficient GPU memory—running 1M context on minimum hardware (2x H100-80GB) will constrain batch sizes significantly.
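A quick back-of-envelope shows why linear-time layers matter at 1M tokens: a pure Transformer's KV cache grows with every token, while an SSM layer's state is fixed-size. The model dimensions below are illustrative assumptions for the arithmetic, not Nemotron's published config.

```python
# Why a 1M-token KV cache is intractable for a pure Transformer.
# Dimensions (96 layers, 8 KV heads, head_dim 128) are illustrative
# assumptions, not Nemotron's actual configuration.

def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    # 2 tensors (K and V) per layer, per token, at FP16 (2 bytes/value)
    return 2 * tokens * n_layers * n_kv_heads * head_dim * bytes_per_val

gb = kv_cache_bytes(
    tokens=1_000_000, n_layers=96, n_kv_heads=8, head_dim=128
) / 1e9
print(f"{gb:.0f} GB")  # per single sequence
```

Hundreds of gigabytes for a single sequence is why attention-only architectures top out around 128K in practice; a Mamba layer carries a constant-size state regardless of how many tokens it has consumed.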
2. Controllable Reasoning (ON / Low Effort / OFF)
Via the enable_thinking parameter in the chat template, users can dial reasoning intensity to match task complexity. Full reasoning ON for multi-step coding and analysis; Low Effort for moderate complexity; OFF for direct conversational responses. This matters economically—unnecessary chain-of-thought on simple queries burns tokens. The limitation: the Low Effort mode’s quality ceiling compared to full reasoning hasn’t been independently benchmarked at scale yet.
3. NVFP4 Native Pretraining
Nemotron 3 Super is the first model pretrained natively in NVIDIA’s 4-bit floating-point format. On Blackwell B200 GPUs, this delivers 4x faster inference than FP8 on H100—not just quantization, but native precision. The limitation: NVFP4 is exclusive to Blackwell-generation hardware. If you’re running H100 or older, you’re using the FP8 variant at roughly standard speeds. This is the GPU lock-in issue (see Controversy section).
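The memory arithmetic behind the precision tiers is simple, and it explains the 2x H100-80GB floor for the FP8 variant (weights only; KV cache and activations come on top):

```python
# Rough weight-memory arithmetic for a 120B-parameter model at different
# precisions. Ignores per-block scaling factors and runtime overhead,
# which add a few percent in practice.

PARAMS = 120e9

def weights_gb(bits_per_param: float) -> float:
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    print(f"{name}: {weights_gb(bits):.0f} GB")
```

At FP8, 120 GB of weights just fits across 2x H100-80GB with headroom for the KV cache; NVFP4 halves that again, which is where the larger batches and throughput gains on Blackwell come from.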
4. Fully Open Training Stack
NVIDIA released not just weights but the complete training recipe: 10 trillion unique curated tokens (public dataset collections on HuggingFace), NeMo RL post-training code, and NeMo Gym configurations for the 21-environment reinforcement learning setup. Organizations can reproduce, fine-tune, or build custom variants from first principles. Limitation: training infrastructure requirements are substantial—reproducing the full training run requires significant Blackwell cluster access.
5. Multi-Language Support
English, French, German, Italian, Japanese, Spanish, and Chinese are natively supported. This expands the viable deployment base beyond English-only enterprise use cases. SWE-bench Multilingual scores (45.78%) are significantly lower than English SWE-bench (60.47%), signaling that non-English coding tasks see a real quality drop.
6. Enterprise Early Adoption
Perplexity, Palantir, Cadence, and Siemens were live on Nemotron 3 Super at launch—not a beta program, production deployments. Perplexity’s use case (high-volume inference at scale) is the strongest validation signal. Palantir’s adoption (government/defense AI) suggests the model passed security and capability reviews most open models never see. Limitation: enterprise customer case studies with measurable outcomes aren’t public yet.
Who Is It For / Who Should Look Elsewhere
Use Nemotron 3 Super if you:
- Run multi-agent AI pipelines at scale where inference cost and throughput are primary constraints
- Need a 1 million token context window for codebase analysis, long-document RAG, or enterprise knowledge retrieval
- Have or are acquiring NVIDIA Blackwell (B200) hardware and want to extract maximum performance
- Are a researcher who needs an open training stack to study, reproduce, or extend cutting-edge agentic AI
- Build enterprise AI applications in cybersecurity, software development automation, or IT operations
Look elsewhere if you:
- Are a casual user or developer without dedicated GPU infrastructure—this model has no consumer product
- Need pure benchmark accuracy above throughput (Qwen 3.5 122B outscores on MMLU-Pro and GPQA)
- Are running on non-NVIDIA hardware (AMD ROCm, Apple Silicon)—NVFP4 optimizations are Blackwell-exclusive
- Need multimodal capabilities—Nemotron 3 Super is text-only
Comparison Table
| Feature | Nemotron 3 Super | Meta Llama 4 | Mistral Large 2 | DeepSeek V3 |
|---|---|---|---|---|
| Total Parameters | 120B (12B active) | ~70B | 123B | 671B (37B active) |
| Architecture | Hybrid Mamba-Transformer + Latent MoE | Dense Transformer | Dense Transformer | MoE Transformer |
| Context Window | 1M tokens | 128K tokens | 128K tokens | 128K tokens |
| Open Weights | ✅ Yes | ✅ Yes (Llama license) | ✅ Yes (MRL) | ✅ Yes (MIT) |
| License | NVIDIA Nemotron Open Model License | Llama Community License | Mistral Research License | MIT |
| API Pricing (per 1M in tokens) | TBD (NIM) | ~$0.40–0.60 | ~$2.00 | $0.27 |
| Best For | Agentic AI, long-context, high-throughput | General purpose, broad ecosystem | European data residency, multilingual | Cost-efficient inference, coding |
| Min Hardware | 2x H100-80GB | 1x A100-80GB (quantized) | 2x A100-80GB | 4x H100-80GB |
| Multimodal | ❌ Text only | ✅ Text + Image | ❌ Text only | ❌ Text only |
| SWE-bench Verified | 60.47% | ~55% | ~44% | 49.2% |
| Controllable Reasoning | ✅ ON / Low / OFF | ❌ | ❌ | ✅ Extended Thinking (V3) |
| Cloud Availability | Google Cloud, Oracle (AWS/Azure soon) | All major clouds | All major clouds | DeepSeek API + major clouds |
| Training Data | 25T tokens (10T public) | Undisclosed | Undisclosed | Undisclosed |
Controversy: What They Don’t Advertise
The GPU Lock-In Problem
NVFP4 is the headline performance claim—4x faster on B200 vs FP8 on H100. But read the fine print: NVFP4 is exclusive to NVIDIA Blackwell architecture. If you’re running a current H100 cluster (the most common enterprise GPU deployment as of 2026), you get the FP8 variant at standard H100 speeds. The “5x throughput” headline improvement is largely an architectural upgrade plus the B200-specific NVFP4 gains compounded together. Organizations on pre-Blackwell hardware get a genuinely better model, but not the full marketing headline. This isn’t disclosed prominently in NVIDIA’s launch materials.
NVIDIA’s Vertical Integration Play
Think about the positioning: NVIDIA releases a state-of-the-art open model that performs best on NVIDIA hardware (Blackwell), distributed via NVIDIA NIM API (with undisclosed pricing), accessible through NVIDIA-preferred cloud partners (Google Cloud, Oracle), using NVIDIA’s NeMo training stack, and benchmarked on a new benchmark (PinchBench) where it scores highest. Every layer of the stack—chips, software, distribution, evaluation—is NVIDIA. The “open” model is genuinely open in terms of weights and training data, but the ecosystem moat is pure NVIDIA. Whether this is a strategic lock-in play or just good vertical integration depends on your tolerance for single-vendor dependency.
Open Weights ≠ Open Source
The NVIDIA Nemotron Open Model License is not an OSI-approved open-source license. Commercial use requires reading the fine print carefully—particularly around distribution, derivative models, and attribution. This contrasts with DeepSeek V3 (MIT license) and is more restrictive than the Llama Community License for many use cases. The marketing calls it “fully open” but the legal reality is a custom NVIDIA license. For enterprise legal teams, this requires review before production deployment.
NIM Pricing Opacity
NVIDIA announced the model but not the NIM API pricing. For developers who want managed inference without the GPU infrastructure overhead, this is a genuine blocker. The pattern of releasing a major model without API pricing is unusual—and it creates friction for the exact enterprise audience NVIDIA is targeting. Perplexity and Palantir presumably have volume contracts; SMB developers don’t know what they’re signing up for.
Pros and Cons
Pros:
- 5x throughput improvement over previous Nemotron Super—genuinely transformative for multi-agent deployments
- 1M native context window—largest in the open-source 120B parameter class by a wide margin
- Fully open training stack—weights, 10T token dataset, training recipes, NeMo Gym configs all public
- NVIDIA enterprise backing—Perplexity, Palantir, Cadence, Siemens live at launch (real-world validation)
- Controllable reasoning modes—ON/Low Effort/OFF lets you optimize token spend per task type
- Best-in-class agentic benchmark—85.6% on PinchBench as the leading open agentic model at launch
- Multilingual support—7 languages natively, unlike many open models that are English-first
Cons:
- No consumer product—purely a model for developers and enterprises with GPU infrastructure
- NVFP4 is Blackwell-exclusive—max performance requires next-gen GPU hardware most orgs don’t have yet
- NIM API pricing undisclosed—you can’t budget for managed inference without knowing costs
- Not pure OSI open source—NVIDIA Nemotron License requires careful legal review for commercial use
- Below Qwen 3.5 on accuracy benchmarks—MMLU-Pro (83.73% vs 86.70%) and GPQA (79.23% vs 86.60%) without tools
- Text-only—no vision, no audio; competitors like Llama 4 are multimodal
Getting Started
Option 1: Self-host via Hugging Face (FP8 variant, 2x H100-80GB minimum)
```python
# Install dependencies first:
#   pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pull model from Hugging Face
model_id = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Inference with reasoning ON
messages = [{"role": "user", "content": "Analyze this codebase for security vulnerabilities."}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=True,  # Toggle: True / "low_effort" / False
).to(model.device)

output = model.generate(
    inputs,
    max_new_tokens=2048,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=1.0,
    top_p=0.95,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Option 2: Try via OpenRouter (free, rate-limited)
```bash
curl -X POST https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-super-120b-a12b:free",
    "messages": [{"role": "user", "content": "Your prompt here"}]
  }'
```
Option 3: NVIDIA NIM API (build.nvidia.com)
- Create an account at build.nvidia.com
- Navigate to Nemotron 3 Super in the model catalog
- Generate an API key from your NGC account
- Use the OpenAI-compatible endpoint: https://integrate.api.nvidia.com/v1 with model nvidia/nemotron-3-super-120b-a12b
- Set temperature=1.0, top_p=0.95 for all tasks (recommended by NVIDIA across all use cases)
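Since the endpoint is OpenAI-compatible, the request body is a standard chat-completions payload. The sketch below only constructs the payload (sending it requires an NGC API key); the endpoint URL and model id are taken from the steps above, while `max_tokens` is an illustrative choice.

```python
import json

# Chat-completions payload for the OpenAI-compatible NIM endpoint.
# We build the body here without sending it; an NGC API key is required
# to actually call https://integrate.api.nvidia.com/v1/chat/completions.
payload = {
    "model": "nvidia/nemotron-3-super-120b-a12b",
    "messages": [{"role": "user", "content": "Summarize this log file."}],
    "temperature": 1.0,   # NVIDIA-recommended sampling settings
    "top_p": 0.95,
    "max_tokens": 1024,   # illustrative cap, tune per task
}
body = json.dumps(payload)
print(body[:40])
```

Any OpenAI-compatible client library should work the same way—point its base URL at https://integrate.api.nvidia.com/v1 and pass the model id above.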
Option 4: Google Cloud Vertex AI
Available now in Model Garden. Requires a Google Cloud project with Vertex AI enabled. Recommended for organizations already on GCP infrastructure.
For more on building agentic AI pipelines, see our roundup of the best AI models for developers and our Promptfoo review for testing your Nemotron 3 Super deployments.
Final Verdict
Nemotron 3 Super is the most important open-source model launch of early 2026 for one specific buyer: enterprise teams building multi-agent AI systems at scale. The combination of 1M native context, 5x throughput improvement, and a fully open training stack is genuinely novel—nothing else in the open-source space offers all three simultaneously in a 120B-class model.
If you’re running agent pipelines on GPU infrastructure and the “thinking tax” is bleeding your inference budget, evaluate Nemotron 3 Super immediately. The PinchBench 85.6% score and production deployments at Perplexity and Palantir are credible validation signals, not just launch week hype.
If you’re a researcher, the open training stack—10T token dataset, NeMo RL recipes, 21-environment RL configs—is rare transparency from a major lab and worth studying regardless of whether you deploy the model.
Where it falls short: pure accuracy benchmarks (Qwen 3.5 wins), GPU hardware lock-in (Blackwell for max performance), NIM pricing opacity, and the absence of any consumer-facing product. It’s also text-only in a multimodal world.
Rating: 8.2/10. Best open-source model for agentic AI infrastructure. Not the right tool if you don’t have GPU infrastructure—or if you’re waiting for NVIDIA to tell you what NIM inference actually costs.
Frequently Asked Questions
What is Nvidia Nemotron 3 Super?
Nemotron 3 Super is an open-source large language model developed by NVIDIA with 120 billion total parameters and 12 billion active parameters. It uses a hybrid Mamba-Transformer Mixture-of-Experts architecture designed for agentic AI workloads, featuring a 1 million token context window and 5x higher throughput than its predecessor.
When was Nemotron 3 Super released?
Nemotron 3 Super was released on March 11, 2026, announced at NVIDIA’s GTC conference. Weights are available on Hugging Face and via NVIDIA NIM.
How much does Nemotron 3 Super cost?
The model weights are free and open-source under the NVIDIA Nemotron Open Model License. NVIDIA NIM API inference pricing has not been officially announced as of March 2026; check build.nvidia.com for current pricing.
What hardware does Nemotron 3 Super require?
The minimum requirement is 2x H100-80GB GPUs for the FP8 variant. For optimal performance using native NVFP4, NVIDIA Blackwell (B200) GPUs are recommended. Consumer hardware is not supported.
How does Nemotron 3 Super compare to Meta Llama 4?
Both are open-source models in the same parameter class. Nemotron 3 Super has a significantly larger 1M token context window versus Llama 4’s 128K, and achieves higher throughput with its MoE architecture. Llama 4 has broader hardware compatibility and stronger community support, plus multimodal capabilities. Nemotron 3 Super leads on SWE-bench (60.47%) and is NVIDIA-optimized.
What is NVFP4 and why does it matter?
NVFP4 is NVIDIA’s 4-bit floating-point format, used natively during Nemotron 3 Super’s pretraining. On NVIDIA B200 (Blackwell) GPUs, it delivers 4x faster inference than FP8 on H100 while maintaining accuracy. It reduces memory requirements significantly, enabling larger batches and higher throughput at scale. The caveat: NVFP4 benefits are exclusive to Blackwell-generation hardware.
What is the Latent MoE architecture in Nemotron 3 Super?
Latent MoE compresses token embeddings into a lower-dimensional space before routing them to expert layers. This lets Nemotron 3 Super consult 4x as many specialist experts for the same computational cost as a standard MoE model, enabling finer-grained task specialization without increased latency.
Is Nemotron 3 Super truly open source?
Nemotron 3 Super is open-weight with public datasets (10T tokens) and training recipes. However, it uses the NVIDIA Nemotron Open Model License, which is not OSI-approved open source—commercial use restrictions apply. Review the license at nvidia.com before building commercial products.
What is controllable reasoning in Nemotron 3 Super?
Nemotron 3 Super supports three reasoning modes: ON (full chain-of-thought for complex problems), Low Effort (reduced reasoning for speed-sensitive tasks), and OFF (direct responses for simple queries). It is toggled via the enable_thinking parameter in the chat template, set to True, "low_effort", or False respectively.
Where can I try Nemotron 3 Super?
You can access Nemotron 3 Super via Hugging Face (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), NVIDIA NIM API at build.nvidia.com, OpenRouter (free tier available at nvidia/nemotron-3-super-120b-a12b:free), and Google Cloud or Oracle Cloud. AWS and Azure support is coming soon.