Mercury 2 Review 2026: Inception Labs’ Diffusion LLM That Hits 1,000 Tokens Per Second

Why you can trust ComputerTech — We spend hours hands-on testing every AI tool we review, so you get honest assessments, not marketing fluff.
Published March 13, 2026 · Updated March 16, 2026

On February 24, 2026, Inception Labs launched Mercury 2 — and it doesn’t use transformers. While every other “fast” LLM is still decoding one token at a time, Mercury 2 generates entire drafts in parallel using diffusion architecture, breaking 1,000 tokens per second on NVIDIA Blackwell GPUs. That’s not incremental improvement. That’s a different class of machine.

The practical implication: agentic workflows that chain 50 LLM calls can now run in the time it takes GPT-4o-mini to handle 10. At $0.25/1M input tokens, you get that speed at a cost that makes high-volume production AI actually viable.

Rating: 7.8/10 ⭐⭐⭐⭐

What Is Mercury 2?

Mercury 2 is a production-grade diffusion large language model (dLLM) built by Inception Labs, launched February 24, 2026. Unlike every mainstream LLM — GPT, Claude, Gemini, Llama — Mercury 2 does not generate tokens left-to-right, one at a time. It uses a diffusion-based parallel refinement process: it outputs a rough full response, then iteratively refines it across a small number of passes. Less typewriter, more editor revising a full draft at once.

The result is the world’s fastest publicly available reasoning LLM. It’s not the most capable model for frontier reasoning tasks — but for latency-sensitive production applications, it operates in a different league. With a 128K context window, native tool use, schema-aligned JSON output, and OpenAI API compatibility, it drops into existing stacks without a rewrite.

The Story: 1,009 Tokens Per Second and a New Architecture Era

When Inception Labs claimed “5x faster than speed-optimized LLMs,” that sounded like marketing. The benchmark data makes it concrete: Mercury 2 achieves 1,009 tokens per second on NVIDIA Blackwell hardware, against roughly 71 tps for GPT-4o-mini and 89 tps for Claude 3.5 Haiku. That’s not 5x. Depending on the comparison, it’s closer to 11–14x.

NVIDIA called it out directly at launch: “Surpassing 1,000 tokens per second on NVIDIA GPUs underscores the performance, scalability, and versatility of our platform.” — Shruti Koparkar, Senior Manager, Accelerated Computing, NVIDIA.

For single-shot prompts, that speed matters somewhat. For agentic loops — where an AI agent makes 20–100 LLM calls per task — it changes the entire economics. Skyvern’s CTO put it bluntly: “Mercury 2 is at least twice as fast as GPT-5.2, which is a game changer for us.” Zed co-founder Max Brunsfeld: “Suggestions land fast enough to feel like part of your own thinking, not something you have to wait for.”

On quality benchmarks, Mercury 2 is competitive but not dominant. It scores 73.6–77% on GPQA and 91.1 on AIME 2025 — solid reasoning numbers that put it in the same tier as the speed-optimized competition, not above the frontier. The Artificial Analysis Intelligence Index places it at 33/100, ranking 22nd out of 132 models. Real talk: Mercury 2 is not the smartest model on the market. It’s the fastest mid-tier reasoning model ever deployed in production. That’s the correct framing.

Benchmark Performance

| Benchmark | Mercury 2 | GPT-4o-mini | Claude 3.5 Haiku | Gemini 2.5 Flash |
| --- | --- | --- | --- | --- |
| Speed (tokens/sec) | ~1,009 | ~71 | ~89 | ~71 |
| Input cost (per 1M tokens) | $0.25 | $0.15 | $0.80 | $0.30 |
| Output cost (per 1M tokens) | $0.75 | $0.60 | $4.00 | $2.50 |
| GPQA (reasoning) | 73.6–77% | ~40% | 67.2% | ~75% |
| AIME 2025 (math) | 91.1 | N/A | N/A | ~85 |
| Coding | 67.3 (LiveCodeBench) | 87.2% (HumanEval) | 93.7% (HumanEval) | ~88% (HumanEval) |
| Context window | 128K | 128K | 200K | 1M |
| AA Intelligence Index | 33/100 | ~45/100 | ~42/100 | ~70/100 |

Sources: Inception Labs, Artificial Analysis, OpenAI documentation, Anthropic documentation, Google DeepMind. Speed figures vary by hardware configuration; the Mercury 2 figure was measured on NVIDIA Blackwell GPUs.

Pricing

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context |
| --- | --- | --- | --- |
| Mercury 2 | $0.25 | $0.75 | 128K |
| GPT-4o-mini | $0.15 | $0.60 | 128K |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | 1M+ |

Mercury 2’s pricing is genuinely competitive. Output tokens at $0.75/1M beat Claude Haiku by 5x and Gemini Flash by 3x. GPT-4o-mini output is slightly cheaper at $0.60, but Mercury 2 is roughly 14x faster on throughput — so at scale, you’re doing more in the same wall-clock time for comparable spend. Enterprise pricing and dedicated deployments are available directly through Inception Labs.
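To make the “more in the same wall-clock time” argument concrete, here is a back-of-envelope sketch using this review’s list prices and throughput figures. The token volume and single-stream framing are illustrative assumptions, not measurements:

```python
# ($ per 1M output tokens, tokens/sec) -- figures from the tables in this review.
MODELS = {
    "Mercury 2":        (0.75, 1009),
    "GPT-4o-mini":      (0.60, 71),
    "Claude 3.5 Haiku": (4.00, 89),
    "Gemini 2.5 Flash": (2.50, 71),
}

def output_cost_usd(tokens: int, price_per_million: float) -> float:
    """Dollar cost of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million

def decode_hours(tokens: int, tokens_per_sec: float) -> float:
    """Hours of single-stream decoding to emit `tokens` output tokens."""
    return tokens / tokens_per_sec / 3600

tokens = 10_000_000  # an arbitrary high-volume workload for comparison
for name, (price, tps) in MODELS.items():
    print(f"{name}: ${output_cost_usd(tokens, price):.2f}, "
          f"{decode_hours(tokens, tps):.1f} h single-stream")
```

At these numbers, Mercury 2 emits 10M output tokens for $7.50 in under 3 hours of decode time, where GPT-4o-mini needs roughly 39 hours for $6.00 — the spend is comparable, the wall-clock is not.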

Key Features

1. Parallel Diffusion Generation

Mercury 2 doesn’t predict the next token — it generates a full draft and iteratively refines it in parallel passes. This is the core architectural difference from every transformer-based model on the market. In practice, it means latency is nearly flat for short responses and dramatically better for long ones. The limitation: for very short responses (10–20 tokens), diffusion’s fixed-pass overhead can actually be slower than autoregressive models. Don’t use Mercury 2 when you need single-word answers at massive scale.

2. Tunable Reasoning Budget

Mercury 2 supports tunable reasoning — you can configure how many refinement passes it makes, trading quality for speed or vice versa. At minimum passes, it’s a raw speed machine. At max, it approaches reasoning-grade quality. This dial doesn’t exist on autoregressive models in the same form. The catch: the quality ceiling at max passes is still mid-tier reasoning, not frontier. It won’t replace o3 for hard math.

3. Native Tool Use & Structured JSON Output

Built-in tool use and schema-aligned JSON output make Mercury 2 directly useful in production agentic systems — not a future roadmap item. The model is tested for multi-hop retrieval, function calling, and structured extraction. In agentic loops, this matters because each step is a separate inference call; cutting per-call latency 10x means you can afford 10x more steps in the same user-facing time budget.
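A sketch of what a schema-aligned extraction call might look like, assuming Mercury 2 mirrors the OpenAI Chat Completions `response_format` convention as the review describes. The model string is an assumption (the article itself says “inception/mercury-2 or similar”) — confirm it against Inception Labs’ documentation:

```python
import json

# Hypothetical model string -- verify against Inception Labs' docs.
MODEL = "inception/mercury-2"

def build_extraction_request(text: str) -> dict:
    """Build an OpenAI-style chat payload requesting schema-aligned JSON output."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "Extract the fields defined by the schema."},
            {"role": "user", "content": text},
        ],
        # Structured-output request in the OpenAI Chat Completions shape that
        # the review says Mercury 2 is compatible with.
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "ticket",
                "schema": {
                    "type": "object",
                    "properties": {
                        "priority": {"type": "string",
                                     "enum": ["low", "medium", "high"]},
                        "summary": {"type": "string"},
                    },
                    "required": ["priority", "summary"],
                },
            },
        },
    }

payload = build_extraction_request("Checkout page 500s for EU users since 09:00.")
print(json.dumps(payload, indent=2))
```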

4. OpenAI API Compatibility

Drop-in replacement for existing OpenAI API integrations. No SDK changes, no prompt reformatting. If your stack already calls the OpenAI API, you point the base URL at Inception Labs and swap the model string. This is a real decision by a company that understands adoption friction. The limitation: it’s not a hosted consumer product — this is an API play for developers and production teams, not end users.

5. 128K Context Window

Competitive context window for most production use cases. Long-document analysis, multi-turn agents, RAG pipelines — 128K covers the overwhelming majority of real workloads. It falls short against Claude’s 200K and Gemini Flash’s 1M for edge cases like full codebase analysis or very long document processing. For most agentic work: not a bottleneck.

Who Is It For / Who Should Look Elsewhere

Use Mercury 2 if you:

  • Build agentic pipelines with 20+ LLM calls per task — the compounding latency reduction is where Mercury 2’s advantage is most dramatic
  • Run real-time voice AI or conversational agents — natural speech cadences require sub-500ms latency; Mercury 2 is one of the few models that can deliver this with reasoning-grade quality
  • Operate high-volume inference at scale — structured extraction, classification, RAG reranking at millions of calls per day where output cost per token is the primary constraint
  • Need coding autocomplete or next-edit suggestion speed — the “feels like part of your own thinking” bar requires under 50ms perceived latency; Mercury 2 hits it where other models don’t
  • Want to experiment with a novel architecture before the rest of the market catches up — diffusion LLMs are early; getting experience with the paradigm now has strategic value

Look elsewhere if you:

  • Need frontier reasoning quality — for complex multi-step math, scientific analysis, or hard coding problems, o3 or Claude Opus remain significantly stronger
  • Require very long context windows (>128K) — Gemini Flash’s 1M context window is unmatched for full-codebase or long-document tasks
  • Are building a consumer-facing chat product — Mercury 2 is API-only; there’s no UI to hand to non-technical users
  • Generate primarily very short responses at high volume — diffusion’s fixed-pass overhead doesn’t pay off for 5–10 token outputs

Comparison Table: Mercury 2 vs The Competition

| Feature | Mercury 2 | GPT-4o-mini | Claude 3.5 Haiku | Gemini 2.5 Flash |
| --- | --- | --- | --- | --- |
| Architecture | Diffusion (parallel) | Transformer (autoregressive) | Transformer (autoregressive) | Transformer (autoregressive) |
| Speed | ~1,009 tps ⚡ | ~71 tps | ~89 tps | ~71 tps |
| Input pricing | $0.25/1M | $0.15/1M | $0.80/1M | $0.30/1M |
| Output pricing | $0.75/1M ✅ | $0.60/1M | $4.00/1M | $2.50/1M |
| Context window | 128K | 128K | 200K | 1M |
| Reasoning quality | Mid-tier (GPQA 73–77%) | Lower (GPQA ~40%) | Mid-tier (GPQA 67%) | High (GPQA ~75%) |
| Native tool use | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Structured JSON output | ✅ Schema-aligned | ✅ Yes | ✅ Yes | ✅ Yes |
| OpenAI API compatible | ✅ Drop-in | Native | ❌ Requires SDK | ❌ Requires SDK |
| Best for | Agentic loops, voice, real-time | General-purpose, cost-sensitive | Document analysis, writing | Long context, multimodal |

Controversy: What They Don’t Advertise About Diffusion LLMs

Mercury 2 is genuinely fast. The architecture is genuinely novel. But diffusion LLMs come with real trade-offs that the launch blog doesn’t lead with.

Fixed-output overhead problem. Diffusion models perform a fixed number of denoising passes regardless of output length. For responses under 20–30 tokens, this overhead can actually make Mercury 2 slower than an autoregressive model. If your app is generating short structured outputs — classification labels, yes/no answers, single-sentence responses — the speed advantage evaporates or reverses.
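The breakeven logic can be sketched with a toy latency model. Every constant here is an illustrative assumption, not a measurement — the point is the shape of the curves, not the numbers:

```python
# Toy latency model: an autoregressive model pays per token; a diffusion model
# pays a roughly fixed denoising-schedule cost regardless of output length.
MS_PER_TOKEN_AR = 14.0   # assumed: ~71 tps autoregressive -> ~14 ms/token
DIFFUSION_PASSES = 8     # assumed fixed number of refinement passes
MS_PER_PASS = 40.0       # assumed cost per parallel refinement pass

def ar_ms(tokens: int) -> float:
    """Autoregressive latency grows linearly with output length."""
    return tokens * MS_PER_TOKEN_AR

def diffusion_ms(tokens: int) -> float:
    """Diffusion latency is ~flat in output length (tokens unused on purpose)."""
    return DIFFUSION_PASSES * MS_PER_PASS

for n in (10, 25, 100, 500):
    winner = "diffusion" if diffusion_ms(n) < ar_ms(n) else "autoregressive"
    print(f"{n:4d} tokens: AR {ar_ms(n):7.0f} ms, "
          f"diffusion {diffusion_ms(n):4.0f} ms -> {winner}")
```

With these assumed constants the crossover sits near 320 / 14 ≈ 23 tokens — consistent with the 20–30 token breakeven described above. Below it, the fixed schedule dominates; above it, diffusion pulls away fast.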

Quality ceiling is real. Mercury 2 scores 33/100 on the Artificial Analysis Intelligence Index, ranking 22nd out of 132 models. That puts it firmly in the “fast mid-tier” category, not frontier. The GPQA score of 73–77% looks good in isolation, but Gemini 2.5 Flash — which is also a speed-optimized model — matches it on reasoning while offering a 1M context window at comparable pricing. Mercury 2’s moat is speed, not intelligence.

Scaling laws are uncharted. Diffusion LLMs are a new paradigm. The scaling behaviors that are well-understood for transformers — how quality improves with more parameters, more data, more compute — aren’t established for diffusion LLMs at large scale. Mercury 2 is built on roughly 8B parameters. Whether the architecture can scale to 70B+ with proportional quality gains is an open research question, not a proven roadmap.

Complex sequential reasoning may be a structural weakness. Research suggests that for tasks requiring very low sequence error rates — where one wrong token in a chain of logic breaks everything — diffusion models may need linearly more sampling steps to match autoregressive accuracy. The parallel-refinement approach that makes Mercury 2 fast may be a structural disadvantage for sequential, step-dependent reasoning chains.

Smaller ecosystem and less community tooling. GPT-4o-mini and Claude Haiku have years of community-built prompt libraries, eval frameworks, and production case studies. Mercury 2 is new. Debugging diffusion model behavior, calibrating prompts for it, and understanding its failure modes are all on you — there’s no Stack Overflow corpus for this yet.

Pros and Cons

Pros

  • World’s fastest publicly available reasoning LLM — 1,009 tokens/sec on NVIDIA Blackwell, 11–14x faster than autoregressive competitors
  • Output cost ($0.75/1M) dramatically undercuts Claude Haiku ($4.00) and Gemini Flash ($2.50), enabling high-volume production viability
  • OpenAI API drop-in compatibility — zero integration overhead for existing stacks
  • Native tool use and schema-aligned JSON output built in, not bolted on
  • Tunable reasoning budget gives developers real control over the quality/speed trade-off
  • Real agentic loop performance validated by Skyvern, Wispr Flow, and Viant at production scale
  • GPQA score of 73–77% is stronger than GPT-4o-mini’s ~40% — better reasoning than the pricing suggests

Cons

  • Intelligence Index rank of 22nd/132 — not a frontier reasoning model; don’t use it for hard math or complex multi-step logic
  • Slower than autoregressive for very short outputs (under ~20 tokens) due to fixed diffusion pass overhead
  • 128K context window is competitive but falls short of Claude (200K) and Gemini Flash (1M) for long-document tasks
  • No consumer UI — API-only, developer-facing product; not a ChatGPT replacement
  • Diffusion LLM scaling laws are unproven; future capability trajectory is uncertain compared to transformers
  • Small ecosystem: limited community tooling, prompt libraries, and production case studies vs. GPT/Claude

Getting Started with Mercury 2

Mercury 2 is OpenAI API compatible, so integration is minimal friction if you’re already in that ecosystem.

  1. Get API access. Visit inceptionlabs.ai and request API access or sign up for the developer tier. Enterprise teams can contact Inception Labs directly for dedicated deployment options.
  2. Install the OpenAI Python SDK (if not already installed): pip install openai. You don’t need a new SDK — Mercury 2 speaks the same protocol.
  3. Update your base URL. In your existing OpenAI client configuration, set the base URL to Inception Labs’ API endpoint and swap your API key for the one provided by Inception. The model string will be inception/mercury-2 or similar — confirm in their documentation.
  4. Test with an agentic loop benchmark. Don’t just run a single prompt. Mercury 2’s advantage compounds across multi-step calls. Build a 10-step agent workflow, run it on Mercury 2 vs. your current model, and measure total wall-clock time. The gap will be larger than you expect.
  5. Tune the reasoning budget. Start with the default reasoning pass configuration and then experiment with lower pass counts for latency-sensitive paths (autocomplete, voice) and higher counts for tasks where quality matters more. This is a parameter other models don’t expose.
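Steps 1–3 amount to pointing an OpenAI-shaped request at a different host. A minimal sketch, assuming a hypothetical endpoint and model string (confirm both in Inception Labs’ documentation) — with the official OpenAI Python SDK the same swap is just `OpenAI(base_url=..., api_key=...)` plus the new model string:

```python
import json
import urllib.request

BASE_URL = "https://api.inceptionlabs.ai/v1"  # hypothetical -- check the docs
MODEL = "inception/mercury-2"                 # hypothetical -- check the docs

def chat_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (not send) an OpenAI-compatible chat completion request."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("YOUR_KEY", "Say hello in five words.")
print(req.full_url)  # https://api.inceptionlabs.ai/v1/chat/completions
```

The request body and headers are unchanged from a stock OpenAI integration — which is the whole point of the compatibility claim.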

Frequently Asked Questions

What is Mercury 2?

Mercury 2 is a diffusion-based large language model (dLLM) developed by Inception Labs and launched February 24, 2026. Unlike traditional autoregressive models (GPT, Claude, Gemini), it generates text through parallel refinement rather than sequential token-by-token decoding, achieving over 1,000 tokens per second in production.

How fast is Mercury 2 compared to GPT-4o-mini?

Mercury 2 achieves approximately 1,009 tokens per second on NVIDIA Blackwell GPUs, compared to roughly 71 tokens per second for GPT-4o-mini. That’s approximately 14x faster throughput. Inception Labs claims 5–11x faster than speed-optimized LLMs depending on configuration and hardware.
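Taking those throughput figures at face value, the gap compounds over a multi-step agent. A rough estimate, with step count and tokens-per-step as illustrative assumptions:

```python
def chain_seconds(steps: int, tokens_per_step: int, tokens_per_sec: float) -> float:
    """Total decode time for an agent chain of `steps` sequential LLM calls."""
    return steps * tokens_per_step / tokens_per_sec

# 10-step agent emitting ~300 output tokens per step, using this review's tps figures.
mercury = chain_seconds(10, 300, 1009)
gpt4o_mini = chain_seconds(10, 300, 71)
print(f"Mercury 2:   {mercury:.1f} s")              # ~3.0 s
print(f"GPT-4o-mini: {gpt4o_mini:.1f} s")           # ~42.3 s
print(f"speedup:     {gpt4o_mini / mercury:.1f}x")  # ~14.2x
```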

How much does Mercury 2 cost?

Mercury 2 is priced at $0.25 per million input tokens and $0.75 per million output tokens. This makes it significantly cheaper on output than Claude 3.5 Haiku ($4.00/1M) and Gemini 2.5 Flash ($2.50/1M), while being slightly more expensive than GPT-4o-mini ($0.60/1M output).

What makes Mercury 2 different from GPT-4o-mini and Claude Haiku?

The fundamental difference is architecture. GPT-4o-mini and Claude Haiku are autoregressive transformers — they generate one token at a time, left to right. Mercury 2 uses diffusion-based parallel generation, producing a full draft and refining it across multiple passes simultaneously. This makes it 10–14x faster, though the quality ceiling for complex reasoning is lower.

Is Mercury 2 good for agentic AI workflows?

Yes — agentic workflows are Mercury 2’s primary strength. Because agents chain dozens or hundreds of LLM calls per task, Mercury 2’s per-call speed advantage compounds dramatically. Companies like Skyvern and Viant have deployed it specifically for agentic use cases and report significant performance improvements. It supports native tool use and structured JSON output.

What are the limitations of Mercury 2?

Mercury 2’s key limitations include: (1) mid-tier reasoning quality — it ranks 22nd/132 on Artificial Analysis, not a frontier model; (2) slower than autoregressive for very short outputs due to fixed diffusion pass overhead; (3) 128K context window vs. Gemini Flash’s 1M; (4) API-only, no consumer UI; and (5) an unproven scaling trajectory for the diffusion architecture at larger parameter counts.

Is Mercury 2 OpenAI API compatible?

Yes. Mercury 2 is fully OpenAI API compatible. If your stack already uses the OpenAI API, you can switch to Mercury 2 by updating the base URL and API key — no code changes, no prompt reformatting, no new SDKs required.

What is a diffusion LLM and how is it different from a regular LLM?

Traditional LLMs (GPT, Claude, Gemini) use autoregressive decoding — generating one token at a time in sequence, like typing. Diffusion LLMs work like editing: they generate an entire draft all at once, then refine it iteratively over multiple passes. This parallel approach is dramatically faster for medium-to-long outputs, but behaves differently for short outputs and sequential reasoning chains.
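The control-flow difference can be caricatured in a few lines. This toy ignores the actual denoising math entirely — it only contrasts resolving a whole draft in a couple of parallel passes with decoding one token per step:

```python
# Toy illustration: a fully "masked" draft where whole groups of positions are
# filled in per pass, instead of one token at a time left-to-right.
target = "the quick brown fox jumps over the lazy dog".split()
draft = ["_"] * len(target)

# Deterministic two-pass schedule: even positions first, then odd positions.
for p, positions in enumerate([range(0, len(target), 2),
                               range(1, len(target), 2)]):
    for i in positions:
        draft[i] = target[i]
    print(f"pass {p + 1}:", " ".join(draft))

# The 9-token draft resolves in 2 passes; an autoregressive decoder would
# take 9 sequential steps.
assert draft == target
```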

Should I use Mercury 2 for voice AI applications?

Mercury 2 is one of the strongest options for real-time voice AI. Voice interfaces require sub-500ms latency to feel natural — a bar most LLMs can’t consistently hit under load. Mercury 2’s throughput enables reasoning-quality responses within natural speech cadences. OpenCall and Happyverse AI have both deployed it specifically for voice applications.

Is Mercury 2 worth it in 2026?

For latency-sensitive production applications — agentic workflows, real-time voice, coding autocomplete, high-volume RAG pipelines — Mercury 2 is worth serious evaluation. For general-purpose AI work, frontier reasoning tasks, or consumer-facing products, it’s not the right fit. The architecture is genuinely novel and the speed advantage is real. If your bottleneck is inference latency, Mercury 2 is the most effective solution on the market right now.

Final Verdict

Mercury 2 is the most important AI infrastructure launch of early 2026 that most people aren’t talking about. Not because it’s the smartest model — it’s not — but because it proves diffusion LLMs can hit production at the quality tier where most commercial work actually lives. 1,009 tokens per second is not a benchmark trick. It’s a fundamentally different cost structure for anyone running agents at scale.

If you’re building latency-sensitive applications — voice AI, coding agents, real-time search, multi-step agentic workflows — Mercury 2 deserves immediate evaluation. The OpenAI API compatibility means the test costs you an afternoon, not a sprint. If speed is your bottleneck, this is the most direct solution available right now.

If you need frontier reasoning quality, or you’re building general-purpose AI products, or you need 1M+ context windows — keep your current stack and revisit diffusion LLMs in 12 months when the architecture matures. The quality ceiling is real, and the ecosystem is thin.

Mercury 2 Review Score: 7.8/10 — A genuine architectural breakthrough deployed in production. Narrow use case, but near-perfect execution within it. The first diffusion LLM worth serious production consideration.


ComputerTech Editorial Team

Our team tests every AI tool hands-on before reviewing it. With 126+ tools evaluated across 8 categories, we focus on real-world performance, honest pricing analysis, and practical recommendations. Learn more about our review process →