Qwen 3.5 Review 2026: The Full Model Family — From 0.8B Edge to 122B Enterprise
Alibaba just dropped the rest of the Qwen 3.5 family. What started as a Medium-series launch on February 24, 2026 became a full-stack AI lineup when the Small series — 0.8B, 2B, 4B, and 9B — landed on March 2, 2026. The Qwen 3.5 lineup now covers every deployment scenario: phones and edge devices at the bottom, single-consumer GPUs in the middle, and server-grade infrastructure at the top. Every model carries an Apache 2.0 license. No strings.
The headline number from the Small series: the 9B model beats GPT-OSS-120B on GPQA Diamond (81.7 vs 71.5). A 9-billion parameter open-source model outperforming a 120-billion parameter proprietary model on a graduate-level science benchmark. That’s the intelligence density story r/LocalLLaMA has been losing their minds over since the drop.
This is the complete Qwen 3.5 review — all sizes, all benchmarks, pricing, who each one is actually for, and where the limits are.
Rating: 8.2/10 ⭐⭐⭐⭐
What Is Qwen 3.5?
Qwen 3.5 is Alibaba’s third-generation open-source AI model series, developed by the Qwen team at Alibaba Cloud. The full family spans nine distinct models from 0.8B to 397B parameters, with the majority released under Apache 2.0 — meaning commercial use, fine-tuning, and redistribution are all permitted without license negotiations.
The architecture breakthrough driving the family: a hybrid Gated Delta Network + Mixture-of-Experts design that delivers linear attention scaling for long sequences and activates only a fraction of total parameters per token. The result is models that punch far above their parameter count while remaining practical to run on real hardware. All models support native multimodal input (vision + text), native tool calling, and a built-in Thinking Mode for step-by-step reasoning.
Official site: qwen.ai | Models: Hugging Face Hub
The Full Qwen 3.5 Lineup: All Sizes at a Glance
The family released in three waves. Here’s the complete picture:
| Model | Total Params | Active Params | Context | License | Release Date | Best For |
|---|---|---|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | 0.8B (dense) | 32K | Apache 2.0 | 2026-03-02 | Edge devices, mobile, embedded |
| Qwen3.5-2B | 2B | 2B (dense) | 32K | Apache 2.0 | 2026-03-02 | Latency-sensitive apps, IoT |
| Qwen3.5-4B | 4B | 4B (dense) | 32K | Apache 2.0 | 2026-03-02 | Lightweight agents, multimodal apps |
| Qwen3.5-9B | 9B | 9B (dense) | 128K | Apache 2.0 | 2026-03-02 | Local power users, 8–16GB VRAM |
| Qwen3.5-27B | 27B | 27B (dense) | 262K | Apache 2.0 | 2026-02-24 | Document processing, simple deployment |
| Qwen3.5-35B-A3B | 35B | 3B (MoE) | 1M+ tokens | Apache 2.0 | 2026-02-24 | Local frontier AI, 32GB VRAM |
| Qwen3.5-Flash | ~35B | ~3B (MoE) | 1M+ tokens | API-only | 2026-02-24 | High-volume API, no infra overhead |
| Qwen3.5-122B-A10B | 122B | 10B (MoE) | 1M+ tokens | Apache 2.0 | 2026-02-24 | Enterprise on-prem, server GPU |
| Qwen3.5-397B-A17B | 397B | 17B (MoE) | 1M+ tokens | Apache 2.0 | 2026-02-16 | Frontier-tier, multi-node clusters |
The Story: Why r/LocalLLaMA Is Paying Attention
The Small series release on March 2 is what flipped the conversation from “interesting Medium release” to “this family is something different.” The benchmark that’s been circulating: Qwen3.5-9B scores 81.7 on GPQA Diamond, beating GPT-OSS-120B’s 71.5. GPQA Diamond is graduate-level science questions — physics, chemistry, biology — designed to stump most PhD students. A 9B model beating a 120B model on that benchmark is not a rounding error. It’s an architecture story.
Elon Musk’s reaction on X — “Impressive intelligence density” — captured what the community was already thinking. The ratio of capability to model size has jumped substantially. The Qwen3.5-4B is positioned specifically as a multimodal agent base model, and the 0.8B runs on-device inference on modern smartphones. This isn’t a research curiosity — it’s a deployable stack from edge to datacenter.
Benchmarks: The Numbers Worth Caring About
Small Series (0.8B – 9B) — Released March 2, 2026
| Benchmark | Qwen3.5-9B | GPT-OSS-120B | Qwen3-Next-80B-A3B |
|---|---|---|---|
| GPQA Diamond | 81.7 | 71.5 | — |
| HMMT Feb 2025 | 83.2 | 76.7 | — |
| MMMU-Pro | 70.1 | 59.7 | — |
| MMMLU (multilingual) | 81.2 | — | 81.3 |
| OmniDocBench v1.5 | 87.7 | — | — |
| ERQA | 55.5 | 44.3 | — |
Source: Alibaba Qwen team release benchmarks (March 2026). Note: GPT-OSS-120B benchmarks from Alibaba comparative data — third-party reproduction ongoing.
Medium Series (27B – 122B) — Released February 24, 2026
| Benchmark | 35B-A3B | 27B | 122B-A10B | GPT-5-mini | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| MMMLU | Beats | Beats | Beats | Baseline | Baseline |
| MMMU-Pro | Beats | Beats | Beats | Baseline | Baseline |
| SWE-bench Verified | — | 72.4 | — | 72.4 (tied) | — |
| BFCL-V4 (tool use) | — | — | 72.2 | — | — |
| BrowseComp | — | — | 63.8 | — | — |
| Terminal-Bench 2 | — | — | 49.4 | — | — |
Source: Alibaba Qwen team release data (February 2026). Independent community evals are ongoing.
Pricing
| Model / API | Input (per 1M tokens) | Output (per 1M tokens) | Self-Host Option |
|---|---|---|---|
| Qwen3.5-Flash (API) | $0.10 | $0.40 | No (API-only) |
| Qwen3.5-0.8B to 9B | Free (self-host) | Free (self-host) | Yes — Apache 2.0 |
| Qwen3.5-27B / 35B / 122B | Free (self-host) | Free (self-host) | Yes — Apache 2.0 |
| GPT-5-mini (OpenAI) | $0.40 | $1.60 | No |
| Claude Sonnet 4.5 (Anthropic) | $3.00 | $15.00 | No |
Qwen3.5-Flash at $0.10/M input tokens is 4× cheaper than GPT-5-mini and 30× cheaper than Claude Sonnet 4.5 at the input level. For high-volume pipelines — document processing, classification, summarization at scale — that pricing differential is the entire value proposition. Process 100M tokens/month on Flash and your input cost is $10. Same volume on Claude Sonnet 4.5 is $300.
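The arithmetic is easy to verify yourself. A minimal cost-comparison sketch, with prices hard-coded from the table above (always confirm against the providers' current rate cards before budgeting):

```python
# Rough monthly input-side cost comparison at a given token volume.
# Prices ($ per 1M input tokens) are copied from the pricing table above;
# they are a snapshot, not a live rate card.
PRICES_PER_M = {
    "qwen3.5-flash": 0.10,
    "gpt-5-mini": 0.40,
    "claude-sonnet-4.5": 3.00,
}

def monthly_input_cost(model: str, tokens: int) -> float:
    """Input-side cost in USD for the given monthly token volume."""
    return PRICES_PER_M[model] * tokens / 1_000_000

volume = 100_000_000  # 100M input tokens/month
for model in PRICES_PER_M:
    print(f"{model}: ${monthly_input_cost(model, volume):,.2f}/month")
```

At 100M tokens/month this reproduces the $10 vs $40 vs $300 spread quoted above; output-side costs scale the same way at the output prices.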
Architecture: What’s Actually Different
The engineering story behind the benchmark numbers: Qwen 3.5 uses a hybrid Gated Delta Network + Mixture-of-Experts architecture that solves two fundamental problems simultaneously.
The MoE problem: Traditional MoE models activate a small subset of experts per token, dramatically reducing active compute. Qwen3.5-35B-A3B activates only 8.6% of its total parameters per forward pass (3B of 35B), routing each token through 8 specialized experts out of 256 total, plus 1 shared expert. You get 35B-class quality ceiling at 3B inference cost.
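To make the routing concrete, here is a toy sketch of top-k expert selection with renormalized gate weights, using the 8-of-256 figures from the text. This is an illustration of the general MoE routing pattern, not Alibaba's actual implementation; in a real model the router logits come from a learned linear layer over the token's hidden state.

```python
import math

# Toy MoE router: pick the top-k experts for one token and renormalize
# their gate weights with a softmax over only the selected logits.
NUM_EXPERTS = 256
TOP_K = 8

def route(router_logits: list[float]) -> list[tuple[int, float]]:
    """Return (expert_index, gate_weight) pairs for the top-k experts."""
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:TOP_K]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Fake router logits for a single token.
logits = [math.sin(i * 0.37) for i in range(NUM_EXPERTS)]
selected = route(logits)
assert len(selected) == TOP_K
assert abs(sum(w for _, w in selected) - 1.0) < 1e-9
# The 1 shared expert described above runs for every token in addition
# to these 8, so active compute stays near 3B of the 35B total.
```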
The long-context problem: Standard transformer attention scales quadratically with sequence length — 2× the context means 4× the compute. Qwen 3.5’s Gated Delta Networks use linear attention that scales near-linearly. The architecture alternates: 3 Gated DeltaNet layers (linear attention) for every 1 full attention layer. The result is a million-token context window that’s architecturally designed to be tractable, not just technically possible.
Thinking Mode: All models ship with built-in chain-of-thought reasoning — the model generates an internal reasoning trace before its final output, similar to OpenAI’s o-series. This is architectural, not a prompt trick. It adds tokens (and latency), but substantially improves math, coding, and multi-step logic performance.
Quantization resilience: Near-lossless accuracy under 4-bit weight quantization. At 4-bit, the 35B-A3B fits in roughly 24–32GB of memory — a 24GB RTX 3090 or 4090 with a tighter quant, or an Apple Silicon M3 Max with 32GB unified memory. Most models degrade meaningfully at this compression level. Qwen 3.5 doesn’t — which is why the hardware requirements are practical.
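A back-of-envelope memory estimate shows why 4-bit is the practical threshold. Weights only, ignoring KV cache, activations, and runtime overhead, which is why real requirements sit above the raw figure:

```python
# Weight-memory estimate for a model at a given quantization level.
# Weights only: KV cache and runtime overhead are extra, so the
# practical VRAM requirement is higher than these numbers.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

print(f"35B @ 4-bit:  {weight_gb(35, 4):.1f} GiB")   # ~16 GiB of weights
print(f"35B @ 16-bit: {weight_gb(35, 16):.1f} GiB")  # unquantized bf16
```

At 4 bits the 35B-A3B's weights alone are around 16 GiB, which is why a long-context KV cache pushes the comfortable total toward 24–32GB, and why unquantized bf16 (over 65 GiB) is off the table for consumer cards.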
Who Each Model Tier Is Actually For
Small Series: 0.8B, 2B, 4B, 9B
Who it’s for: The Small series is the most underrated addition to the family. If you’re building something that needs to run on-device — a mobile app, an edge deployment, a Raspberry Pi-class device, a latency-critical pipeline — these are your options. The 0.8B and 2B run on hardware most people actually have. The 4B is purpose-built for lightweight multimodal agents. The 9B is for local AI users on standard gaming PCs (8–16GB VRAM) who want serious capability without server hardware.
The 9B specifically: If you have an RTX 3080 (10GB VRAM) or RTX 4070, the 9B is your new daily driver. A 9B model outperforming GPT-OSS-120B on graduate-level science benchmarks while fitting on a $500 GPU is genuinely unprecedented as of this release. Run it via Ollama, LM Studio, or llama.cpp.
Look elsewhere if: You need more than 128K context (the 9B’s limit), you need the full million-token window, or you need the advanced agentic benchmarks the 122B leads on.
Medium Series: 27B Dense + 35B-A3B MoE
Who it’s for: Serious local AI users and small development teams. The 35B-A3B is the star — 1M+ token context on a 24GB RTX 4090 (with tight quantization) or an Apple Silicon Mac with 32GB unified memory. The Apache 2.0 license means no API dependency, no usage caps, no pricing changes. You own the stack. For Ollama/LM Studio/llama.cpp users, this is a direct upgrade path.
The 27B dense is the simpler deployment option: predictable compute (no MoE routing), 262K context, ties GPT-5-mini on SWE-bench at 72.4. If you’re building a document processing pipeline and want something straightforward to serve, start with the 27B.
Look elsewhere if: You’re on less than 24GB VRAM (the 9B handles that case better), or you need the enterprise-grade agentic benchmarks the 122B leads on.
Large Series: 122B-A10B
Who it’s for: Enterprise teams running on-prem infrastructure. Regulated industries — finance, healthcare, legal — where sending data to external APIs is either prohibited or expensive to comply with. The 122B-A10B runs on server-grade hardware (NVIDIA DGX Spark, H100, A100, or 80GB VRAM multi-GPU setups) and leads the family on agentic benchmarks: BFCL-V4 at 72.2, BrowseComp at 63.8, Terminal-Bench 2 at 49.4. If you need the best tool-calling performance in the Qwen family and you have the hardware, this is the one.
Look elsewhere if: You don’t have server-grade hardware — the 35B-A3B covers consumer deployments better. Or if you just want cheap API access — Flash at $0.10/M input is the better choice there.
Qwen 3.5 vs Competitors: 4-Way Comparison
| | Qwen3.5-35B-A3B | Qwen3.5-9B | GPT-5-mini | Llama 3.3-70B |
|---|---|---|---|---|
| Parameters | 35B total / 3B active | 9B dense | Unknown (proprietary) | 70B dense |
| Context Window | 1M+ tokens | 128K tokens | 128K tokens | 128K tokens |
| License | Apache 2.0 | Apache 2.0 | Proprietary API | Llama Community |
| API Pricing (Input) | Free (self-host) / $0.10 via Flash | Free (self-host) | $0.40/M | ~$0.59/M (via providers) |
| Self-Hostable | ✅ (32GB VRAM) | ✅ (8–16GB VRAM) | ❌ | ✅ (48GB VRAM) |
| Thinking Mode | ✅ Native | ✅ Native | ✅ (o-series) | ❌ |
| Native Multimodal | ✅ | ✅ | ✅ | ❌ (text-only base) |
| GPQA Diamond | — | 81.7 | — | — |
| SWE-bench Verified | — | — | 72.4 | — |
| Best For | Local frontier AI, long context | 8–16GB GPU users, smart local model | Ecosystem integration, reliability | Established open-source baseline |
Key Features
1. Million-Token Context (35B-A3B, 122B-A10B, Flash)
Not bolted-on marketing. The Gated Delta Network architecture is built for it — linear attention scaling means the compute cost doesn’t explode at long contexts. Practical use: load an entire codebase, full legal document set, or long conversation history. The limitation: 32GB VRAM minimum for local use, and Thinking Mode adds latency on very long contexts.
2. Built-In Thinking Mode (All Models)
Every Qwen 3.5 model generates an internal chain of thought before producing output. On math, coding, and multi-step reasoning tasks, this meaningfully improves results. The tradeoff: more tokens generated = slower time-to-first-token. For batch processing, irrelevant. For interactive chat, noticeable.
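If you consume the raw output, you usually want to separate the trace from the answer. A sketch assuming the model wraps its reasoning in `<think>...</think>` tags, as earlier Qwen releases did; verify the actual delimiter against the Qwen 3.5 model card before relying on it:

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning trace, final answer).

    Assumes a <think>...</think> delimiter, per earlier Qwen releases;
    confirm against the 3.5 model card.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()  # no trace emitted
    trace = match.group(1).strip()
    answer = raw[match.end():].strip()
    return trace, answer

trace, answer = split_thinking("<think>17 * 3 = 51</think>The result is 51.")
assert trace == "17 * 3 = 51"
assert answer == "The result is 51."
```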
3. Apache 2.0 License (All Except Flash)
Commercial use, fine-tuning, redistribution — all permitted. No usage policy that can change on you. No dependency on Alibaba’s infrastructure except for Flash. If data sovereignty or API reliability is a concern, the open-weight models eliminate both.
4. Native Multimodal (All Models)
Vision-language capability baked in from training — early fusion on trillions of multimodal tokens, per the GitHub docs. Not a fine-tuned add-on. The 4B is specifically positioned as a multimodal agent base model for lightweight deployments.
5. Agentic Tool Calling (All Models)
Structured function calling supported across the family. The 122B-A10B leads on agentic benchmarks (BFCL-V4: 72.2, BrowseComp: 63.8). All models slot into LangChain, AutoGen, and other agent frameworks without custom work. Long context windows make them practical for multi-tool agent sessions where history accumulates.
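Since the hosted endpoints follow the OpenAI-compatible spec, a tool definition takes the standard function-schema shape. The `get_weather` function below is hypothetical, used purely to illustrate the structure:

```python
import json

# OpenAI-style tool definition, the shape accepted by OpenAI-compatible
# endpoints. "get_weather" is a hypothetical example function, not part
# of any Qwen API.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# Passed as tools=[weather_tool] in a chat completion request; the model
# then returns a structured tool_call instead of free text when it
# decides the function is needed.
print(json.dumps(weather_tool, indent=2))
```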
6. 201-Language Support
Expanded from Qwen 3’s multilingual coverage. This matters practically for performance on MMMLU (multilingual MMLU) — where the family consistently beats proprietary alternatives. One caveat, raised by Anthropic’s Dario Amodei: Chinese AI models can be optimized for benchmark performance that doesn’t fully translate to real-world quality across all languages. Worth testing for non-English, non-Chinese applications before committing.
What Isn’t Great
- Hardware requirements are still real. “Runs on consumer hardware” means the 35B-A3B needs 32GB VRAM. The 9B is the realistic option for 8–16GB GPUs. The 0.8B/2B for anything smaller. Most casual users don’t have 32GB VRAM — the Small series is the practical answer for that gap.
- Thinking Mode adds latency. Every model generates a reasoning chain before output. Interactive chat applications will notice. For batch processing or document workflows, it doesn’t matter. You can’t toggle it off at the architecture level.
- Flash is still proprietary. The one API-only model trades the Apache 2.0 freedom for managed access. You’re back to Alibaba’s infrastructure and pricing decisions — same constraint as any managed API.
- Independent benchmarks still incoming. Alibaba’s numbers are the only source right now. The Qwen series has had a credible track record, but the community will validate (or not) over the next several weeks. Dario Amodei’s critique about benchmark optimization vs. real-world performance is a legitimate flag, especially for specialized or low-resource language use cases.
- Multilingual quality varies below the top tiers. 201 languages supported doesn’t mean 201 languages performed equally well. Test thoroughly before committing to non-major-language deployments.
How to Get Started
All open-weight models are on Hugging Face Hub and ModelScope. For local inference:
- Ollama: `ollama pull qwen3.5:9b` or `ollama pull qwen3.5:35b-a3b` — simplest path, one command
- LM Studio: Search “Qwen3.5” in the model browser — GUI option, good for first-time testing
- llama.cpp: Download GGUF quantized variants from Hugging Face — most control, best for production local deployments
- vLLM: Best option for multi-user serving or building an API endpoint with the 122B-A10B
For the Flash API, access is through Alibaba Cloud Model Studio. The API is compatible with OpenAI’s API spec — minimal migration effort if you’re switching from GPT-5-mini.
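A request against the Flash API would look like any OpenAI-style chat completion. The sketch below builds the request without sending it; both the base URL and the model identifier are assumptions, so confirm them in the Model Studio documentation before use:

```python
import json
import os
import urllib.request

# Builds (but does not send) an OpenAI-compatible chat completion request.
# BASE_URL and the model id are assumptions -- confirm both in the
# Alibaba Cloud Model Studio docs.
BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"  # assumed

payload = {
    "model": "qwen3.5-flash",  # assumed model identifier
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the key risks in this clause."},
    ],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {os.environ.get('DASHSCOPE_API_KEY', '')}",
        "Content-Type": "application/json",
    },
)
print(req.get_full_url())
# resp = urllib.request.urlopen(req)  # uncomment with a valid API key
```

Because the request body is spec-identical to OpenAI's, pointing an existing `openai`-client integration at the Qwen base URL is typically the whole migration.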
Pros and Cons
Pros:
- Full coverage from edge (0.8B) to enterprise (122B) — one family, one architecture
- Apache 2.0 license with no commercial restrictions on all open-weight models
- 9B beats GPT-OSS-120B on GPQA Diamond — intelligence density nobody saw coming
- Million-token context window on 32GB VRAM (35B-A3B) is architecturally sound, not marketing
- Flash at $0.10/M input is 4× cheaper than GPT-5-mini, 30× cheaper than Claude Sonnet 4.5
- Native Thinking Mode across all models — no separate o-series SKU needed
- Native multimodal (vision + text) baked into training, not fine-tuned on top
- Near-lossless 4-bit quantization keeps hardware requirements practical
Cons:
- Thinking Mode latency hits interactive applications — can’t disable at architecture level
- 32GB VRAM minimum for the flagship local model (35B-A3B) — not trivial
- Flash is API-only, proprietary — no self-hosting option if you want the 1M context convenience
- Alibaba-sourced benchmarks, not yet fully validated by independent community evals
- Benchmark optimization criticism from Anthropic’s CEO is a legitimate flag for edge-case language deployments
Bottom Line
The Small series drop on March 2 completed what the Medium series started: a coherent, Apache 2.0 open-source AI family that covers every deployment scenario. The 9B beating GPT-OSS-120B on graduate-level science benchmarks is the number that changes how people think about model scale. The 35B-A3B hitting a million-token context on consumer hardware is the number that changes what local AI users can build.
If you’re on a gaming GPU (8–16GB VRAM), the 9B is your new local model. If you have 32GB VRAM, the 35B-A3B is the upgrade. If you’re building high-volume API pipelines, Flash at $0.10/M input should be in your benchmark queue this week. If you’re an enterprise with on-prem requirements, the 122B-A10B is the most capable fully-open enterprise option at this scale.
The caveats are real — independent benchmarks are still coming in, Thinking Mode adds latency, and the hardware requirements for the flagship models aren’t trivial. But the trajectory of the Qwen series has been consistently ahead of Western AI coverage’s expectations, and this release, covering the full stack from 0.8B to 397B, doesn’t look like an exception.
Updated March 3, 2026 to include the Qwen 3.5 Small series (0.8B, 2B, 4B, 9B) released March 2, 2026. Originally published March 1, 2026. computertech.co has covered AI tools since March 2022.
Frequently Asked Questions
What is Qwen 3.5?
Qwen 3.5 is Alibaba’s third-generation open-source AI model series, spanning nine models from 0.8B to 397B parameters. Released in phases from February–March 2026, most models are open-weight under Apache 2.0 license, meaning free commercial use, fine-tuning, and redistribution.
What sizes does Qwen 3.5 come in?
Qwen 3.5 covers nine sizes: Small series (0.8B, 2B, 4B, 9B — released March 2, 2026), Medium series (27B dense, 35B-A3B MoE, 122B-A10B MoE, and Flash API-only — released February 24), and the flagship 397B-A17B (released February 16). Each tier targets different hardware and deployment needs.
Which Qwen 3.5 model should I use with an 8GB or 16GB GPU?
The Qwen3.5-9B is the best choice for 8–16GB VRAM GPUs. It runs on a standard RTX 3080 or RTX 4070 and benchmarks above GPT-OSS-120B on GPQA Diamond (81.7 vs 71.5). For even less VRAM, the 4B or 2B handle most tasks with lower hardware requirements.
What GPU do I need to run Qwen3.5-35B-A3B locally?
The Qwen3.5-35B-A3B targets roughly 32GB of memory at 4-bit quantization. 24GB cards like the RTX 3090 and RTX 4090 work with tighter Q3/Q4 quants, and an Apple Silicon Mac with 32GB unified memory runs it comfortably. Near-lossless 4-bit quantization support means minimal quality sacrifice to hit that target.
How much does Qwen 3.5 cost?
Most Qwen 3.5 models are free to self-host under Apache 2.0. The only paid option is Qwen3.5-Flash (API-only), priced at $0.10 per million input tokens and $0.40 per million output tokens — roughly 4× cheaper than GPT-5-mini and 30× cheaper than Claude Sonnet 4.5.
Is Qwen 3.5 better than GPT-5-mini?
On published benchmarks, Qwen 3.5 beats GPT-5-mini on MMMLU and MMMU-Pro. The 27B ties GPT-5-mini on SWE-bench Verified (72.4). GPT-5-mini has stronger ecosystem integration. For raw benchmark performance and cost, Qwen 3.5 wins. For ecosystem and reliability, GPT-5-mini stays competitive.
What is Thinking Mode in Qwen 3.5?
Thinking Mode is a built-in chain-of-thought reasoning system across all Qwen 3.5 models. Before producing output, the model generates an internal reasoning trace — similar to OpenAI’s o-series. This improves math, coding, and multi-step logic performance but adds latency.
Can I use Qwen 3.5 commercially?
Yes. All open-weight Qwen 3.5 models are released under Apache 2.0, which permits commercial use, modification, and redistribution without needing permission from Alibaba or paying any fees. The only exception is Qwen3.5-Flash, which is API-only.
Which Qwen 3.5 model is best for building AI agents?
The Qwen3.5-122B-A10B leads on agentic benchmarks: BFCL-V4 (72.2), BrowseComp (63.8), Terminal-Bench 2 (49.4). For lightweight agents, the 4B is purpose-built as a multimodal agent base model. All Qwen 3.5 models support structured tool calling compatible with LangChain and AutoGen.
Is Qwen 3.5 available via API without self-hosting?
Yes. Qwen3.5-Flash is API-only via Alibaba Cloud Model Studio at $0.10/M input, $0.40/M output. It uses an OpenAI-compatible API spec, making migration from existing integrations straightforward. Larger models are also available via Alibaba Cloud for managed API access.