Qwen 3.5 Review 2026: The Full Model Family — From 0.8B Edge to 122B Enterprise
Alibaba just dropped the rest of the Qwen 3.5 family. What started as a Medium-series launch on February 24, 2026 became a full-stack AI lineup when the Small series — 0.8B, 2B, 4B, and 9B — landed on March 2, 2026. The Qwen 3.5 lineup now covers every deployment scenario: phones and edge devices at the bottom, single-consumer GPUs in the middle, and server-grade infrastructure at the top. Every model carries an Apache 2.0 license. No strings.
The headline number from the Small series: the 9B model beats GPT-OSS-120B on GPQA Diamond (81.7 vs 71.5). A 9-billion parameter open-source model outperforming a 120-billion parameter proprietary model on a graduate-level science benchmark. That’s the intelligence density story r/LocalLLaMA has been losing their minds over since the drop.
This is the complete Qwen 3.5 review — all sizes, all benchmarks, pricing, who each one is actually for, and where the limits are.
Rating: 8.2/10 ⭐⭐⭐⭐
What Is Qwen 3.5?
Qwen 3.5 is Alibaba’s third-generation open-source AI model series, developed by the Qwen team at Alibaba Cloud. The full family spans nine distinct models from 0.8B to 397B parameters, with the majority released under Apache 2.0 — meaning commercial use, fine-tuning, and redistribution are all permitted without license negotiations.
The architecture breakthrough driving the family: a hybrid Gated Delta Network + Mixture-of-Experts design that delivers linear attention scaling for long sequences and activates only a fraction of total parameters per token. The result is models that punch far above their parameter count while remaining practical to run on real hardware. All models support native multimodal input (vision + text), native tool calling, and a built-in Thinking Mode for step-by-step reasoning.
Official site: qwen.ai | Models: Hugging Face Hub
The Full Qwen 3.5 Lineup: All Sizes at a Glance
The family released in three waves. Here’s the complete picture:
| Model | Total Params | Active Params | Context | License | Release Date | Best For |
|---|---|---|---|---|---|---|
| Qwen3.5-0.8B | 0.8B | 0.8B (dense) | 32K | Apache 2.0 | 2026-03-02 | Edge devices, mobile, embedded |
| Qwen3.5-2B | 2B | 2B (dense) | 32K | Apache 2.0 | 2026-03-02 | Latency-sensitive apps, IoT |
| Qwen3.5-4B | 4B | 4B (dense) | 32K | Apache 2.0 | 2026-03-02 | Lightweight agents, multimodal apps |
| Qwen3.5-9B | 9B | 9B (dense) | 128K | Apache 2.0 | 2026-03-02 | Local power users, 8–16GB VRAM |
| Qwen3.5-27B | 27B | 27B (dense) | 262K | Apache 2.0 | 2026-02-24 | Document processing, simple deployment |
| Qwen3.5-35B-A3B | 35B | 3B (MoE) | 1M+ tokens | Apache 2.0 | 2026-02-24 | Local frontier AI, 32GB VRAM |
| Qwen3.5-Flash | ~35B | ~3B (MoE) | 1M+ tokens | API-only | 2026-02-24 | High-volume API, no infra overhead |
| Qwen3.5-122B-A10B | 122B | 10B (MoE) | 1M+ tokens | Apache 2.0 | 2026-02-24 | Enterprise on-prem, server GPU |
| Qwen3.5-397B-A17B | 397B | 17B (MoE) | 1M+ tokens | Apache 2.0 | 2026-02-16 | Frontier-tier, multi-node clusters |
The Story: Why r/LocalLLaMA Is Paying Attention
The Small series release on March 2 is what flipped the conversation from “interesting Medium release” to “this family is something different.” The benchmark that’s been circulating: Qwen3.5-9B scores 81.7 on GPQA Diamond, beating GPT-OSS-120B’s 71.5. GPQA Diamond is graduate-level science questions — physics, chemistry, biology — designed to stump most PhD students. A 9B model beating a 120B model on that benchmark is not a rounding error. It’s an architecture story.
Elon Musk’s reaction on X — “Impressive intelligence density” — captured what the community was already thinking. The ratio of capability to model size has jumped substantially. The Qwen3.5-4B is positioned specifically as a multimodal agent base model, and the 0.8B runs on-device inference on modern smartphones. This isn’t a research curiosity — it’s a deployable stack from edge to datacenter.
Benchmarks: The Numbers Worth Caring About
Small Series (0.8B – 9B) — Released March 2, 2026
| Benchmark | Qwen3.5-9B | GPT-OSS-120B | Qwen3-Next-80B-A3B |
|---|---|---|---|
| GPQA Diamond | 81.7 | 71.5 | — |
| HMMT Feb 2025 | 83.2 | 76.7 | — |
| MMMU-Pro | 70.1 | 59.7 | — |
| MMMLU (multilingual) | 81.2 | — | 81.3 |
| OmniDocBench v1.5 | 87.7 | — | — |
| ERQA | 55.5 | 44.3 | — |
Source: Alibaba Qwen team release benchmarks (March 2026). Note: GPT-OSS-120B benchmarks from Alibaba comparative data — third-party reproduction ongoing.
Medium Series (27B – 122B) — Released February 24, 2026
| Benchmark | 35B-A3B | 27B | 122B-A10B | GPT-5-mini | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| MMMLU | Beats | Beats | Beats | Baseline | Baseline |
| MMMU-Pro | Beats | Beats | Beats | Baseline | Baseline |
| SWE-bench Verified | — | 72.4 | — | 72.4 (tied) | — |
| BFCL-V4 (tool use) | — | — | 72.2 | — | — |
| BrowseComp | — | — | 63.8 | — | — |
| Terminal-Bench 2 | — | — | 49.4 | — | — |
Source: Alibaba Qwen team release data (February 2026). Independent community evals are ongoing.
Pricing
| Model / API | Input (per 1M tokens) | Output (per 1M tokens) | Self-Host Option |
|---|---|---|---|
| Qwen3.5-Flash (API) | $0.10 | $0.40 | No (API-only) |
| Qwen3.5-0.8B to 9B | Free (self-host) | Free (self-host) | Yes — Apache 2.0 |
| Qwen3.5-27B / 35B / 122B | Free (self-host) | Free (self-host) | Yes — Apache 2.0 |
| GPT-5-mini (OpenAI) | $0.40 | $1.60 | No |
| Claude Sonnet 4.5 (Anthropic) | $3.00 | $15.00 | No |
Qwen3.5-Flash at $0.10/M input tokens is 4× cheaper than GPT-5-mini and 30× cheaper than Claude Sonnet 4.5 at the input level. For high-volume pipelines — document processing, classification, summarization at scale — that pricing differential is the entire value proposition. Process 100M tokens/month on Flash and your input cost is $10. Same volume on Claude Sonnet 4.5 is $300.
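The arithmetic is easy to verify yourself. A minimal cost-comparison sketch, with prices hard-coded from the table above (always confirm against the providers' current rate cards before budgeting):

```python
# Rough monthly input-side cost comparison at a given token volume.
# Prices ($ per 1M input tokens) are copied from the pricing table above;
# they are a snapshot, not a live rate card.
PRICES_PER_M = {
    "qwen3.5-flash": 0.10,
    "gpt-5-mini": 0.40,
    "claude-sonnet-4.5": 3.00,
}

def monthly_input_cost(model: str, tokens: int) -> float:
    """Input-side cost in USD for the given monthly token volume."""
    return PRICES_PER_M[model] * tokens / 1_000_000

volume = 100_000_000  # 100M input tokens/month
for model in PRICES_PER_M:
    print(f"{model}: ${monthly_input_cost(model, volume):,.2f}/month")
```

At 100M tokens/month this reproduces the $10 vs $40 vs $300 spread quoted above; output-side costs scale the same way at the output prices.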
Architecture: What’s Actually Different
The engineering story behind the benchmark numbers: Qwen 3.5 uses a hybrid Gated Delta Network + Mixture-of-Experts architecture that solves two fundamental problems simultaneously.
The MoE problem: Traditional MoE models activate a small subset of experts per token, dramatically reducing active compute. Qwen3.5-35B-A3B activates only 8.6% of its total parameters per forward pass (3B of 35B), routing each token through 8 specialized experts out of 256 total, plus 1 shared expert. You get 35B-class quality ceiling at 3B inference cost.
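To make the routing concrete, here is a toy sketch of top-k expert selection with renormalized gate weights, using the 8-of-256 figures from the text. This is an illustration of the general MoE routing pattern, not Alibaba's actual implementation; in a real model the router logits come from a learned linear layer over the token's hidden state.

```python
import math

# Toy MoE router: pick the top-k experts for one token and renormalize
# their gate weights with a softmax over only the selected logits.
NUM_EXPERTS = 256
TOP_K = 8

def route(router_logits: list[float]) -> list[tuple[int, float]]:
    """Return (expert_index, gate_weight) pairs for the top-k experts."""
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:TOP_K]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Fake router logits for a single token.
logits = [math.sin(i * 0.37) for i in range(NUM_EXPERTS)]
selected = route(logits)
assert len(selected) == TOP_K
assert abs(sum(w for _, w in selected) - 1.0) < 1e-9
# The 1 shared expert described above runs for every token in addition
# to these 8, so active compute stays near 3B of the 35B total.
```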
The long-context problem: Standard transformer attention scales quadratically with sequence length — 2× the context means 4× the compute. Qwen 3.5’s Gated Delta Networks use linear attention that scales near-linearly. The architecture alternates: 3 Gated DeltaNet layers (linear attention) for every 1 full attention layer. The result is a million-token context window that’s architecturally designed to be tractable, not just technically possible.
Thinking Mode: All models ship with built-in chain-of-thought reasoning — the model generates an internal reasoning trace before its final output, similar to OpenAI’s o-series. This is architectural, not a prompt trick. It adds tokens (and latency), but substantially improves math, coding, and multi-step logic performance.
Quantization resilience: Near-lossless accuracy under 4-bit weight quantization. At 4-bit, the 35B-A3B fits in roughly 24–32GB of memory — a 24GB RTX 3090 or 4090 with a tighter quant, or an Apple Silicon M3 Max with 32GB unified memory. Most models degrade meaningfully at this compression level. Qwen 3.5 doesn’t — which is why the hardware requirements are practical.
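A back-of-envelope memory estimate shows why 4-bit is the practical threshold. Weights only, ignoring KV cache, activations, and runtime overhead, which is why real requirements sit above the raw figure:

```python
# Weight-memory estimate for a model at a given quantization level.
# Weights only: KV cache and runtime overhead are extra, so the
# practical VRAM requirement is higher than these numbers.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

print(f"35B @ 4-bit:  {weight_gb(35, 4):.1f} GiB")   # ~16 GiB of weights
print(f"35B @ 16-bit: {weight_gb(35, 16):.1f} GiB")  # unquantized bf16
```

At 4 bits the 35B-A3B's weights alone are around 16 GiB, which is why a long-context KV cache pushes the comfortable total toward 24–32GB, and why unquantized bf16 (over 65 GiB) is off the table for consumer cards.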
Who Each Model Tier Is Actually For
Small Series: 0.8B, 2B, 4B, 9B
Who it’s for: The Small series is the most underrated addition to the family. If you’re building something that needs to run on-device — a mobile app, an edge deployment, a Raspberry Pi-class device, a latency-critical pipeline — these are your options. The 0.8B and 2B run on hardware most people actually have. The 4B is purpose-built for lightweight multimodal agents. The 9B is for local AI users on standard gaming PCs (8–16GB VRAM) who want serious capability without server hardware.
The 9B specifically: If you have an RTX 3080 (10GB VRAM) or RTX 4070, the 9B is your new daily driver. A 9B model outperforming GPT-OSS-120B on graduate-level science benchmarks while fitting on a $500 GPU is genuinely unprecedented as of this release. Run it via Ollama, LM Studio, or llama.cpp.
Look elsewhere if: You need more than 128K context (the 9B’s limit), you need the full million-token window, or you need the advanced agentic benchmarks the 122B leads on.
Medium Series: 27B Dense + 35B-A3B MoE
Who it’s for: Serious local AI users and small development teams. The 35B-A3B is the star — 1M+ token context on a 24GB RTX 4090 (with tight quantization) or an Apple Silicon Mac with 32GB unified memory. The Apache 2.0 license means no API dependency, no usage caps, no pricing changes. You own the stack. For Ollama/LM Studio/llama.cpp users, this is a direct upgrade path.
The 27B dense is the simpler deployment option: predictable compute (no MoE routing), 262K context, ties GPT-5-mini on SWE-bench at 72.4. If you’re building a document processing pipeline and want something straightforward to serve, start with the 27B.
Look elsewhere if: You’re on less than 24GB VRAM (the 9B handles that case better), or you need the enterprise-grade agentic benchmarks the 122B leads on.
Large Series: 122B-A10B
Who it’s for: Enterprise teams running on-prem infrastructure. Regulated industries — finance, healthcare, legal — where sending data to external APIs is either prohibited or expensive to comply with. The 122B-A10B runs on server-grade hardware (NVIDIA DGX Spark, H100, A100, or 80GB VRAM multi-GPU setups) and leads the family on agentic benchmarks: BFCL-V4 at 72.2, BrowseComp at 63.8, Terminal-Bench 2 at 49.4. If you need the best tool-calling performance in the Qwen family and you have the hardware, this is the one.
Look elsewhere if: You don’t have server-grade hardware — the 35B-A3B covers consumer deployments better. Or if you just want cheap API access — Flash at $0.10/M input is the better choice there.
Qwen 3.5 vs Competitors: 4-Way Comparison
| | Qwen3.5-35B-A3B | Qwen3.5-9B | GPT-5-mini | Llama 3.3-70B |
|---|---|---|---|---|
| Parameters | 35B total / 3B active | 9B dense | Unknown (proprietary) | 70B dense |
| Context Window | 1M+ tokens | 128K tokens | 128K tokens | 128K tokens |
| License | Apache 2.0 | Apache 2.0 | Proprietary API | Llama Community |
| API Pricing (Input) | Free (self-host) / $0.10 via Flash | Free (self-host) | $0.40/M | ~$0.59/M (via providers) |
| Self-Hostable | ✅ (32GB VRAM) | ✅ (8–16GB VRAM) | ❌ | ✅ (48GB VRAM) |
| Thinking Mode | ✅ Native | ✅ Native | ✅ (o-series) | ❌ |
| Native Multimodal | ✅ | ✅ | ✅ | ❌ (text-only base) |
| GPQA Diamond | — | 81.7 | — | — |
| SWE-bench Verified | — | — | 72.4 | — |
| Best For | Local frontier AI, long context | 8–16GB GPU users, smart local model | Ecosystem integration, reliability | Established open-source baseline |
Key Features
1. Million-Token Context (35B-A3B, 122B-A10B, Flash)
Not bolted-on marketing. The Gated Delta Network architecture is built for it — linear attention scaling means the compute cost doesn’t explode at long contexts. Practical use: load an entire codebase, full legal document set, or long conversation history. The limitation: 32GB VRAM minimum for local use, and Thinking Mode adds latency on very long contexts.
2. Built-In Thinking Mode (All Models)
Every Qwen 3.5 model generates an internal chain of thought before producing output. On math, coding, and multi-step reasoning tasks, this meaningfully improves results. The tradeoff: more tokens generated = slower time-to-first-token. For batch processing, irrelevant. For interactive chat, noticeable.
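If you consume the raw output, you usually want to separate the trace from the answer. A sketch assuming the model wraps its reasoning in `<think>...</think>` tags, as earlier Qwen releases did; verify the actual delimiter against the Qwen 3.5 model card before relying on it:

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning trace, final answer).

    Assumes a <think>...</think> delimiter, per earlier Qwen releases;
    confirm against the 3.5 model card.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()  # no trace emitted
    trace = match.group(1).strip()
    answer = raw[match.end():].strip()
    return trace, answer

trace, answer = split_thinking("<think>17 * 3 = 51</think>The result is 51.")
assert trace == "17 * 3 = 51"
assert answer == "The result is 51."
```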
3. Apache 2.0 License (All Except Flash)
Commercial use, fine-tuning, redistribution — all permitted. No usage policy that can change on you. No dependency on Alibaba’s infrastructure except for Flash. If data sovereignty or API reliability is a concern, the open-weight models eliminate both.
4. Native Multimodal (All Models)
Vision-language capability baked in from training — early fusion on trillions of multimodal tokens, per the GitHub docs. Not a fine-tuned add-on. The 4B is specifically positioned as a multimodal agent base model for lightweight deployments.
5. Agentic Tool Calling (All Models)
Structured function calling supported across the family. The 122B-A10B leads on agentic benchmarks (BFCL-V4: 72.2, BrowseComp: 63.8). All models slot into LangChain, AutoGen, and other agent frameworks without custom work. Long context windows make them practical for multi-tool agent sessions where history accumulates.
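Since the hosted endpoints follow the OpenAI-compatible spec, a tool definition takes the standard function-schema shape. The `get_weather` function below is hypothetical, used purely to illustrate the structure:

```python
import json

# OpenAI-style tool definition, the shape accepted by OpenAI-compatible
# endpoints. "get_weather" is a hypothetical example function, not part
# of any Qwen API.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# Passed as tools=[weather_tool] in a chat completion request; the model
# then returns a structured tool_call instead of free text when it
# decides the function is needed.
print(json.dumps(weather_tool, indent=2))
```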
6. 201-Language Support
Expanded from Qwen 3’s multilingual coverage. This matters practically for performance on MMMLU (multilingual MMLU) — where the family consistently beats proprietary alternatives. One caveat, raised by Anthropic’s Dario Amodei: Chinese AI models can be optimized for benchmark performance that doesn’t fully translate to real-world quality across all languages. Worth testing for non-English, non-Chinese applications before committing.
What Isn’t Great
- Hardware requirements are still real. “Runs on consumer hardware” means the 35B-A3B needs 32GB VRAM. The 9B is the realistic option for 8–16GB GPUs. The 0.8B/2B for anything smaller. Most casual users don’t have 32GB VRAM — the Small series is the practical answer for that gap.
- Thinking Mode adds latency. Every model generates a reasoning chain before output. Interactive chat applications will notice. For batch processing or document workflows, it doesn’t matter. You can’t toggle it off at the architecture level.
- Flash is still proprietary. The one API-only model trades the Apache 2.0 freedom for managed access. You’re back to Alibaba’s infrastructure and pricing decisions — same constraint as any managed API.
- Independent benchmarks still incoming. Alibaba’s numbers are the only source right now. The Qwen series has had a credible track record, but the community will validate (or not) over the next several weeks. Dario Amodei’s critique about benchmark optimization vs. real-world performance is a legitimate flag, especially for specialized or low-resource language use cases.
- Multilingual quality varies below the top tiers. 201 languages supported doesn’t mean 201 languages performed equally well. Test thoroughly before committing to non-major-language deployments.
How to Get Started
All open-weight models are on Hugging Face Hub and ModelScope. For local inference:
- Ollama: `ollama pull qwen3.5:9b` or `ollama pull qwen3.5:35b-a3b` — simplest path, one command
- LM Studio: Search “Qwen3.5” in the model browser — GUI option, good for first-time testing
- llama.cpp: Download GGUF quantized variants from Hugging Face — most control, best for production local deployments
- vLLM: Best option for multi-user serving or building an API endpoint with the 122B-A10B
For the Flash API, access is through Alibaba Cloud Model Studio. The API is compatible with OpenAI’s API spec — minimal migration effort if you’re switching from GPT-5-mini.
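A request against the Flash API would look like any OpenAI-style chat completion. The sketch below builds the request without sending it; both the base URL and the model identifier are assumptions, so confirm them in the Model Studio documentation before use:

```python
import json
import os
import urllib.request

# Builds (but does not send) an OpenAI-compatible chat completion request.
# BASE_URL and the model id are assumptions -- confirm both in the
# Alibaba Cloud Model Studio docs.
BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"  # assumed

payload = {
    "model": "qwen3.5-flash",  # assumed model identifier
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the key risks in this clause."},
    ],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {os.environ.get('DASHSCOPE_API_KEY', '')}",
        "Content-Type": "application/json",
    },
)
print(req.get_full_url())
# resp = urllib.request.urlopen(req)  # uncomment with a valid API key
```

Because the request body is spec-identical to OpenAI's, pointing an existing `openai`-client integration at the Qwen base URL is typically the whole migration.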
Pros and Cons
Pros:
- Full coverage from edge (0.8B) to enterprise (122B) — one family, one architecture
- Apache 2.0 license with no commercial restrictions on all open-weight models
- 9B beats GPT-OSS-120B on GPQA Diamond — intelligence density nobody saw coming
- Million-token context window on 32GB VRAM (35B-A3B) is architecturally sound, not marketing
- Flash at $0.10/M input is 4× cheaper than GPT-5-mini, 30× cheaper than Claude Sonnet 4.5
- Native Thinking Mode across all models — no separate o-series SKU needed
- Native multimodal (vision + text) baked into training, not fine-tuned on top
- Near-lossless 4-bit quantization keeps hardware requirements practical
Cons:
- Thinking Mode latency hits interactive applications — can’t disable at architecture level
- 32GB VRAM minimum for the flagship local model (35B-A3B) — not trivial
- Flash is API-only, proprietary — no self-hosting option if you want the 1M context convenience
- Alibaba-sourced benchmarks, not yet fully validated by independent community evals
- Benchmark optimization criticism from Anthropic’s CEO is a legitimate flag for edge-case language deployments
Bottom Line
The Small series drop on March 2 completed what the Medium series started: a coherent, Apache 2.0 open-source AI family that covers every deployment scenario. The 9B beating GPT-OSS-120B on graduate-level science benchmarks is the number that changes how people think about model scale. The 35B-A3B hitting a million-token context on consumer hardware is the number that changes what local AI users can build.
If you’re on a gaming GPU (8–16GB VRAM), the 9B is your new local model. If you have 32GB VRAM, the 35B-A3B is the upgrade. If you’re building high-volume API pipelines, Flash at $0.10/M input should be in your benchmark queue this week. If you’re an enterprise with on-prem requirements, the 122B-A10B is the most capable fully-open enterprise option at this scale.
The caveats are real — independent benchmarks are still coming in, Thinking Mode adds latency, and the hardware requirements for the flagship models aren’t trivial. But the trajectory of the Qwen series has been consistently ahead of Western AI coverage’s expectations, and this release, covering the full stack from 0.8B to 397B, doesn’t look like an exception.
Updated March 3, 2026 to include the Qwen 3.5 Small series (0.8B, 2B, 4B, 9B) released March 2, 2026. Originally published March 1, 2026. computertech.co has covered AI tools since March 2022.
Frequently Asked Questions
What is Qwen 3.5?
Qwen 3.5 is Alibaba’s third-generation open-source AI model series, spanning nine models from 0.8B to 397B parameters. Released in phases from February–March 2026, most models are open-weight under Apache 2.0 license, meaning free commercial use, fine-tuning, and redistribution.
What sizes does Qwen 3.5 come in?
Qwen 3.5 covers nine sizes: Small series (0.8B, 2B, 4B, 9B — released March 2, 2026), Medium series (27B dense, 35B-A3B MoE, 122B-A10B MoE, and Flash API-only — released February 24), and the flagship 397B-A17B (released February 16). Each tier targets different hardware and deployment needs.
Which Qwen 3.5 model should I use with an 8GB or 16GB GPU?
The Qwen3.5-9B is the best choice for 8–16GB VRAM GPUs. It runs on a standard RTX 3080 or RTX 4070 and benchmarks above GPT-OSS-120B on GPQA Diamond (81.7 vs 71.5). For even less VRAM, the 4B or 2B handle most tasks with lower hardware requirements.
What GPU do I need to run Qwen3.5-35B-A3B locally?
The Qwen3.5-35B-A3B targets roughly 32GB of memory at 4-bit quantization. 24GB cards like the RTX 3090 and RTX 4090 work with tighter Q3/Q4 quants, and an Apple Silicon Mac with 32GB unified memory runs it comfortably. Near-lossless 4-bit quantization support means minimal quality sacrifice to hit that target.
How much does Qwen 3.5 cost?
Most Qwen 3.5 models are free to self-host under Apache 2.0. The only paid option is Qwen3.5-Flash (API-only), priced at $0.10 per million input tokens and $0.40 per million output tokens — roughly 4× cheaper than GPT-5-mini and 30× cheaper than Claude Sonnet 4.5.
Is Qwen 3.5 better than GPT-5-mini?
On published benchmarks, Qwen 3.5 beats GPT-5-mini on MMMLU and MMMU-Pro. The 27B ties GPT-5-mini on SWE-bench Verified (72.4). GPT-5-mini has stronger ecosystem integration. For raw benchmark performance and cost, Qwen 3.5 wins. For ecosystem and reliability, GPT-5-mini stays competitive.
What is Thinking Mode in Qwen 3.5?
Thinking Mode is a built-in chain-of-thought reasoning system across all Qwen 3.5 models. Before producing output, the model generates an internal reasoning trace — similar to OpenAI’s o-series. This improves math, coding, and multi-step logic performance but adds latency.
Can I use Qwen 3.5 commercially?
Yes. All open-weight Qwen 3.5 models are released under Apache 2.0, which permits commercial use, modification, and redistribution without needing permission from Alibaba or paying any fees. The only exception is Qwen3.5-Flash, which is API-only.
Which Qwen 3.5 model is best for building AI agents?
The Qwen3.5-122B-A10B leads on agentic benchmarks: BFCL-V4 (72.2), BrowseComp (63.8), Terminal-Bench 2 (49.4). For lightweight agents, the 4B is purpose-built as a multimodal agent base model. All Qwen 3.5 models support structured tool calling compatible with LangChain and AutoGen.
Is Qwen 3.5 available via API without self-hosting?
Yes. Qwen3.5-Flash is API-only via Alibaba Cloud Model Studio at $0.10/M input, $0.40/M output. It uses an OpenAI-compatible API spec, making migration from existing integrations straightforward. Larger models are also available via Alibaba Cloud for managed API access.