You’re staring at your API bill from OpenAI or Anthropic, watching the numbers climb every month, and thinking: “There has to be a better way.” Maybe you’ve tried open-source models before — downloaded some quantized thing from Hugging Face, got mediocre results, and went back to paying the cloud tax. That’s been the open-source AI experience for most of us. Until now.
StepFun, a Chinese AI lab that’s been quietly building models for years, just dropped Step 3.5 Flash — a 196 billion parameter open-source model that, according to their benchmarks, trades blows with GPT-5.2 and Claude Opus 4.5. The twist? It only activates 11 billion parameters per token. Think of it like a library with 196 billion books, but a very efficient librarian who only pulls the 11 billion you actually need for each question. The result is frontier-level intelligence at a fraction of the computational cost.
The Hacker News crowd is buzzing about it. Developers are running it locally on Mac Studios. And the benchmarks — if they hold up — suggest this could be the most important open-source model release of 2026 so far.
Based on our research into Step 3.5 Flash’s official documentation, benchmark data, community feedback, and technical specifications, here’s what you need to know before you commit your GPU hours to this model.
What Is Step 3.5 Flash?
Step 3.5 Flash is an open-source foundation model built by StepFun (阶跃AI), a Chinese AI company that’s been in the large language model space for a couple of years but has only recently gained significant attention in the English-speaking AI community. They’re also believed to be connected to ACEStep, a popular open-source music generation model, though the two projects operate under separate organizations on Hugging Face.
The model uses a sparse Mixture of Experts (MoE) architecture — the same approach that made models like DeepSeek V3 so efficient. But Step 3.5 Flash pushes this further. While DeepSeek V3.2 activates 37 billion of its 671 billion parameters, Step 3.5 Flash activates just 11 billion of 196 billion. That’s a dramatically smaller active footprint, which translates to faster inference and lower hardware requirements.
StepFun calls this “intelligence density” — packing as much capability as possible into as few active parameters as possible. According to their published benchmarks, the result is a model that matches or exceeds much larger competitors on reasoning, coding, and agentic tasks.
Key Features That Actually Matter
Mixture of Experts Done Right
The architecture uses 288 routed experts per layer plus one shared expert that’s always active. For each token, only the top 8 experts are selected. This means the model has the “memory” and learned patterns of a 196B model but runs with the speed of an 11B model. It’s like having a team of 289 specialists but only consulting 9 of them for any given question — the trick is knowing which 9 to call.
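The routing step is easy to picture in code. The sketch below uses the numbers from this article (288 routed experts, top-8 selection); the gating function itself is an assumption, since StepFun has not published it here, so uniform-random scores stand in for a learned gate.

```python
import heapq
import math
import random

N_ROUTED = 288   # routed experts per layer (from the article)
TOP_K = 8        # routed experts consulted per token (plus 1 always-on shared expert)

def route_token(scores):
    """Given one gating score per routed expert, pick the top-8 and
    softmax-normalize their weights. Illustrative only: the real router's
    scoring function is not documented in this article."""
    top = heapq.nlargest(TOP_K, range(len(scores)), key=scores.__getitem__)
    m = max(scores[i] for i in top)                 # subtract max for numerical stability
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(N_ROUTED)]  # stand-in for a learned gate
experts, weights = route_token(scores)
print(len(experts))                     # 8 experts selected out of 288
print(abs(sum(weights) - 1.0) < 1e-9)   # their mixing weights sum to 1
```

The key property this illustrates: every token pays for 8 routed experts plus the shared one, regardless of how many experts exist in total, which is why total and active parameter counts can diverge so sharply.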
Multi-Token Prediction (MTP-3)
Most language models predict one token at a time. Step 3.5 Flash predicts four tokens per step (the standard next token plus three drafted ahead) using what StepFun calls MTP-3 (3-way Multi-Token Prediction). According to their documentation, this achieves 100 to 300 tokens per second in typical usage, peaking at 350 tokens per second for single-stream coding tasks. For context, that’s genuinely fast — fast enough for real-time interaction even during complex reasoning chains.
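Multi-token prediction typically works as a draft-then-verify loop: the extra heads guess a few tokens ahead, and a verification pass decides how many to keep. StepFun's exact acceptance scheme is not documented in this article, so the helper below (`accepted_prefix` is a hypothetical name) shows only the generic prefix-acceptance idea.

```python
def accepted_prefix(draft, verified):
    """Length of the longest prefix of the drafted tokens that the
    verification pass agrees with. Everything up to the first mismatch
    is kept, so each step emits anywhere from zero to four draft tokens."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n

# Toy example: the draft heads guess 4 tokens ahead; verification agrees on 3.
draft = ["def", "main", "(", ")"]
verified = ["def", "main", "(", "args"]
print(accepted_prefix(draft, verified))  # 3
```

This is why the speedup is workload-dependent: predictable text like boilerplate code accepts long drafts (hence the higher single-stream coding throughput), while less predictable text falls back toward one token per step.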
256K Context Window
The model supports a 256K token context window using a hybrid attention approach: three Sliding Window Attention (SWA) layers for every one full-attention layer. This 3:1 ratio keeps computational costs manageable while still handling massive documents or codebases. Developers on Hacker News report that on a 128GB Mac, you can run the full 256K context or two simultaneous 128K streams.
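The 3:1 interleaving is simple to express as a layer schedule. This sketch assumes full-attention layers sit at every fourth position; StepFun's actual placement within each group of four is not specified in this article.

```python
def attention_schedule(n_layers, swa_per_full=3):
    """Interleave sliding-window and full-attention layers at the 3:1
    ratio described above. Assumption: every fourth layer is full
    attention; the real model may place them differently."""
    return ["full" if (i + 1) % (swa_per_full + 1) == 0 else "swa"
            for i in range(n_layers)]

sched = attention_schedule(12)
print(sched.count("swa"), sched.count("full"))  # 9 3
```

The cost intuition: SWA layers attend over a fixed window, so their compute and KV-cache footprint stay flat as context grows; only the one-in-four full-attention layers scale with the full 256K window, which is what makes long contexts affordable on consumer hardware.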
Built for Coding and Agents
This isn’t a general-purpose chatbot that happens to write code. Step 3.5 Flash was specifically trained for agentic tasks — it’s designed to use tools, execute code, browse the web, and chain together complex multi-step workflows. If you’ve been using tools like Cursor or Kilo Code for AI-assisted coding, Step 3.5 Flash is the kind of model that powers those experiences, except you can run it yourself.
Benchmark Performance: The Numbers (With Caveats)
Here’s where it gets interesting — and where we need to be honest about what benchmarks actually tell us.
According to StepFun’s published results, Step 3.5 Flash achieves an average score of 81.0 across their benchmark suite. For comparison:
| Model | Total Params | Active Params | Avg Score |
|---|---|---|---|
| GPT-5.2 xhigh | Unknown | Unknown | 82.2 |
| Step 3.5 Flash | 196B | 11B | 81.0 |
| Gemini 3.0 Pro | Unknown | Unknown | 80.7 |
| Claude Opus 4.5 | Unknown | Unknown | 80.6 |
| Kimi K2.5 | 1T | 32B | 80.5 |
| GLM-4.7 | 355B | 32B | 78.5 |
| DeepSeek V3.2 | 671B | 37B | 77.3 |
Some standout numbers from specific benchmarks:
- AIME 2025 (math reasoning): 97.3 — within striking distance of GPT-5.2’s perfect 100
- SWE-bench Verified (coding): 74.4% — competitive with Claude Opus 4.5’s 80.9% and ahead of DeepSeek V3.2’s 73.1%
- Terminal-Bench 2.0: 51.0% — beats DeepSeek V3.2 (46.4%) and GLM-4.7 (41.0%)
- τ²-Bench (agent tasks): 88.2 — above every competitor except Claude Opus 4.5 (92.5) and Gemini 3.0 Pro (90.7)
The honest caveat: These are self-reported benchmarks from StepFun. Every AI lab cherry-picks the benchmarks where their model shines. Community members on Hacker News have noted that real-world performance can diverge from these numbers. One user reported hallucination issues on simple factual queries, while others praised its coding abilities. As with any model, take the benchmarks as directional, not gospel.
Real-World Community Feedback
Step 3.5 Flash exploded on Hacker News with over 150 upvotes and extensive discussion. Here’s what actual developers are reporting:
The Good
- Impressive local performance: One developer running a 4-bit quantized version on an M1 Ultra Mac reported 36 tokens/second for generation and 300 tokens/second for prompt processing — and importantly, these speeds degrade slowly even at 100K+ context
- Strong agentic coding: Multiple users report it works well with CLI coding harnesses like pi.dev, calling it “the best experience I had with a local LLM doing agentic coding”
- Efficient reasoning: One user found it solved problems in fewer API calls than Claude Opus 4.6, despite spending more time thinking — 2 calls vs. 5 for the same task
- Context efficiency: The SWA approach means you can run multiple streams on consumer hardware
The Not-So-Good
- Hallucination problems: Users report it hallucinates on factual queries that other models handle cleanly. One developer tested it on specific domain knowledge and found it significantly less reliable than Claude Opus or DeepSeek
- Excessive reasoning chains: The model compensates for its smaller active parameter count by thinking longer. Multiple users report “the amount of reasoning output could fill a small book” even for simple tasks
- Infinite reasoning loops: There’s a known bug where the model sometimes enters an infinite reasoning loop. StepFun has acknowledged this, and a fix is expected in a future release
- Bare-bones code output: Some developers note that while it reasons extensively, the actual code output can be more minimal compared to GLM-4.7 or Kimi K2.5
Here’s what other reviews don’t tell you: the model’s efficiency is a double-edged sword. Yes, it activates fewer parameters and runs faster. But it compensates by generating far more reasoning tokens, so your actual wall-clock time for complex tasks may not be much better than with a bigger, slower model that thinks less but acts more.
How to Access Step 3.5 Flash
One of Step 3.5 Flash’s biggest advantages over proprietary models is the variety of ways you can use it:
Cloud API (Easiest)
OpenRouter currently offers a free trial for Step 3.5 Flash. You can sign up at openrouter.ai, grab an API key, and start making calls using the standard OpenAI SDK. The model name is stepfun/step-3.5-flash.
StepFun’s own platform is available at platform.stepfun.ai with their own API endpoint. Note that the main stepfun.com website may redirect non-Chinese users — use the platform subdomain directly.
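Since the endpoint is OpenAI-compatible, a call is a standard chat-completions POST. The sketch below only builds the request (nothing is sent), so it runs without a key or network access; the prompt and the `OPENROUTER_API_KEY` environment variable name are illustrative.

```python
import json
import os
import urllib.request

# OpenRouter exposes an OpenAI-compatible chat completions endpoint;
# the model slug is the one given in this article.
payload = {
    "model": "stepfun/step-3.5-flash",
    "messages": [
        {"role": "user", "content": "Write a binary search in Python."}
    ],
}
req = urllib.request.Request(
    "https://openrouter.ai/api/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', 'YOUR_KEY')}",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would actually send the request; it is
# omitted here so the sketch runs offline.
print(req.get_full_url())
```

The same payload works against StepFun's own endpoint by swapping the base URL and key, which is the practical benefit of OpenAI-compatible APIs: switching providers is a one-line change.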
Local Deployment (Most Control)
The full model weights are available on Hugging Face under the stepfun-ai organization. Supported inference backends include:
- vLLM — Recommended by StepFun, though full MTP-3 support is still being integrated
- SGLang — Alternative high-performance backend
- llama.cpp — For running quantized versions locally. Community-created GGUF quantizations are available
- Hugging Face Transformers — Standard Python inference
Hardware requirements for local deployment: For the full bf16 model, you’ll need 8x NVIDIA GPUs with tensor parallelism. For quantized versions (4-bit), developers report running it successfully on a Mac Studio M4 Max or even an M1 Ultra with 128GB unified memory.
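The memory math behind those hardware numbers is a quick back-of-envelope calculation: weights alone, ignoring KV cache, activations, and quantization overhead, which all add real headroom on top.

```python
def weight_gib(n_params_billion, bits_per_param):
    """Approximate memory for the model weights alone.
    Ignores KV cache, activations, and quantization overhead."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 2**30

for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_gib(196, bits):.0f} GiB")
```

At bf16 the weights alone are roughly 365 GiB, which is why full precision needs a multi-GPU node, while a 4-bit quant comes in around 91 GiB and squeezes into 128GB of unified memory with room left for the KV cache.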
Step 3.5 Flash vs. The Competition
How does Step 3.5 Flash stack up against other models you might be considering?
vs. DeepSeek V3.2
DeepSeek V3.2 is the most direct comparison — both are open-source Chinese MoE models. DeepSeek is larger (671B total, 37B active) and more established. According to the published benchmarks, Step 3.5 Flash outperforms DeepSeek on most metrics while being roughly 6x cheaper to run at inference time. However, DeepSeek has a much larger established community and broader third-party tooling support.
vs. Claude Opus 4.5 / GPT-5.2
The proprietary frontier models still lead in absolute capability, especially on complex coding tasks (Claude Opus 4.5 scores 80.9% on SWE-bench vs. Step 3.5 Flash’s 74.4%) and agent benchmarks. But they cost real money per token, and you can’t run them locally. If data privacy, cost control, or self-hosting matter to you, Step 3.5 Flash is in a different league from anything else available. Tools like OpenAI’s Codex and Claude Cowork offer polished experiences but lock you into their ecosystems.
vs. GLM-4.7 and Kimi K2.5
GLM-4.7 (355B) and Kimi K2.5 (1T) are other Chinese open-source options. Step 3.5 Flash beats GLM-4.7 on most benchmarks despite being roughly half the size. Kimi K2.5 is competitive but at 1 trillion parameters, it’s approximately 19x more expensive to run. Step 3.5 Flash hits the sweet spot of capability per compute dollar.
vs. Using It Inside Coding Tools
If you’re not running models directly but using AI coding tools, here’s the practical comparison: Windsurf, Augment Code, and Cursor all use proprietary model backends. Step 3.5 Flash could theoretically power any of these tools if they supported custom model backends. The HN community reports it works well with pi.dev and Claude Code’s harness, meaning you can get similar agentic coding experiences without paying per-token API costs.
Pricing and Cost Analysis
This is where Step 3.5 Flash’s value proposition becomes crystal clear.
- OpenRouter: Currently offering a free trial. Check their pricing page for current rates after the trial period
- StepFun Platform: API pricing available at platform.stepfun.ai — visit for current rates
- Self-hosted: Free (the weights are open-source), but you’re paying for hardware. A Mac Studio M4 Max that can run the 4-bit quant costs about $4,000 — which pays for itself quickly if you’re currently spending hundreds per month on API calls
According to StepFun’s own cost analysis, Step 3.5 Flash’s estimated decoding cost is approximately 1/6th of DeepSeek V3.2 and 1/19th of Kimi K2.5 at 128K context on comparable GPU hardware. If those numbers hold in practice, this is the most cost-efficient frontier-class model available today.
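The break-even arithmetic for self-hosting is worth doing explicitly. The hardware price is the article's figure; the monthly API spend is a placeholder to replace with your own bill.

```python
# Back-of-envelope break-even for self-hosting.
hardware_cost = 4000      # Mac Studio M4 Max (article's figure, USD)
monthly_api_spend = 400   # hypothetical current cloud API bill (USD/month)

months_to_break_even = hardware_cost / monthly_api_spend
print(months_to_break_even)  # 10.0
```

At a few hundred dollars a month in API spend, the hardware pays for itself within a year; below roughly $100/month, cloud access stays cheaper for a long time, and the free-trial tiers make more sense.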
Who Is Step 3.5 Flash For?
Ideal users:
- Developers who want local AI coding agents — If you want an agentic coding assistant that doesn’t phone home, this is the best open-source option currently available
- Companies with data privacy requirements — Run it on your own hardware, keep your code and data completely private
- AI researchers and tinkerers — Open weights mean you can fine-tune, study, and modify the model
- Cost-conscious teams — If your API bills from OpenAI or Anthropic are climbing and you have GPU capacity, switching to self-hosted Step 3.5 Flash could dramatically cut costs
- Mac users with high-end hardware — The M-series Apple Silicon support makes this accessible to a surprisingly wide audience
Not ideal for:
- Non-technical users — There’s no polished chat interface. This is a model, not a product. You need to know how to set up inference pipelines or use API clients
- Tasks requiring high factual accuracy — Community reports suggest hallucination is a real concern for knowledge-heavy queries
- People who need plug-and-play — If you want something that “just works” like ChatGPT or Claude, stick with those. Step 3.5 Flash requires technical setup
- Anyone without appropriate hardware — Even quantized, this model needs serious compute for local deployment
Honest Take: Should You Care About Step 3.5 Flash?
Here’s our honest assessment: Step 3.5 Flash represents a genuine shift in what’s possible with open-source AI. A year ago, the idea that an open model could seriously compete with GPT-5 and Claude Opus on reasoning benchmarks would have seemed far-fetched. Now it’s happening — and from a company most English-speaking developers hadn’t heard of until this week.
But let’s not get carried away. The benchmarks are self-reported. The model has known bugs (infinite reasoning loops). It hallucinates on factual queries. And the “efficiency” advantage gets complicated when you factor in its tendency to generate enormous reasoning chains.
What’s genuinely exciting is the trajectory. If StepFun can deliver this level of capability with 11B active parameters, what comes next? And what does it mean for the pricing of proprietary models when open alternatives keep closing the gap?
For developers who are comfortable with open-source tooling and want to run powerful AI locally, Step 3.5 Flash is worth serious attention. For everyone else, it’s worth watching — because models like this are what eventually force OpenAI and Anthropic to drop their prices.
Frequently Asked Questions
Is Step 3.5 Flash really free to use?
The model weights are open-source and freely downloadable from Hugging Face. You can run it locally at no cost beyond your own hardware. For cloud access, OpenRouter is currently offering a free trial, and StepFun has their own API platform. Check both providers’ websites for current pricing after trial periods end.
Can I run Step 3.5 Flash on my Mac?
Yes, but you need significant hardware. Developers report running 4-bit quantized versions on Mac Studio M4 Max and M1 Ultra systems with 128GB unified memory. Standard MacBooks with 8-16GB won’t cut it. With llama.cpp and a GGUF quantization, you can get around 36 tokens per second on an M1 Ultra.
How does Step 3.5 Flash compare to DeepSeek V3?
According to published benchmarks, Step 3.5 Flash outperforms DeepSeek V3.2 on most metrics (81.0 vs. 77.3 average score) while being approximately 6x cheaper to run at inference time. DeepSeek has a larger community and better third-party support. Both are open-source MoE models from Chinese AI labs.
Is Step 3.5 Flash good for coding?
According to benchmarks, it scores 74.4% on SWE-bench Verified and 51.0% on Terminal-Bench 2.0. Community developers report positive experiences using it for agentic coding with CLI tools. However, some users note it produces verbose reasoning chains and sometimes more minimal code output compared to alternatives.
What hardware do I need for local deployment?
For full precision (bf16), you need 8x NVIDIA GPUs with tensor parallelism. For quantized versions (4-bit), a Mac Studio with M4 Max or M1 Ultra and at least 128GB unified memory works. NVIDIA DGX Spark is another option. Consumer GPUs with less memory won’t run the model effectively.
Does Step 3.5 Flash support tool use and function calling?
Yes, it’s specifically designed for agentic tasks with tool use. According to StepFun’s documentation, it supports complex multi-tool orchestration, code execution within reasoning chains, and integration with frameworks like Claude Code. It scored 88.2 on τ²-Bench, an agent-focused benchmark.
What are the known issues with Step 3.5 Flash?
The main reported issues include: occasional infinite reasoning loops (a known bug that StepFun is working on), hallucination on factual queries, very long reasoning chain outputs that can increase response time, and the model sometimes producing simpler code output than expected given its benchmark scores.
Who is StepFun and can I trust them?
StepFun (阶跃AI) is a Chinese AI company that has been developing language models for several years. They also created ACEStep, a respected open-source music generation model. Because the model weights are open-source and available on Hugging Face, you don’t need to trust the company — you can inspect, run, and verify the model yourself.