OpenAI dropped GPT-5.4 today, March 5, 2026 — and it immediately crossed a threshold no general-purpose AI model has crossed before: 75.0% on OSWorld-Verified, beating the 72.4% human baseline at real computer operation. That’s not a marketing bullet point. It’s the benchmark equivalent of a model sitting down at someone’s keyboard and outperforming them. Meanwhile, on GDPval — a test of professional knowledge work across 44 occupations — GPT-5.4 matches or beats industry professionals 83% of the time, jumping 12 points over GPT-5.2’s 70.9%. If you do any serious knowledge work or build AI agents, this launch deserves your full attention.
What Is GPT-5.4?
GPT-5.4 is OpenAI’s current frontier model, released March 5, 2026. It’s available in three places simultaneously: ChatGPT (as GPT-5.4 Thinking, replacing GPT-5.2 Thinking for Plus, Team, and Pro subscribers), the OpenAI API (model ID: gpt-5.4), and Codex for software engineering workflows. A separate GPT-5.4 Pro variant is available in ChatGPT Pro ($200/mo) and the API (gpt-5.4-pro) for maximum performance on the most demanding tasks.
This is not a point release. GPT-5.4 fuses the coding capabilities of GPT-5.3-Codex (if you're picking AI coding tools, our Cursor vs Windsurf vs GitHub Copilot breakdown covers the alternatives) with GPT-5.2’s general reasoning foundation, then adds native computer-use abilities, a 1M token context window, and a dramatically more efficient tool-use engine. The result is a single model that handles professional knowledge work, autonomous computer operation, agentic coding, and long-horizon reasoning — without needing separate specialist models for each task. Official announcement: openai.com/index/introducing-gpt-5-4/
One key note: GPT-5.4 Thinking context windows in ChatGPT remain the same as GPT-5.2 Thinking. The 1M context window is available in Codex and the API (with requests over 272K counting at 2x rate against limits).
The Story: First General-Purpose Model to Beat Humans at Computer Use
Let’s be precise about what “beats humans” means here, because it matters. OSWorld-Verified is a benchmark that tests AI models on real GUI-based desktop tasks — not synthetic prompts, but actual software operation: navigating file systems, filling spreadsheet cells, managing browsers, executing multi-step workflows in real applications. The human performance baseline on this benchmark is 72.4%. GPT-5.4 scores 75.0%. That’s not a rounding error — it’s a clear crossover.
Every previous AI model that could “use a computer” was either a specialist model (Claude’s Computer Use API, GPT-5.2 with vision-only scripting) or performed well below human level. GPT-5.4 embeds this capability natively into the mainstream model. You’re not installing a separate plugin or enabling a beta feature — computer use ships as part of the base model in Codex and the API.
The spreadsheet angle deserves its own paragraph. On an internal OpenAI benchmark of investment banking analyst modeling tasks — the kind of Excel work a junior IB analyst making $150K/year gets paid to do — GPT-5.4 scores 87.3% vs GPT-5.2’s 68.4%. That’s a 19-point jump on real financial modeling. The ChatGPT for Excel add-in launched the same day as GPT-5.4, which is not a coincidence.
Benchmark Performance
All benchmarks sourced from OpenAI’s official launch page. Reasoning effort set to xhigh unless noted.
| Benchmark | What It Tests | GPT-5.4 | GPT-5.3-Codex | GPT-5.2 |
|---|---|---|---|---|
| GDPval (wins or ties) | Professional knowledge work, 44 occupations | 83.0% | 70.9% | 70.9% |
| OSWorld-Verified | Desktop GUI operation (human = 72.4%) | 75.0% | 74.0% | 47.3% |
| SWE-Bench Pro (Public) | Real-world software engineering tasks | 57.7% | 56.8% | 55.6% |
| BrowseComp | Agentic web research, hard-to-find info | 82.7% | 77.3% | 65.8% |
| Toolathlon | Real-world multi-step tool/API use | 54.6% | 51.9% | 46.3% |
| MMMU-Pro (no tools) | Visual understanding and reasoning | 81.2% | — | 79.5% |
| IB Modeling Tasks (internal) | Spreadsheet financial modeling | 87.3% | 79.3% | 68.4% |
| WebArena-Verified | Browser-based task completion | 67.3% | — | 65.4% |
| Online-Mind2Web | Browser tasks (screenshot-only) | 92.8% | — | 70.9%* |
| ARC-AGI-2 (Verified) | Abstract pattern reasoning | 73.3% | — | 52.9% |
| GPQA Diamond | Graduate-level science questions | 92.8% | 92.6% | 92.4% |
*Online-Mind2Web GPT-5.2 comparison is ChatGPT Atlas Agent Mode (70.9%). Source: openai.com/index/introducing-gpt-5-4/
GPT-5.4 Pricing
| Plan | Price | GPT-5.4 Access | Notes |
|---|---|---|---|
| Free | $0/mo | Limited access | Throttled usage |
| Plus | $20/mo | GPT-5.4 Thinking (full) | Best value for individuals |
| Team | $25–$30/user/mo | GPT-5.4 Thinking (full) | Admin controls, no data training |
| Pro | $200/mo | GPT-5.4 Pro (max performance) | Extended compute, highest capability |
| Enterprise | Custom | GPT-5.4 Pro (via admin) | SSO, RBAC, custom retention |
API Pricing (per million tokens)
| Model | Input | Cached Input | Output |
|---|---|---|---|
| gpt-5.2 | $1.75 | $0.175 | $14.00 |
| gpt-5.4 | $2.50 | $0.25 | $15.00 |
| gpt-5.2-pro | $21.00 | — | $168.00 |
| gpt-5.4-pro | $30.00 | — | $180.00 |
Batch and Flex pricing available at 50% of standard rates. Priority processing at 2x. 1M context requests over 272K count at 2x rate. Source: openai.com/api/pricing
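The long-context surcharge interacts with the base rates, so a quick cost sketch helps. The rates below come straight from the table above; one assumption is flagged in the comments: we bill the entire request at 2x once it crosses 272K input tokens (OpenAI's phrasing, "requests over 272K count at 2x rate," could also mean only the overage is surcharged).

```python
# Cost sketch for GPT-5.4 API pricing (rates from the table above).
# Assumption: a request over 272K input tokens is billed entirely at 2x,
# per "requests over 272K count at 2x rate" (the overage-only reading
# would produce lower numbers).

RATES = {  # USD per 1M tokens: (input, cached input, output)
    "gpt-5.4": (2.50, 0.25, 15.00),
    "gpt-5.4-pro": (30.00, None, 180.00),
}

LONG_CONTEXT_THRESHOLD = 272_000  # tokens; beyond this the 2x multiplier applies


def request_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimated USD cost of one request."""
    inp, cached, out = RATES[model]
    mult = 2.0 if input_tokens > LONG_CONTEXT_THRESHOLD else 1.0
    uncached = input_tokens - cached_tokens
    cost = uncached / 1e6 * inp * mult + output_tokens / 1e6 * out * mult
    if cached_tokens:
        cost += cached_tokens / 1e6 * cached * mult
    return round(cost, 4)


# A 100K-input / 5K-output call on gpt-5.4:
print(request_cost("gpt-5.4", 100_000, 5_000))  # 0.325
```

At these rates a tool-heavy agent making thousands of calls per day feels the 47% tool-search reduction directly, which is why the pricing and the efficiency features belong in the same mental model.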
Key Features
1. Native Computer Use (First General-Purpose Model)
GPT-5.4 is the first mainline OpenAI model with native computer-use capabilities baked in — not bolted on as a separate API or beta feature. In Codex and the API, it can issue mouse and keyboard commands in response to screenshots, write Playwright scripts to operate browsers and desktop software, and operate across applications in long multi-step workflows. On OSWorld-Verified (desktop use), it hits 75.0% — past human performance at 72.4%. On WebArena-Verified (browser), it reaches 67.3%. On Online-Mind2Web (screenshot-only browser), 92.8%. The limitation: computer use is still early. Custom confirmation policies exist for safety, but you’re trusting an AI to operate real software with real consequences. Review what you’re authorizing.
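OpenAI's "custom confirmation policies" are configured through the API, and the actual schema isn't reproduced here. But the gating logic they describe can be sketched as a simple allow/confirm/deny classifier over proposed actions. Everything below is hypothetical naming for illustration, not OpenAI's API:

```python
# Illustrative sketch of a confirmation policy for computer-use actions.
# All class and field names are hypothetical; OpenAI's actual policy
# configuration may look quite different.

from dataclasses import dataclass


@dataclass
class Action:
    kind: str    # e.g. "click", "type", "file_write", "send_email"
    target: str  # what the action touches


# Actions with irreversible real-world consequences get a human in the loop.
REQUIRE_CONFIRMATION = {"file_write", "send_email", "external_api_call"}
ALWAYS_DENY = {"delete_volume"}


def policy(action: Action) -> str:
    if action.kind in ALWAYS_DENY:
        return "deny"
    if action.kind in REQUIRE_CONFIRMATION:
        return "confirm"  # pause and ask the operator
    return "allow"        # low-stakes, reversible: proceed


print(policy(Action("click", "Submit button")))     # allow
print(policy(Action("send_email", "client list")))  # confirm
```

The design point is that the policy lives outside the model: the model proposes, the harness disposes. Whatever the real configuration surface looks like, that separation is what "review what you're authorizing" means in practice.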
2. Tool Search — 47% Token Reduction
This is the feature developers will care about most from a cost perspective. Previously, every API request with tools had to include all tool definitions upfront — for large MCP server ecosystems this could mean tens of thousands of wasted tokens per call. GPT-5.4’s tool search gives the model a lightweight tool index; it fetches full definitions only when it needs to use a specific tool. Result: 47% fewer tokens on tool-heavy workflows (measured on MCP Atlas benchmark with all 36 servers enabled). That’s not a minor efficiency gain — for production agent pipelines, it’s a meaningful cost reduction. Limitation: tool search is a new system; routing accuracy depends on tool naming/descriptions in your index.
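The mechanics are worth sketching: instead of shipping every tool's full JSON schema with every request, the model sees a compact index and full definitions are resolved only for tools it actually selects. This toy version (tool names and the len/4 token estimate are illustrative, not OpenAI's implementation) shows why the savings scale with the number of registered tools:

```python
# Sketch of the tool-search idea: ship a compact index, resolve full
# definitions lazily. Token counts are rough (len // 4) for illustration.

FULL_DEFS = {
    "crm.search": '{"name":"crm.search","description":"Search CRM records",'
                  '"parameters":{"type":"object","properties":'
                  '{"query":{"type":"string"}}}}',
    "mail.send": '{"name":"mail.send","description":"Send an email",'
                 '"parameters":{"type":"object","properties":'
                 '{"to":{"type":"string"},"body":{"type":"string"}}}}',
    # ...imagine dozens more across 36 MCP servers
}


def approx_tokens(text: str) -> int:
    return len(text) // 4


def eager_prompt_tokens() -> int:
    """Old behavior: every full definition in every request."""
    return sum(approx_tokens(d) for d in FULL_DEFS.values())


def lazy_prompt_tokens(used: set) -> int:
    """Tool search: one short index entry per tool, full schema only when used."""
    index = sum(approx_tokens(name + ": " + FULL_DEFS[name][:40]) for name in FULL_DEFS)
    return index + sum(approx_tokens(FULL_DEFS[t]) for t in used)


print(eager_prompt_tokens(), lazy_prompt_tokens({"crm.search"}))
```

With two tools the gap is small; with hundreds of tools across 36 MCP servers, the eager cost grows linearly while the lazy cost grows only with the tools the model actually touches. That asymmetry is where the measured 47% comes from, and it is also why the article's caveat matters: if index entries (names and descriptions) are vague, the model fetches or picks the wrong tool.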
3. Steerability — Mid-Response Course Correction
GPT-5.4 Thinking in ChatGPT now surfaces an upfront thinking plan before executing. You see the model’s intended approach, can disagree with it, and redirect before it commits to a path. This is architecturally important for agentic workflows where an incorrect plan wastes compute and causes real-world side effects. The model also maintains longer thinking coherence — it can think harder on difficult tasks without losing context of earlier steps. The limitation: this is currently on chatgpt.com and Android; iOS support is coming soon.
4. 1 Million Token Context Window
GPT-5.4 supports up to 1,048,576 tokens of context in Codex and the API. This enables agents to hold entire codebases, lengthy document sets, or extended multi-session history without chunking. In practice, OpenAI’s long-context benchmarks show strong performance: 86.0% on MRCR v2 8-needle at 64K–128K, dropping to 36.6% at 512K–1M. The long tail of the context window degrades — this is expected behavior across all current models. Practical ceiling for reliable performance appears to be in the 128K–256K range. Requests over 272K count at 2x rate. The limitation: truly effective 1M-token use requires prompt engineering discipline to keep signal density high.
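Those numbers suggest a simple pre-flight check for agent builders: decide whether a payload sits in the reliable range, the surcharge zone, or past the hard cap before sending it. The thresholds below are taken from the paragraph above; the triage logic itself is an illustrative sketch, not OpenAI guidance:

```python
# Triage a payload against GPT-5.4's context behavior: reliable recall up
# to ~256K (per the benchmark figures above), 2x billing past 272K, and a
# hard cap of 1,048,576 tokens. Thresholds from the text; logic illustrative.

HARD_CAP = 1_048_576
SURCHARGE = 272_000   # requests beyond this bill at 2x
RELIABLE = 256_000    # practical ceiling for dependable recall


def context_plan(tokens: int) -> str:
    if tokens > HARD_CAP:
        return "split: exceeds the 1M hard cap"
    if tokens > SURCHARGE:
        return "caution: fits, but bills at 2x and recall degrades"
    if tokens > RELIABLE:
        return "caution: fits at standard rate, but near the reliable ceiling"
    return "ok: within the reliable range"


print(context_plan(120_000))
print(context_plan(600_000))
```

The useful habit is treating 256K as the working budget and reserving the 512K–1M range for cases where you have tested recall on your own data, not assuming the headline number.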
5. Reduced Hallucinations (+33% Factual Accuracy)
On de-identified real user prompts where factual errors were flagged, GPT-5.4’s individual claims are 33% less likely to be false, and full responses are 18% less likely to contain any errors vs GPT-5.2. This reflects continued progress at reducing confabulation — not elimination. GPT-5.4 still hallucinates. The 33% improvement is relative, not absolute. For any high-stakes factual output, verification remains non-negotiable.
6. High-Resolution Image Input (Original Detail Level)
GPT-5.4 introduces a new “original” image input detail level that supports up to 10.24M pixels or 6000-pixel maximum dimension — dramatically higher fidelity than the previous “high” level (now capped at 2.56M pixels / 2048px). This matters for precise localization tasks: identifying exact click coordinates in UI screenshots, reading dense data tables, analyzing high-resolution charts. OpenAI observed strong gains in localization, image understanding, and click accuracy in early testing.
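The two caps can be checked before upload. The limits below come directly from the paragraph above; the selection logic is an illustrative sketch (the real API may downscale oversized images rather than reject them):

```python
# Which detail level carries an image at full fidelity?
# Caps from the text above: "original" = 10.24M pixels / 6000px max side,
# "high" = 2.56M pixels / 2048px max side. Selection logic is illustrative.

LEVELS = [
    ("high", 2_560_000, 2048),
    ("original", 10_240_000, 6000),
]


def detail_level(width: int, height: int) -> str:
    for name, max_pixels, max_side in LEVELS:
        if width * height <= max_pixels and max(width, height) <= max_side:
            return name
    return "downscale required"


print(detail_level(1920, 1080))  # a 1080p screenshot fits "high"
print(detail_level(3840, 2160))  # 4K (8.29M px) needs "original"
print(detail_level(8000, 2000))  # exceeds both caps
```

For click-coordinate tasks the difference is concrete: a 4K screenshot downscaled into the old 2048px cap loses roughly half its linear resolution, which is exactly the precision a computer-use model needs to hit small UI targets.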
Who Is GPT-5.4 For?
Use GPT-5.4 if you:
- Build AI agents that need to operate real software, use large tool ecosystems, or run long multi-step workflows. This is the best agentic foundation model available today.
- Do professional knowledge work in spreadsheets, presentations, or documents and want AI that produces output at or near professional quality (87.3% on IB modeling tasks).
- Run API-heavy pipelines where tool-search’s 47% token reduction makes a real cost difference at scale.
- Need the best general-purpose reasoning model for complex tasks — GPT-5.4 beats Claude Opus 4.5 and Gemini 3.1 Pro on most hard benchmarks as of today.
- Are a ChatGPT Plus subscriber — GPT-5.4 Thinking replaced GPT-5.2 Thinking automatically for $20/mo with no upgrade needed.
Look elsewhere if you:
- Need low-cost, high-volume inference — GPT-5.4’s $15/M output tokens is not the right model for commodity-tier tasks. Use GPT-5 mini ($2/M output) or GPT-4.1-nano for that.
- Run production computer-use at scale without careful human oversight — computer use is early, confirmation policies need configuration, and real-world error recovery is still rough.
- Are on a free plan expecting full capability — free tier access is throttled. The real GPT-5.4 experience requires Plus ($20/mo) at minimum.
- Need truly reliable 1M context performance — long-tail context (512K–1M) degrades meaningfully. If your use case requires reliable recall past 256K, test thoroughly before deploying.
GPT-5.4 vs Competitors
| Feature | GPT-5.4 | Claude Opus 4.5 | Gemini 3.1 Pro | Grok 4.2 |
|---|---|---|---|---|
| Starting Price (API Input/M) | $2.50 | ~$3.00 | ~$1.25 | ~$5.00 |
| Context Window | 1M tokens | 200K tokens | 2M tokens | 128K tokens |
| Native Computer Use | ✅ Native (75% OSWorld) | ✅ Computer Use API | ⚠️ Limited | ❌ |
| Coding (SWE-Bench Pro) | 57.7% | ~55% | ~54% | ~56% |
| Knowledge Work (GDPval) | 83.0% | ~75% | ~74% | ~72% |
| Best For | Agents, knowledge work, coding | Long documents, nuanced writing | Multimodal, long context, cost | Real-time data, X integration |
| ChatGPT Integration | ✅ Native | Claude.ai only | Gemini.google.com only | Grok.x.ai only |
| Consumer Price (Full Access) | $20/mo (Plus) | ~$20/mo | ~$20/mo (Gemini Advanced) | $30/mo (SuperGrok) |
| Max Performance Tier | $200/mo (Pro) | ~$100/mo (Max) | ~$20/mo | $30/mo |
Competitor benchmarks are estimates based on publicly reported data where available. See our best AI chatbots comparison for full analysis, and our MiniMax M2.5 review for a frontier model benchmark deep-dive.
What They Don’t Advertise: The Real Controversies
Computer Use Safety Is Still Unsolved
GPT-5.4 can control your computer. Literally — it can move your mouse, type keystrokes, execute code, and interact with any software on a machine you connect it to. OpenAI ships “custom confirmation policies” for developers to configure what requires human approval. But the real-world attack surface is significant: prompt injection through websites the model browses, misinterpretation of ambiguous instructions in complex workflows, errors in multi-step operations with irreversible consequences. OpenAI has classified GPT-5.4 as “High cyber capability” under their Preparedness Framework. They know this is a different risk profile than a chatbot. Whether developer safeguards are configured correctly in every deployment is a different question.
The $200/Mo Pro Tier Premium
GPT-5.4 Pro costs $200/month — 10x the Plus price. What do you get? The Pro tier provides extended compute for maximum performance. On BrowseComp, GPT-5.4 Pro reaches 89.3% SOTA vs GPT-5.4’s 82.7%. On GDPval, the gap narrows to a single point (83.0% vs 82.0%). On ARC-AGI-2, it’s 83.3% vs 73.3%. That last gap is significant — but for most professional use cases (writing, coding, spreadsheets), the standard GPT-5.4 at Plus pricing closes the gap substantially. The Pro tier is for research-grade workloads, the hardest reasoning problems, and teams where the performance delta justifies 10x the cost. For most people, Plus at $20/mo delivers 95%+ of the capability.
Hallucinations Haven’t Been Solved
GPT-5.4 is 33% less likely to produce false individual claims than GPT-5.2. That’s meaningful progress. It still hallucinates. “33% fewer false claims” means false claims still exist. For medical, legal, financial, or safety-critical output, the verification requirement hasn’t changed. The improvement in GDPval and professional task benchmarks is real — but these are structured tasks with clear success criteria. Open-ended factual generation remains a known weakness of all current language models.
Environmental Cost of Frontier Compute
GPT-5.4 requires more compute than GPT-5.2 per equivalent task (higher per-token pricing reflects this). OpenAI’s token efficiency argument is that fewer tokens are needed per problem — but “most token-efficient reasoning model” still means significant compute at scale. OpenAI doesn’t publish energy consumption data for frontier model training or inference. This is an industry-wide transparency gap that’s worth flagging.
OpenAI Shipping While Anthropic Fights the Pentagon
Today, March 5, 2026, OpenAI is dropping GPT-5.4 while Anthropic is reportedly engaged in regulatory disputes with the Department of Defense over AI deployment restrictions. The strategic contrast is sharp: OpenAI is cozy with government, shipping aggressively, and expanding enterprise contracts. Anthropic is fighting oversight battles. Neither approach is inherently right or wrong — but the competitive dynamic is increasingly about government relationships, not just model benchmarks. OpenAI’s deployment velocity and regulatory alignment give it a structural advantage in government and enterprise that pure capability comparisons don’t capture.
Pros and Cons
Pros
- ✅ First general-purpose model to beat humans at computer use — 75.0% OSWorld vs 72.4% human baseline. Not marketing, actual benchmark data.
- ✅ 83% GDPval — matches professionals across 44 occupations. The most comprehensive measure of real-world AI utility available.
- ✅ 47% token reduction via tool search — meaningful cost savings for agent-heavy API workflows at scale.
- ✅ Computer use is in the mainline model — not a separate API, not a beta. It’s the same model you’re already using.
- ✅ Steerability with upfront thinking plan — actually useful for agentic safety; you can review and redirect before execution commits.
- ✅ 33% fewer false claims — measurable factual accuracy improvement over GPT-5.2 on real user prompts.
- ✅ 1M token context window — enables genuinely new use cases: full codebase analysis, massive document review, long-running agent memory.
- ✅ Fast mode in Codex (1.5x token velocity) — no intelligence tradeoff, just faster throughput. Developer experience improvement.
- ✅ 87.3% on IB analyst spreadsheet tasks — practical business value, not just academic benchmarks.
Cons
- ❌ $200/mo Pro tier — justifiable for power users, a hard sell for everyone else when Plus delivers most of the value at $20/mo.
- ❌ Computer use is still early — real-world error recovery, prompt injection risks, and complex workflow reliability are unsolved problems. Not production-ready for everything.
- ❌ Still hallucinates — 33% fewer false claims is progress, not a fix. Verification still required for high-stakes outputs.
- ❌ 1M context degrades past 256K — long-tail performance drops significantly. True 1M-token reliability is aspirational for most use cases today.
- ❌ $2.50/$15 API pricing vs GPT-5.2’s $1.75/$14 — higher per-token cost, though token efficiency may offset this for agentic workflows.
Getting Started with GPT-5.4
- ChatGPT users (Plus/Team/Pro): GPT-5.4 Thinking is already live — it replaced GPT-5.2 Thinking automatically. Open ChatGPT, verify you’re on GPT-5.4 in the model picker. GPT-5.2 Thinking remains available under Legacy Models until June 5, 2026.
- API developers: Update your model ID to `gpt-5.4`. For maximum performance: `gpt-5.4-pro`. Review OpenAI’s updated documentation for computer use best practices and the new `original` image detail parameter.
- Enable tool search: For API workflows with large tool ecosystems (especially MCP), implement the tool search capability to capture the 47% token reduction. Follow naming conventions carefully — tool routing depends on clear descriptions.
- Test computer use: In Codex, try the new Playwright (Interactive) skill for visual debugging. Start with low-stakes, reversible tasks to calibrate the model’s behavior before deploying autonomous workflows. Configure custom confirmation policies for anything involving file writes, emails, or external API calls.
- Enterprise users: Enable early access via admin settings. The ChatGPT for Excel add-in launched today — deploy it to IB analysts, finance teams, or anyone doing heavy spreadsheet work. The 87.3% benchmark on IB modeling tasks makes this one of the clearest enterprise ROI cases for AI we’ve seen.
Frequently Asked Questions
What is GPT-5.4?
GPT-5.4 is OpenAI’s current frontier AI model, released March 5, 2026. It’s available in ChatGPT (as GPT-5.4 Thinking for Plus, Team, and Pro subscribers), the OpenAI API (model ID: gpt-5.4), and Codex. It fuses the coding capabilities of GPT-5.3-Codex with GPT-5.2’s general reasoning, and adds native computer-use capabilities, a 1M token context window, and tool search for efficient agentic workflows. A GPT-5.4 Pro variant is available for maximum performance.
When was GPT-5.4 released?
GPT-5.4 was released on March 5, 2026. It began rolling out simultaneously in ChatGPT, the OpenAI API, and Codex on that date. ChatGPT Plus, Team, and Pro users received access immediately; Enterprise and Edu plan users can enable it via admin settings.
How is GPT-5.4 different from GPT-5.3?
GPT-5.3 was released as GPT-5.3-Codex — a specialist coding model. GPT-5.4 incorporates GPT-5.3-Codex’s coding capabilities into a general-purpose model while adding native computer use (75% OSWorld vs GPT-5.3-Codex’s 74%), a 12-point jump in GDPval professional work performance (83% vs 70.9%), tool search for agent efficiency, steerability with upfront thinking plans, and a 1M token context window. GPT-5.4 matches or outperforms GPT-5.3-Codex on coding benchmarks while adding broader capabilities.
What is GPT-5.4 Thinking?
GPT-5.4 Thinking is the name for GPT-5.4 as it appears in ChatGPT. It’s the same underlying model as the API’s gpt-5.4, but with ChatGPT-specific features enabled, including the upfront thinking plan (visible before the model executes), mid-response steering, and enhanced deep web research. GPT-5.4 Thinking replaced GPT-5.2 Thinking for ChatGPT Plus, Team, and Pro subscribers as of March 5, 2026.
Can GPT-5.4 control my computer?
Yes, in Codex and the API. GPT-5.4 is the first general-purpose OpenAI model with native computer-use capabilities. It can issue mouse and keyboard commands, write Playwright automation scripts, operate browsers and desktop software, and complete multi-step workflows across applications. It scored 75.0% on OSWorld-Verified, surpassing the 72.4% human performance baseline. Computer use requires developer configuration of the computer tool in the API and is not available in standard ChatGPT chat mode. OpenAI recommends configuring custom confirmation policies for autonomous workflows.
How much does GPT-5.4 cost?
In ChatGPT: Free plan gets limited access. Plus ($20/mo) and Team ($25–30/user/mo) get full GPT-5.4 Thinking access. Pro ($200/mo) gets GPT-5.4 Pro with maximum performance. In the API: GPT-5.4 costs $2.50 per million input tokens and $15.00 per million output tokens. GPT-5.4 Pro costs $30.00/$180.00 per million tokens. Batch processing is available at 50% of standard rates. Requests over 272K tokens in the 1M context window count at 2x rate.
Is GPT-5.4 better than Claude Opus 4.5?
On most published benchmarks as of March 5, 2026, GPT-5.4 leads. It scores 83.0% on GDPval vs Claude Opus 4.5’s approximately 75%, 75.0% on OSWorld-Verified (desktop computer use) vs Claude’s lower scores, and 57.7% on SWE-Bench Pro. Claude Opus 4.5 retains strengths in nuanced long-form writing, instruction following, and extended document analysis. For agentic workflows, coding tasks, and professional knowledge work by the numbers, GPT-5.4 is currently ahead. For raw writing quality and contextual nuance, Claude remains competitive.
What is GPT-5.4 Pro?
GPT-5.4 Pro is the maximum-performance variant of GPT-5.4. It’s available in ChatGPT Pro ($200/mo) and the API (gpt-5.4-pro at $30/$180 per million tokens). It delivers higher scores on the hardest benchmarks: BrowseComp 89.3% SOTA (vs GPT-5.4’s 82.7%), ARC-AGI-2 83.3% (vs 73.3%), and FrontierMath Tier 4 38.0% (vs 27.1%). For most professional use cases, standard GPT-5.4 at Plus pricing closes the gap to within a few percentage points. GPT-5.4 Pro is for research-grade reasoning, the hardest problem types, and production workloads where maximum performance justifies the cost.
Does GPT-5.4 have a 1 million token context window?
Yes. GPT-5.4 supports up to 1,048,576 tokens (approximately 1 million) in the API and Codex. However, in ChatGPT, context windows for GPT-5.4 Thinking remain the same as GPT-5.2 Thinking — the 1M window is an API/Codex feature. For requests over 272K tokens, usage counts at 2x the standard rate. Long-context performance is strong up to about 256K tokens; reliability degrades at 512K–1M in current benchmarks.
Is GPT-5.4 worth upgrading to?
For ChatGPT Plus subscribers: yes, automatically — GPT-5.4 Thinking replaced GPT-5.2 Thinking with no action required. For API developers: yes, especially if you run agent pipelines where tool search’s 47% token reduction pays back the higher per-token cost. For the Pro upgrade ($200/mo): only if you need maximum performance on the hardest reasoning tasks; standard GPT-5.4 handles 95%+ of professional use cases at Plus pricing. The computer use capabilities alone represent a step-change for automation use cases — if you build AI agents, upgrading to GPT-5.4 in the API is a clear call.
Final Verdict
Rating: 9.4/10
GPT-5.4 is the best general-purpose AI model available today, and it’s not close. The headline benchmark — 75.0% on OSWorld-Verified, clearing the 72.4% human performance baseline — represents the first genuine crossover moment for AI computer operation. But the more commercially significant number is 83.0% GDPval: matching professional human performance across 44 occupations. That’s not a party trick. That’s a model that does actual work at professional quality.
The 0.6 deduction is split between two honest limitations: computer use is still early (real-world error recovery, prompt injection risks, and autonomous workflow safety need more maturity before production deployment at scale), and the $200/mo Pro tier creates a pricing cliff that most users won’t need to cross — but will feel when they consider it.
Buy it now: If you’re an AI agent developer, a knowledge worker doing serious spreadsheet/presentation work, or a ChatGPT Plus subscriber who wants the best — this is your model. The upgrade from GPT-5.2 is material, not marginal.
Wait if: You need low-cost commodity inference (GPT-5 mini or GPT-4.1-nano are the right tools), or you’re considering the Pro tier without specific high-difficulty reasoning use cases — standard GPT-5.4 at Plus pricing is the right call for 90% of users.