GPT-5.4 Sets New Records for Computer Use — AI That Operates Your PC
Released March 5, 2026, GPT-5.4 achieved record scores on every major computer-use benchmark. It can browse the web, fill forms, and operate software autonomously. Here's what that actually means in practice.
◎ What happened?
On March 5, 2026, OpenAI shipped GPT-5.4 — a focused update to GPT-5 that pushed computer-use capabilities far beyond any previous model.
Key benchmark results at launch:
- OSWorld-Verified: Record score — surpassing all prior models including Claude Sonnet 4.5
- WebArena Verified: Record score on automated web navigation tasks
- GDPval: 83% on OpenAI's internal test for knowledge work tasks — the first time any model crossed 80%
- Context window: Expanded to 272K tokens, up from GPT-5's original 128K
GPT-5.4 can operate a computer the way a human does — clicking buttons, navigating browsers, reading screens, filling out forms, and executing multi-step workflows across applications. It doesn't just generate code; it can run the code, see the result on screen, and fix errors in a loop.
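To make the "run, see the result, fix errors" loop concrete, here is a minimal sketch in Python. It is not how GPT-5.4 works internally: it observes text output from a child process instead of a screen, and `toy_fixer` is a hypothetical stand-in for the model call that only knows how to patch one typo. The function names and the `max_attempts` cap are assumptions made for illustration.

```python
import subprocess
import sys
from typing import Callable, Tuple


def run_snippet(source: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run Python source in a child process and return (succeeded, output)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def run_and_repair(
    source: str,
    propose_fix: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[bool, str, int]:
    """Act (run the code), observe (read the output), revise, and repeat.

    Returns (succeeded, last source that was run, number of runs).
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        ok, output = run_snippet(source)
        if ok:
            return True, source, attempts
        # Only ask for a revision if there is still a run left to try it.
        if attempts < max_attempts:
            source = propose_fix(source, output)
    return False, source, attempts


def toy_fixer(source: str, error_output: str) -> str:
    """Hypothetical stand-in for a model: repairs a single known typo."""
    if "NameError" in error_output:
        return source.replace("pritn", "print")
    return source
```

Calling `run_and_repair("pritn('hello')", toy_fixer)` fails on the first run with a `NameError`, hands the error text to the fixer, and succeeds on the second run. A real agent would swap the fixer for a model call and the text output for screenshots, but the control flow stays the same: bounded retries, with each attempt driven by what the previous one produced.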
OpenAI also shipped GPT-5.5 Instant alongside it — a 400K token context window model with reasoning capabilities priced for high-volume API workloads.
◈ Why does it matter?
Computer use is the threshold where AI transitions from answering questions to getting things done on your behalf.
The implications of an 83% GDPval score are significant. GDPval tests whether a model can complete real knowledge work tasks: researching topics, synthesizing documents, drafting reports, filling in spreadsheets, and sending structured outputs to systems. At 83%, GPT-5.4 completes most of these tasks without human intervention, though roughly one in six still falls short.
This is the beginning of a genuine agentic economy:
- Customer support workflows that don't require a human in the loop
- Data entry and form processing that runs overnight autonomously
- Software QA agents that can actually use the UI, not just call APIs
- Personal productivity agents that manage calendars, email, and documents
The model has limitations: it still makes errors on complex visual tasks and, by design, struggles with CAPTCHAs. But the trajectory is clear.
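Those errors matter more as workflows get longer. As a rough back-of-the-envelope sketch, suppose each step of a workflow succeeds independently with the same probability, and borrow the 83% GDPval figure as that probability. Both are simplifying assumptions: GDPval scores whole tasks, not individual steps, and real failures are rarely independent. The function name is illustrative.

```python
def workflow_success_rate(per_step_rate: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_rate <= 1.0:
        raise ValueError("per_step_rate must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_rate ** steps


if __name__ == "__main__":
    for steps in (1, 3, 5):
        rate = workflow_success_rate(0.83, steps)
        print(f"{steps} step(s): {rate:.2f}")
```

Under these assumptions a five-step chain finishes cleanly only about 39% of the time (0.83 to the fifth power), which is why unattended pipelines usually add checkpoints or retries between steps.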
◇ Should you switch?
For agentic and automation tasks: yes, GPT-5.4 is now the model to beat.
Claude Sonnet 4.5 held the computer-use crown through most of 2025 (61.4% on OSWorld). GPT-5.4 has surpassed it on that specific benchmark. For teams building automation pipelines that interact with UIs, this matters.
For pure coding and long-context reasoning: Claude Fable 5 / Sonnet 4.6 still lead. This isn't a "GPT wins everything" moment — the models trade leadership on different dimensions.
Practical advice:
- Building a web scraping / form automation agent → evaluate GPT-5.4 first
- Building a code review or refactoring agent → evaluate Claude Fable 5 first
- High-volume API apps → GPT-5.5 Instant's pricing is designed for this
✦ Who should care?