Computer Use vs Tool Use: Why Tools Should Be First-Class Citizens in Agentic AI
Computer-use demos are impressive, but agentic automation built on tools outperforms them by 10x. METR's research shows that AI agent workflows struggle with long action sequences, making tool-first agentic process automation the path to reliable agents.
The Problem with Computer Use
Computer use is when an AI operates an interface the way a human would: taking screenshots, clicking buttons, typing into fields. It's impressive in a demo, but something about it has always felt wrong. I've heard it put like this: "It's like consuming 10 gallons of gas to go a mile."
Claude Code and Claude Coworker feel far more useful than Claude Computer Use or GPT Agent. Manus is a world-class general agent not because of computer use, but because it can browse the web through a headless browser and run commands on a virtual machine.
Digging into METR's evaluation of AI agents completing real-world tasks reveals a clear pattern: agents struggle far more with stringing together long sequences of actions than with knowledge or skills.
The Experiment That Reveals Everything
Imagine a simple task: categorize 50 invoices into folders by vendor name. A computer-use agent screenshots the file manager, visually locates each invoice, clicks, drags, drops, then re-screenshots to confirm: hundreds of fragile GUI actions in a row. A tool-based agent lists the files, reads each vendor name, and issues one move command per invoice.
Same outcome. Radically different reliability.
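To make the tool side concrete, here is a minimal sketch of what the tool-based agent effectively runs: a handful of atomic, verifiable operations per invoice. The folder layout and the extract_vendor helper are assumptions for illustration, not a prescribed implementation.

```python
from pathlib import Path
import shutil

INVOICE_DIR = Path("invoices")           # assumed: a flat folder of invoice PDFs
SORTED_DIR = Path("invoices_by_vendor")  # assumed destination root

def extract_vendor(invoice: Path) -> str:
    """Hypothetical helper: take the vendor from the filename prefix,
    e.g. 'acme_2024-03.pdf' -> 'acme'. A real agent might parse the PDF instead."""
    return invoice.stem.split("_")[0].lower()

def categorize_invoices() -> None:
    for invoice in sorted(INVOICE_DIR.glob("*.pdf")):
        vendor_folder = SORTED_DIR / extract_vendor(invoice)
        vendor_folder.mkdir(parents=True, exist_ok=True)  # idempotent
        shutil.move(str(invoice), str(vendor_folder / invoice.name))

if __name__ == "__main__":
    categorize_invoices()
```

Each call either succeeds or fails loudly, so the agent can check and retry a single step instead of replaying a fragile click sequence.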
Why Tools Win: The Math of Error Compounding
Agent error rate compounding has been a long-cited problem among folks bearish on AI. The math is simple: if each step succeeds with probability p, a chain of n independent steps succeeds with probability p^n, so even a small per-step error rate becomes crushing over long sequences.
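A quick sketch of that curve, using the 95% per-step figure quoted later in this post and a rough 600-step stand-in for a long computer-use session:

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step of an n-step chain succeeds,
    assuming independent per-step failures."""
    return per_step ** steps

for steps in (5, 20, 100, 600):
    print(f"95% per step, {steps:>3} steps -> {chain_success(0.95, steps):.1%}")

# 95% per step,   5 steps -> 77.4%
# 95% per step,  20 steps -> 35.8%
# 95% per step, 100 steps -> 0.6%
# 95% per step, 600 steps -> 0.0%
```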
This isn't theoretical. In the METR paper, current frontier models have a near-100% success rate on tasks that take humans under 4 minutes, but success drops below 10% on tasks that take over 4 hours. This is a problem of chain length, not capability.
The Signal-to-Noise Problem
Error compounding is only half the story. The other half is information quality. When an agent calls ls -la, it receives structured, complete data: names, sizes, permissions, and timestamps for every file. When it takes a screenshot, it must interpret millions of pixels to recover a partial, noisy representation of the same state, and it can only see what's visible on screen, not the 180 files below the fold.
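As a rough illustration of the gap between the two observation channels, compare what each one hands back to the model. The 200-entry directory and the 1920x1080 screenshot size are assumptions for a typical desktop session:

```python
import json
import os

def list_directory(path: str) -> list[dict]:
    """Structured observation: every entry, machine-readable, nothing below the fold."""
    return [
        {"name": e.name, "bytes": e.stat().st_size, "is_dir": e.is_dir()}
        for e in os.scandir(path)
    ]

# Structured channel: a few dozen bytes per entry, so even a 200-file
# directory is only a few KB of lossless state the model can reason over.
listing = list_directory(".")
print(len(json.dumps(listing)), "bytes of complete, structured state")

# Pixel channel: one 1920x1080 RGB screenshot is ~6 MB of raw pixels that
# shows only the rows currently scrolled into view and still has to be
# decoded by visual reasoning before any of it becomes usable state.
print(1920 * 1080 * 3, "bytes of partial, unstructured state")
```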
Decision Framework
So the question we should all be asking is: when do I use which approach?
It comes down to a single test: can I resolve my action space to a set of tools? The tradeoffs stack up like this (a small dispatch sketch follows the comparison):
Tools
- Specific: limited to the available tools
- Robust: atomic operations
- Efficient: structured I/O
- First choice

Computer use
- General: can do anything a human can
- Fragile: errors compound
- Expensive: spends tokens on visual processing
- Last resort / escape hatch
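As a sketch of how that framework might look inside an agent loop, here is a hypothetical dispatcher that reaches for a native tool first and only falls back to computer use when no tool covers the requested action. The tool registry and computer_use_fallback are placeholders, not a real API:

```python
import os
import shutil
from typing import Any, Callable

# Hypothetical registry of native tools (assumed names, not a real API).
TOOLS: dict[str, Callable[..., Any]] = {
    "list_files": lambda path: sorted(os.listdir(path)),  # structured I/O
    "move_file": lambda src, dst: shutil.move(src, dst),  # atomic operation
}

def computer_use_fallback(action: str, **kwargs: Any) -> Any:
    """Escape hatch: drive the GUI with screenshots and clicks.
    General but fragile and expensive, so it comes last."""
    raise NotImplementedError(f"no native tool for {action!r}; start a computer-use session")

def dispatch(action: str, **kwargs: Any) -> Any:
    """First choice: a native tool with structured I/O. Last resort: computer use."""
    tool = TOOLS.get(action)
    if tool is not None:
        return tool(**kwargs)
    return computer_use_fallback(action, **kwargs)

# Example: a resolved action goes to a tool call; anything else hits the escape hatch.
print(dispatch("list_files", path="."))
```

The point isn't the code; it's the ordering: resolve as much of the action space as possible into the registry, and treat the fallback as the exception, not the default.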
What the Research Shows
This isn't just intuition—research consistently validates the tool-first approach:
Key Research Findings
METR Study: AI Tools Slowed Developers by 19%
In a study with 16 experienced open-source developers completing 246 issues, AI tools actually slowed them down—despite developers believing they were faster. Tasks involving long action chains showed the steepest degradation.
→ METR Research Blog
OSWorld: Humans 72% vs Best AI 12%
On the OSWorld benchmark of 369 real computer tasks, humans accomplish 72.36% while the best AI models initially achieved only 12.24%. Recent advances with reasoning models have pushed this to ~60%, but the gap remains significant.
→ OSWorld Benchmark
Error Compounding: 95% Per-Step → 36% Over 20 Steps
Research shows that when agents operate with 95% reliability per step, success drops to just 36% over 20-step workflows. For 600-step computer use sequences, the math becomes catastrophic.
→ Superface: The AI Agent Reality Gap
Tool Invocation Improves Accuracy 2-3x
The OSWorld-MCP benchmark shows MCP tools dramatically improve task success: OpenAI o3 jumped from 8.3% to 20.4%, and Claude improved from 40.1% to 43.3%. Yet even the best performers only invoke tools 36.3% of the time when available.
→ OSWorld-MCP Paper (arXiv)
Even Anthropic Acknowledges the Gap
When launching computer use, Anthropic explicitly noted it remains "experimental—at times cumbersome and error-prone" and recommended developers "begin exploration with low-risk tasks."
→ Anthropic: Introducing Computer Use
The pattern is clear across all these studies: agents that minimize action chains and maximize structured tool interactions consistently outperform those that rely on computer-use style interactions.
Conclusion
The agent hype cycle has fixated on the most general capability (computer use) while undervaluing the most reliable one (tools). METR's research makes the tradeoff clear: generality comes at the cost of compounding errors and noisy information.
For agents that need to complete real work reliably, tools aren't a limitation; they are the design itself:
Agent Design = Tool Design that resolves your Action Space into native tools
Use computer use as the escape hatch for everything else. Your success rates will thank you.