The Rise of Computer Use and Agentic Coworkers



🎯 HOOK

Imagine delegating an entire business process (an office lease negotiation, a full marketing campaign, a financial reconciliation) to an AI that works without your involvement or explicit instruction. This vision of autonomous, task-oriented AI agents has long been the field’s north star. Recent advances in computer use are finally bringing it within reach.

💡 ONE-SENTENCE TAKEAWAY

Computer-using agents that can navigate browsers and desktops like humans represent the most significant advancement to date in replicating human labor capabilities, but their success depends on verticalized deployment, contextual tuning, and enterprise-specific integration, not just general AI power.

📖 SUMMARY

This a16z enterprise thesis examines the emergence of computer-using AI agents: systems that can operate within browser and desktop environments the way humans do, clicking through interfaces, logging into systems, and navigating legacy software without programmatic API access.

The authors trace how recent projects (OpenAI’s ChatGPT Agent, Anthropic’s Claude, Google’s Project Mariner, and startups like Manus and Context) have moved beyond the limitations of traditional robotic process automation (RPA). Unlike earlier automation tools that relied on intricate prompt engineering and predefined workflows, these agents can tackle complete, end-to-end digital workflows by accessing software through visual interfaces or Model Context Protocol (MCP) tool integrations.
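Whether reached through a visual interface or through MCP, a tool ends up exposed to the model as a declaratively described action. As a rough illustration of the shape of an MCP-style tool descriptor (field names follow the MCP tool schema; the specific tool and its parameters here are hypothetical):

```python
import json

# Hypothetical MCP-style tool descriptor: an agent could invoke this to
# submit an invoice in a billing system that lacks a public API.
submit_invoice_tool = {
    "name": "submit_invoice",
    "description": "Fill and submit the invoice form in the billing UI.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "amount_cents": {"type": "integer"},
            "due_date": {"type": "string", "description": "ISO 8601 date"},
        },
        "required": ["vendor", "amount_cents"],
    },
}

print(json.dumps(submit_invoice_tool, indent=2))
```

The point is the contract, not the plumbing: once actions are described this way, any reasoning model can discover and chain them.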

The article’s central argument is that computer use is the key enabler of true agents, because agent effectiveness depends on two things: the number of tools they can access, and the ability to reason across them. Computer use dramatically expands both, giving agents the breadth to work with any software and the intelligence to chain actions into full workflows.

For startups, the primary opportunity lies in properly verticalizing computer use. Enterprise software environments are highly specialized and unintuitive: different companies use the same software differently, implementing customized views, workflows, and data models. Providing the right context to a model for these environments is complicated (what counts as relevant context, and what is the best way to deliver it?), and solving this problem will define the first generation of agentic coworkers.

The article concludes that despite current limitations in capability (struggling with complex or unfamiliar interfaces) and efficiency (operating too slowly and expensively to compete with humans), substantial improvements are expected within 6–18 months, paving the way for agents that work across existing software stacks and optimize higher-level strategic objectives.

🔍 INSIGHTS

Core Insights:

  • Beyond RPA: Current agent offerings more closely resemble advanced RPA tools than true autonomous systems. Computer use breaks through this ceiling by giving agents access to any software humans interact with, bypassing the traditional need for APIs or manually programmed tools.

  • The Multiplicative Interplay of Tools and Reasoning: The potential of computer-using AI agents emerges from the combination of tool accessibility and reasoning capability. As agents gain broader toolsets and simultaneously become better at using them, the range and complexity of workflows they can handle compounds multiplicatively. Emergent capabilities may arise when agents solve context retrieval by autonomously exploring, retrieving, and synthesizing context over bespoke sequences of actions.

  • Verticalization is the Moat: It is unlikely that a computer-using agent trained solely on general software will navigate complex enterprise software environments out-of-the-box. Highly focused startups, rather than model providers, will be best positioned to address vertical- and company-specific challenges, including how to provide contextual grounding that goes beyond adding text to a prompt.

  • Two Model Approaches Converge: Pixel-based models (operating on screenshots to generate mouse or keyboard actions) and DOM/code-based models (processing structured HTML and accessibility trees) are both being explored. The market indicates that DOM/code-based approaches alone are often good enough for most tasks, with higher accuracy and much lower latency than pixel-based approaches.

  • Durable Orchestration is Foundational: Long-running multi-step workflows require workflow engines that persist event histories, enforce retries, and resume computation after faults. Solutions like Inngest, Temporal, Azure Durable Functions, and AWS Step Functions provide the fault-tolerance that production-grade agent deployment demands.
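The persistence-and-replay idea behind these engines can be sketched in a few lines (a toy journal, not any particular engine's API): each completed step's result is recorded to disk, so a re-run after a crash skips finished steps instead of repeating their side effects.

```python
import json
import os
import tempfile

class DurableRun:
    """Toy durable-execution journal: completed steps replay from disk."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.history = {}
        if os.path.exists(journal_path):
            with open(journal_path) as f:
                self.history = json.load(f)

    def step(self, name, fn):
        # On re-run after a fault, return the recorded result instead of
        # re-executing the (possibly non-idempotent) side effect.
        if name in self.history:
            return self.history[name]
        result = fn()
        self.history[name] = result
        with open(self.journal_path, "w") as f:
            json.dump(self.history, f)
        return result

journal = os.path.join(tempfile.mkdtemp(), "reconcile.json")
run = DurableRun(journal)
invoices = run.step("fetch_invoices", lambda: ["inv-1", "inv-2"])
matched = run.step("match_payments", lambda: len(invoices))
print(matched)  # 2
```

Production engines add much more (timers, queues, versioning), but this journal-and-replay core is what lets a multi-hour agent workflow survive process restarts.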

Broader Connections:

  • AI as Labor Arbitrage: The primary opportunity around AI has been automating work and capturing labor spend. Computer use represents the most significant advancement to date in replicating human labor capabilities, particularly for the long tail of software tools that lack API access.

  • The Legacy Software Problem: Many enterprises run on Epic, SAP, and Oracle systems. Computer-using agents with reasoning abilities and GUI navigation capabilities fill the gaps that prevented end-to-end automation of work with these systems.

  • Agent Swarms: In the near future, swarms of agents will work together, staying in sync with each other and their human counterparts through existing systems of record and communication channels, combining domain-specific capabilities with horizontal competencies.

🛠️ FRAMEWORKS & MODELS

1. The Computer-Using Agent Stack:

  • Interaction Frameworks: structured ways for models to perceive and act on interfaces. Examples: OmniParser, Stagehand, Browser-Use, Cua, Skyvern.
  • Models: the decision-making core that interprets inputs and emits commands. Examples: Claude 4 Sonnet, UI-TARS, Qwen-VL, OpenCUA.
  • Durable Orchestration: workflow engines persisting event histories and enforcing retries. Examples: Inngest, Temporal, Azure Durable Functions, AWS Step Functions.
  • Browser Control Layers: abstractions for issuing commands to browsers. Examples: CDP (Chrome DevTools Protocol), Playwright, Puppeteer, custom Cua layers.
  • Browsers: execution substrates rendering the interfaces where agents act. Examples: Chromium-based browsers, Lightpanda.
  • Execution Environments: cloud and desktop infrastructure for scaling agent sessions. Examples: Anchor Browser, Browserbase, Steel, Hyperbrowser, Scrapybara.
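Tying the layers together, a computer-using agent is at heart a perceive-decide-act loop: the interaction framework produces an observation, the model emits a command, and the browser control layer executes it. A skeletal version, with the model and the environment stubbed out for illustration:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # e.g. "click", "type", "done"
    target: str = ""

def fake_model(observation: str) -> Action:
    # Stand-in for the model layer: real systems call a vision-language
    # or DOM-based model here to pick the next command.
    if "login" in observation:
        return Action("click", target="#login-button")
    return Action("done")

def run_agent(get_observation, execute, max_steps=10):
    trace = []
    for _ in range(max_steps):
        action = fake_model(get_observation())
        trace.append(action)
        if action.kind == "done":
            break
        execute(action)     # browser control layer applies the action
    return trace

# Stubbed environment: first a login page, then a dashboard.
screens = iter(["login page", "dashboard"])
trace = run_agent(lambda: next(screens), lambda a: None)
print([a.kind for a in trace])  # ['click', 'done']
```

Every framework in the stack above is, in effect, a more robust implementation of one of these three callbacks: observation, decision, or execution.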

2. Agent Capability Roadmap (6–18 months):

  • Capability: currently struggles with complex or unfamiliar interfaces. Near-term improvement: constrained domains plus task-specific context at inference; synthetic interaction traces for training.
  • Efficiency: currently too slow and expensive to compete with humans. Near-term improvement: vision-language model compression/distillation, quantization, caching, and rule-based controllers for routine inputs.
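Of the efficiency levers listed, caching is the simplest to prototype: if the same screen state and instruction recur, reuse the previously inferred action instead of paying for another model call. A minimal sketch, with the expensive model call stubbed:

```python
import hashlib

class ActionCache:
    """Cache model-inferred actions keyed by (screen state, instruction)."""

    def __init__(self, infer):
        self.infer = infer      # the expensive model call
        self.cache = {}
        self.misses = 0

    def get_action(self, screen: bytes, instruction: str):
        key = hashlib.sha256(screen + instruction.encode()).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.infer(screen, instruction)
        return self.cache[key]

cache = ActionCache(infer=lambda s, i: {"kind": "click", "target": "#submit"})
a1 = cache.get_action(b"screenshot-bytes", "submit the form")
a2 = cache.get_action(b"screenshot-bytes", "submit the form")
print(cache.misses)  # 1: the second call is served from cache
```

In practice the key would be a fuzzier fingerprint of the screen (exact pixel hashes break on timestamps and cursors), but the cost structure is the same: repeated routine states stop incurring inference.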

3. Domain-Specific Agent Profiles:

  • Marketing Agent: Handles audience segmentation, creative ad generation, A/B testing, budget optimization, campaign monitoring, and reporting.
  • Finance Agent: Manages financial reconciliation, fraud detection, budgeting, invoice processing, and regulatory-compliant reporting.
  • Sales Agent: Identifies high-potential prospects, performs personalized outreach, schedules meetings, analyzes call transcripts, and updates CRM data in real time.

💬 QUOTES

  1. “Imagine being asked to find new office space for your company… Now imagine delegating the entire process to an AI: identifying requirements, researching locations, scheduling tours, negotiating leases, even handling insurance and unexpected issues; all without your involvement or explicit instructions.”

    Context: The authors’ vision of what true AI agents should be able to accomplish. Significance: Frames the standard against which current agent offerings fall short, setting clear aspirational benchmarks.

  2. “What we’ve heard from the market is that this approach alone is good enough for most tasks, in many cases having higher accuracy and much lower latency than the pixel-based approach.”

    Context: On DOM/code-based LLMs versus pixel-based models. Significance: Counterintuitive market feedback that structured approaches often outperform pure vision; important signal for architecture decisions.

  3. “The challenge ahead is not proving whether agents can work, but shaping how they are tuned, contextualized, and deployed within real enterprises.”

    Context: The authors’ conclusion on where startup opportunity lies. Significance: Shifts the competitive question from technology (which is commoditizing) to integration and context, where startups have structural advantages over model providers.

  4. “Startups that master this contextualization will have a distinct advantage in delivering capable and customized agents to enterprises.”

    Context: On verticalization strategy. Significance: Identifies contextualization (understanding enterprise-specific workflows, providing relevant context at inference, and navigating the tension between preserving old processes and reinventing them) as the core defensible moat.


⚡ APPLICATIONS

Practical Guidance:

  • Evaluate Agent Readiness by Workflow Tail: Identify processes in your organization that have been resistant to automation due to legacy software gaps. These are prime targets for computer-using agents; start with high-volume, rules-defined workflows that require GUI navigation.

  • Assess Context Requirements Honestly: Before deploying a general agent, catalog what contextual knowledge it would need: onboarding documentation, browser action recordings, SOPs, and institutional knowledge that humans acquire through training. Factor this context-provisioning work into deployment timelines.

  • Build Agent Training Infrastructure: Invest in capturing interaction traces of your top performers executing key workflows. These traces, replayed in sandboxed environments, are the highest-quality training data for fine-tuning agents on your specific systems.

  • Separate Routine from Complex Tasks: Route straightforward inputs (keystrokes, simple clicks, standard form fills) to rule-based controllers. Reserve LLM inference for decisions requiring reasoning. This hybrid approach reduces cost and latency dramatically.

  • Plan for Durable Orchestration: Any workflow that spans more than a few seconds needs durable execution infrastructure. Don’t build agents on brittle synchronous calls; design for fault tolerance, retries, and state persistence from the start.
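The routine-versus-complex split described above can be as simple as a dispatch table of deterministic handlers with a model fallback. A sketch under those assumptions (the handler and the model call are illustrative stubs):

```python
def rule_fill_date(task):
    # Deterministic handler for a routine input: no inference needed.
    return {"kind": "type", "target": "#date", "text": task["value"]}

ROUTINE_HANDLERS = {"fill_date": rule_fill_date}

def expensive_model_call(task):
    # Stand-in for LLM inference, reserved for tasks requiring reasoning.
    return {"kind": "plan", "detail": f"reason about {task['type']}"}

def route(task):
    handler = ROUTINE_HANDLERS.get(task["type"])
    return handler(task) if handler else expensive_model_call(task)

print(route({"type": "fill_date", "value": "2025-01-31"})["kind"])  # type
print(route({"type": "dispute_invoice"})["kind"])                   # plan
```

The cost win comes from the shape of real workloads: in a long workflow, the overwhelming majority of actions are routine, so most steps never touch the model.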

For Founders and Builders:

  • Win by being the context layer: Model capabilities will commoditize. The defensible position is understanding an industry, its workflows, and its data well enough to provide the right context to any underlying model.

  • Start with a wedge, expand horizontally: Begin by dominating one function (e.g., finance reconciliation or sales outreach) within one industry before expanding. The contextual learning compounds.

  • Invest in sandbox training environments: Simulating enterprise software environments at scale for training is infrastructure that creates durable advantage. Companies that can generate millions of synthetic interaction traces in safe replicas will train better agents.

  • Build evaluation harnesses: OSWorld and similar benchmarks are starting points, but your real benchmark is how well agents perform on your customers’ actual workflows. Invest in proprietary evaluation suites.
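A proprietary evaluation suite does not need to be elaborate to be useful: even a table of customer workflows with expected outcomes, scored per run, beats leaderboard numbers. A minimal sketch, with a hypothetical case table and the agent under test stubbed:

```python
def evaluate(agent, suite):
    """Run each workflow case through the agent and report the pass rate."""
    results = []
    for case in suite:
        outcome = agent(case["input"])
        results.append({"name": case["name"],
                        "passed": outcome == case["expected"]})
    passed = sum(r["passed"] for r in results)
    return passed / len(results), results

# Hypothetical cases drawn from a customer's actual workflows.
suite = [
    {"name": "reconcile_invoice", "input": "inv-42", "expected": "matched"},
    {"name": "update_crm", "input": "lead-7", "expected": "updated"},
]

stub_agent = lambda task: "matched" if task.startswith("inv") else "failed"
rate, results = evaluate(stub_agent, suite)
print(rate)  # 0.5
```

Tracking this pass rate per customer and per workflow over time is what turns anecdotal "the agent seems better" into a deployable signal.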

Pitfalls to Avoid:

  • Assuming general agents suffice: A model that can use computers generally is not the same as one that can use your specific enterprise software. Context is not optional; it is the product.

  • Over-indexing on benchmark performance: Top leaderboard scores don’t guarantee real-world effectiveness. Enterprise software environments diverge significantly from benchmarks in layout complexity and workflow specificity.

  • Neglecting latency and cost: Agent-per-action inference costs add up in long workflows. Design for efficiency from day one, not as an optimization after deployment.

  • Treating legacy systems as bugs rather than features: Companies use SAP and Epic not because they want to, but because they have to. Agents that respect and navigate these systems rather than trying to replace them will win enterprise adoption.

📚 REFERENCES

Key Sources:

  • a16z. (2025). “The Rise of Computer Use and Agentic Coworkers.” Andreessen Horowitz Enterprise.
  • a16z. (2025). “RIP to RPA: The Rise of Intelligent Automation.”
  • a16z. (2025). “A Deep Dive into MCP and the Future of AI Tooling.”
  • a16z. (2025). “AI Turns Capital to Labor.”
  • a16z. (2024). “What Is an AI Agent?” a16z Podcast.
  • a16z. (2025). “Why the World Still Runs on SAP.”
  • a16z. (2025). “Your Data Agents Need Context.”

Technical Benchmarks and Research:

  • OSWorld leaderboard (computer-using agent evaluation)
  • OmniParser (pixel-to-element-graph conversion)
  • Stagehand (act() and extract() APIs over DOM-filtered accessibility trees)
  • Browser-Use, Cua, Skyvern (visual grounding with structured control)

Companies and Products Mentioned:

  • Anthropic Claude, OpenAI ChatGPT Agent, Google Project Mariner
  • Manus, Context, Deeptune
  • Inngest, Temporal, Azure Durable Functions, AWS Step Functions
  • Anchor Browser, Browserbase, Steel, Hyperbrowser, Kernel, Scrapybara
  • Lightpanda, Simular S2

Frameworks and Protocols:

  • MCP (Model Context Protocol)
  • CDP (Chrome DevTools Protocol)
  • DOM-based and accessibility-tree approaches

Traction Indicators:

  • Claude for Chrome extension embedding agent control directly in the browser
  • Enterprise adoption of computer-using agents for legacy software (SAP, Epic, Oracle)
  • Growing ecosystem of infrastructure startups around browser automation and durable orchestration

Crepi il lupo! 🐺