Steven Gonsalvez

Software Engineer


Browser Tools for AI Agents Part 2: The Framework Wars (browser-use, Stagehand, Skyvern)

Part of the Browser Tools for AI Agents series

browser-use ai, stagehand browser, skyvern automation, ai browser framework comparison, dom vs vision ai, ai agent browser tools, expect testing, llm browser automation 2026

In Part 1 we covered the browser infrastructure layer. The plumbing. Remote browsers, CDP, the headless Chromium sprawl. If you missed it, go read that first because this one builds directly on top of it.

Now we're going up a level. The frameworks. The SDKs. The bits that actually let your agent do things in a browser instead of just staring at one. Same lens as Part 1: this is about giving your coding agents the tools for a closed loop of research, implementation, and validation. Not consumer agentic browsers.

And here's where it gets properly interesting, because there's a civil war happening in this space and most people haven't noticed yet. On one side: DOM-first. On the other: vision-first. And in the middle, a messy hybrid zone where the most pragmatic engineering is happening.

The Architecture Split That Defines Everything

Before we get into individual tools, you need to understand the fundamental schism. When an AI agent needs to interact with a web page, it has to see the page somehow. There are exactly three ways to do this.

DOM-first means you parse the HTML, extract the accessibility tree, convert it to text, and feed that text to your language model. The model reasons over structured data. It knows there's a button with aria-label "Submit" and roughly where it sits on the page. This is fast, token-efficient, and works brilliantly on well-structured modern web apps.

Vision-first means you take a screenshot and feed the raw image to a vision-capable model. The model looks at what a human would see. No DOM parsing. No accessibility trees. Just pixels. This is slower, more expensive per step, and burns through your token budget like a weekend in Shoreditch burns through your wallet. But it works on anything. Canvas apps, PDFs rendered in-browser, legacy portals with obfuscated markup, pages where the DOM is a lying mess of nested iframes and shadow roots.

Hybrid means you do both and let the model pick. Or you use the DOM for structured elements and fall back to vision when the DOM goes sideways.
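The hybrid fallback is simple enough to sketch. Here's an illustrative pure-Python version of the control flow (the parsing heuristic and string formats are invented for the example; real frameworks do far more at each step):

```python
def dom_to_text(html):
    """Toy DOM parser: returns None when the page is effectively unparseable.
    (Hypothetical heuristic, for illustration only.)"""
    if "<canvas" in html and "<button" not in html:
        return None
    return html.strip()

def observe(page_html, screenshot=None):
    """Hybrid observation: try the cheap DOM path first, fall back to pixels."""
    dom_text = dom_to_text(page_html)
    if dom_text is not None:       # DOM parsed cleanly: fast and token-light
        return "DOM:" + dom_text
    if screenshot is not None:     # DOM is a mess: pay for a vision call instead
        return "VISION:<screenshot bytes>"
    raise RuntimeError("no usable observation")
```

The economics follow directly from which branch you land in: the first branch is a few hundred tokens of text, the second is a full image through a vision model.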

Every framework in this post has picked a side in this war. That choice cascades into everything: speed, cost, accuracy, which websites it can handle, and what breaks when a site redesigns.

Right. Let's meet the contenders.

browser-use: The Python Gorilla

85,000 GitHub stars. 9,000 forks. MIT licensed. If you've Googled "AI browser automation" at any point in the last year, you've found browser-use. It's the 800-pound gorilla of this space.

The architecture is honest and simple. Playwright underneath (so Chromium, Firefox, or WebKit). An agent loop on top that takes your natural language instruction, observes the browser state, picks an action, executes it, observes again. Rinse, repeat. It supports basically every LLM you'd want: Claude, GPT-4o, Gemini, their own ChatBrowserUse model, even local stuff via Ollama if you're feeling brave.

The 89.1% WebVoyager score gets thrown around a lot in their marketing. And look, that's legitimately strong. On the benchmark. In production, your mileage varies depending on site complexity, auth flows, CAPTCHAs, and how much the target site hates bots.

Here's the thing that nags at me though. Every single step in the agent loop requires an LLM call. Every observation, every action decision, every verification. You're paying per step. For a ten-step workflow on GPT-4o, that's maybe $0.15-0.30. Run that a thousand times a day for a client and your CFO starts asking questions. Run it ten thousand times and your CFO starts updating their CV.

📚 Geek Corner
browser-use's core loop is a classic ReAct (Reason-Act) pattern. The agent receives the current page state as text (DOM snapshot plus visible elements), the model generates a thought and action, the action executes, and the new state feeds back in. The @tools.action() decorator lets you extend this with custom actions, which is where it gets genuinely flexible. But the fundamental constraint remains: every cycle is a round-trip to your LLM provider. There's no caching, no replay, no "I've done this before so let me skip the reasoning." Every run pays full price. Their ChatBrowserUse model at $0.20 per million input tokens tries to address this, but you're still paying per step.

Feels like: a really smart intern who needs to phone their mentor before every decision. Reliable? Yes. Expensive at scale? Also yes.

Stagehand: The TypeScript Caching Play

21,800 stars. Browserbase's baby. TypeScript-native. And here's where the economics get interesting, because the Stagehand team figured out something that the browser-use crowd seems to be ignoring.

Caching.

Stagehand v3 is a proper architectural rethink. They ripped out Playwright entirely and went straight to CDP (Chrome DevTools Protocol). The result: 44% faster on shadow DOM and iframe interactions. But speed isn't the headline. The headline is what they do with repeated actions.

When Stagehand encounters a page for the first time, it does the same thing everyone else does: calls the LLM, figures out what to click, where to type, what to extract. But then it caches that mapping. Next time it hits the same page (or a sufficiently similar one), it replays the cached actions without calling the LLM at all. Zero inference cost. Zero latency.

Think about what this means for production workloads. Your agent fills out the same insurance form fifty times a day? The first run costs money. Runs two through fifty are basically free. The amortised cost per task drops off a cliff.
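The amortisation is worth making concrete. With illustrative numbers (a $0.15 first run, near-free cached replays; both figures are assumptions, not measured Stagehand costs):

```python
def amortised_cost(first_run, replay, runs):
    """Average cost per task when run 1 pays for inference and the rest replay."""
    return (first_run + replay * (runs - 1)) / runs

# Fifty runs of the same form-filling task:
# run 1 costs $0.15, each cached replay costs ~$0.001 in compute.
per_task = amortised_cost(0.15, 0.001, 50)   # under half a cent per task
```

Against a framework that pays full inference on every run, the gap widens with every repetition.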

The API surface is dead simple: act() for single actions, extract() for structured data extraction (with Zod schema validation, because TypeScript), and agent() for multi-step flows. The self-healing bit is clever too. If the DOM shifts and a cached action fails, Stagehand automatically re-engages the LLM to figure out the new mapping, caches that, and carries on.

The catch? Stagehand is TypeScript. If your agent stack is Python (and statistically, it probably is), you're either wrapping it in a subprocess, running a sidecar service, or rewriting chunks of your orchestration. Not a deal-breaker, but it's friction.

Also: Stagehand is built by Browserbase. The open-source SDK works locally, but the intended deployment path is their cloud browser infrastructure. The SDK is the on-ramp. The cloud is the toll road. That's fine and transparent and honestly good engineering, but know what you're signing up for.

📚 Geek Corner
The v3 CDP engine is the real technical win here. Playwright was designed as a testing framework that happens to automate browsers. CDP is the raw protocol. By going native, Stagehand eliminated Playwright's abstraction layer and its associated round-trips. Each WebSocket message goes straight to the browser. For iframes and shadow roots (which require frame-scoped routing), this cuts latency nearly in half. The caching layer sits on top: it hashes page structure and action descriptions, stores successful DOM-to-action mappings, and replays them deterministically. When a replay fails (because the site changed), it falls back to LLM inference and updates the cache. It's the same pattern as a JIT compiler: interpret first, compile the hot paths.

Feels like: the developer who writes a script for anything they do more than twice. Lazy in the best possible way.

Skyvern: The Vision Bet

21,000 stars. The one that looks at screenshots instead of parsing HTML.

Skyvern's thesis is that the DOM is a lie. And honestly? They're not entirely wrong. Modern web apps are a jungle of React virtual DOMs, Web Components, shadow roots, iframes-within-iframes, canvas elements, and SVGs doing things SVGs were never meant to do. Parsing all of that reliably is a mug's game.

So Skyvern takes a screenshot. Feeds it to a vision-capable model. The model sees what a human would see: buttons, forms, navigation, text. No DOM parsing needed. No accessibility tree. Just pixels.

The Skyvern 2.0 architecture is a three-phase agent loop: Planner decomposes your objective into sub-goals, Actor executes individual actions on websites, Validator confirms success and triggers replanning if something went wrong. This Planner-Actor-Validator cycle pushed their WebVoyager score from about 45% (v1) to 85.85% (v2). That's a properly impressive jump, and they publish full evaluation traces at eval.skyvern.com, which is more transparency than most competitors offer.
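The Planner-Actor-Validator cycle reduces to a short loop. A sketch under stated assumptions (the three callables are stand-ins, not Skyvern's code):

```python
def run_task(objective, planner, actor, validator, max_replans=3):
    """Plan, act, validate; a failed validation triggers a full replan.

    `planner` decomposes the objective into sub-goals, `actor` attempts each
    one, and `validator` independently confirms the outcome.
    """
    results = []
    for _ in range(max_replans):
        plan = planner(objective)                 # decompose into sub-goals
        results = [actor(goal) for goal in plan]  # execute each sub-goal
        if validator(objective, results):         # independent success check
            return results
    raise RuntimeError("objective not met after replanning")
```

The structural point is that validation is a separate judgment from execution: "the click succeeded" and "the objective was met" are checked by different components, which is exactly the failure mode the Geek Corner below describes.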

Here's where I'd use Skyvern over the DOM-first tools: legacy enterprise portals (those GWT monstrosities from 2008 that HR still makes you use), canvas-heavy applications, sites with aggressively obfuscated markup, anything where the visual layout is the only reliable source of truth. For a standard modern React app with clean semantic HTML? DOM-first tools will be faster and cheaper.

The cost issue is real. Vision model calls are expensive. A screenshot through GPT-4o costs more than a text-based DOM snapshot, and Skyvern needs screenshots at every step. Their cloud product bundles anti-bot detection, proxy rotation, CAPTCHA solving, and parallel execution, which is where the value proposition actually lives. The self-hosted path (pip install or Docker) is available but you're on your own for the infrastructure bits.

📚 Geek Corner
Skyvern's "swarm of agents" architecture is worth understanding. Rather than a single monolithic agent, each phase (Plan, Act, Validate) can run different model configurations. The Planner might use GPT-4o for complex reasoning while the Actor uses GPT-4o-mini for cheaper per-step execution. The Validator sits independently and confirms outcomes, triggering replanning when the Actor's actions didn't actually work (a surprisingly common failure mode in browser automation where "click succeeded" doesn't mean "the thing you wanted to happen actually happened"). The 85.85% WebVoyager score was achieved on cloud browsers with production-representative conditions, not cosy local setups with safe IP addresses.

Feels like: hiring a human contractor to manually test your site. More expensive per task, but they can handle anything you throw at them.

Notte: The Full-Stack Edge Play

1,900 stars. YC S25. SSPL licensed. And here's where you need to pay attention to the fine print.

Notte's pitch is that it's the full-stack version of what everyone else gives you in pieces. Browser infrastructure, agent framework, and deployment runtime, all in one package. The compute runs next to the browser. Zero-latency automation because there's no network hop between your agent logic and the browser instance.

The hybrid approach is smart: use Playwright-compatible scripting for deterministic steps, engage the LLM only when you need reasoning or adaptability. They claim this cuts costs by 50%+ versus pure-agent approaches, and the maths checks out. If half your workflow is "click this specific button, fill this specific field" and only the other half requires actual reasoning, why pay for LLM inference on the predictable bits?
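That cost claim is easy to sanity-check with made-up per-step prices (the $0.02 figure is an assumption for illustration, not Notte's pricing):

```python
def workflow_cost(steps, llm_cost=0.02, script_cost=0.0):
    """Cost of a mixed workflow: 'script' steps run deterministically for
    (roughly) nothing; 'reason' steps pay for LLM inference."""
    return sum(llm_cost if kind == "reason" else script_cost for kind in steps)

pure_agent = workflow_cost(["reason"] * 10)               # every step pays
hybrid = workflow_cost(["script"] * 5 + ["reason"] * 5)   # half scripted
```

If half the steps are scripted, the hybrid workflow costs exactly half the pure-agent one, which is where the 50%+ figure comes from; the saving scales with how much of your workflow is genuinely predictable.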

Their benchmark numbers (79% LLM evaluation accuracy, 47 seconds per task, 96.6% reliability) look solid for an early-stage tool. The Patchright browser backend (a Playwright fork) means you get stealth capabilities out of the box: CAPTCHA solving, anti-detection, proxy rotation.

But. That SSPL license. Server Side Public License. This is the MongoDB license. If you offer Notte's functionality to third parties as a service, you must open-source the entire stack you use to run that service. For internal tools, fine. For a SaaS product that wraps Notte? You need to talk to their commercial licensing team. This isn't a small detail. It changes the entire build-vs-buy calculation for startups.

Bottom line: Notte is doing interesting engineering and the full-stack approach solves real deployment pain. But read the license before you build your business on it.

expect: The Categorical Misfit (and Something I Use Daily)

3,000 stars. FSL-1.1-MIT license (functional source license). And this one is fundamentally different from everything else on this page.

Quick note before I get into it: expect probably belongs in Part 1 more than here. It's closer to a low-level validation tool than a framework. But it also spans both spaces because it orchestrates an agent that drives a browser, which puts it in framework territory. I've stuck it here because the comparison with browser-use and Stagehand is the thing people keep getting wrong, and this felt like the right place to set that straight.

expect is not a browser automation framework. It doesn't help your agent browse the web. It doesn't fill forms or scrape data or navigate portals.

expect is a testing agent that uses a browser. And it's become part of my daily workflow.

The workflow goes like this: you're building a web app. You make changes. You run expect. It reads your git diff, generates a test plan based on what you changed, shows you the plan in an interactive TUI, and then executes those tests in a real browser using Playwright underneath. If a test fails, your coding agent gets the failure report and can attempt to fix the issue. It records videos of every test run for debugging.

The key distinction is the direction of agency. With browser-use or Stagehand, your agent uses a browser to accomplish a task. With expect, a testing agent uses a browser to validate your code. The browser is a verification environment, not an action environment.

It integrates with Claude Code, Copilot, Codex, Gemini CLI, Cursor. The cookie management is clever: it extracts your real browser sessions so tests run with your actual auth state, no mock login flows needed.

I bring this up because people keep comparing expect to browser-use in the same breath, and they're solving completely different problems. It's like comparing a race car to a crash test rig. Both involve cars going fast. The purpose is entirely different. I use expect as the validation step in my agent's closed loop: the agent writes code, expect checks whether the UI actually works, and if it doesn't, the agent gets the failure report and fixes it. That's the "validate" in the research-plan-implement-validate cycle this whole series is about.
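That closed loop is small enough to write down. Both callables here are hypothetical stand-ins (your coding agent on one side, an expect-style test run on the other; expect itself is driven from the CLI, not this API):

```python
def closed_loop(write_code, run_browser_tests, max_iterations=5):
    """Implement-validate cycle: the agent writes code, a testing agent
    validates it in a real browser, and failures feed back as context."""
    feedback = None
    report = {"passed": False, "failures": []}
    for _ in range(max_iterations):
        write_code(feedback)              # implement (or fix, given the report)
        report = run_browser_tests()      # validate against the real UI
        if report["passed"]:
            return report
        feedback = report["failures"]     # close the loop with the failure report
    return report
```

The failure report is the whole point: without it, the agent is guessing whether its UI change actually worked.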

Feels like: a QA engineer who reads your pull request and immediately goes to test the bits you changed. You didn't ask. They just did it. Brilliant.

The Comparison Table Nobody Asked For (But Everyone Needs)

| | browser-use | Stagehand | Skyvern | Notte | expect |
|---|---|---|---|---|---|
| Approach | DOM-first | DOM + cache | Vision-first | Hybrid | Testing agent |
| Language | Python | TypeScript | Python | Python | TypeScript |
| Stars | 85k | 21.8k | 21k | 1.9k | 3k |
| License | MIT | MIT | Apache 2.0 | SSPL | FSL-1.1-MIT |
| Accuracy | 89.1% (WebVoyager) | No public benchmark | 85.85% (WebVoyager) | 79% (LLM eval) | N/A |
| Speed | ~3s/step | 1-3s (cached: <100ms) | Slow (image encoding) | <50ms latency | Depends on agent |
| Token efficiency | Heavy (LLM every step) | Good (cached = zero LLM) | Worst (vision tokens) | Good (LLM only when needed) | Moderate (plan gen) |
| Tokens per 10 steps | ~7k-15k | ~2k-5k (first), ~0 (cached) | ~30k-50k (images) | ~3k-8k | ~5k (plan phase) |
| Cost per task | $0.02-0.30 | $0.01-0.10 (drops on repeat) | $0.10-0.50 | $0.02-0.15 | $0.01-0.05 |
| LLM per step | Always | First run only | Always (vision) | Only for reasoning | Per test gen |
| Self-hosted | Yes | Yes | Yes (Docker) | Yes | Yes |
| Cloud option | browser-use Cloud | Browserbase | Skyvern Cloud | Notte Console | Coming soon |
| Anti-bot | Cloud only | Via Browserbase | Built-in | Built-in | N/A |
| Caching | No | Yes (auto-replay) | Planned | Partial (scripting) | N/A |
| Best for | General agent tasks | Repeated workflows | Legacy/visual sites | Full-stack deploy | Code verification |

Note: Vercel agent-browser is covered in Part 1 alongside the other low-level tools. It's a CLI wrapper around Playwright, not a framework.

The Decision Tree

Start here. What are you actually building?

If your agent needs to test code changes in a browser, stop. That's expect. None of the others do this.

If your agent needs to browse the web and your primary constraint is token budget inside a tool-calling agent, check out Vercel agent-browser in Part 1. It's a low-level CLI tool, not a framework.

If you're building repeated workflows (same sites, same forms, many times a day), Stagehand's caching means your costs approach zero after the first run. If you can stomach TypeScript.

If you need to handle visual-heavy or legacy sites where the DOM is unreliable, Skyvern's vision approach is the pragmatic choice. Pay more per step, get reliability on weird sites.

If you want the largest community and ecosystem with the most examples, browser-use is the safe bet. It's Python, it's everywhere, it works. You'll pay for every step, but at least you know what you're paying for.

If you want everything in one box and the SSPL license doesn't scare you, Notte's full-stack approach eliminates a lot of integration pain.

And if the answer is "I need all of these capabilities in different parts of my system," then congratulations. You've arrived at the same conclusion most production teams reach: you compose multiple tools. Stagehand for repeated workflows, Skyvern for the weird legacy bits, expect for CI validation, and something like agent-browser or dev-browser from Part 1 for the lightweight observation layer. The frameworks are building blocks, not religions.
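The whole decision tree compresses into a small lookup. The requirement labels and return strings here are shorthand invented for this sketch, not anyone's official taxonomy:

```python
def pick_tool(need):
    """Map a primary requirement to a recommendation, per the decision tree
    above. Unknown needs get the composition answer."""
    return {
        "test-my-code":        "expect",
        "token-budget-cli":    "agent-browser (see Part 1)",
        "repeated-workflows":  "Stagehand",
        "legacy-visual-sites": "Skyvern",
        "biggest-ecosystem":   "browser-use",
        "all-in-one":          "Notte (mind the SSPL)",
    }.get(need, "compose multiple tools")
```

The default branch is doing the real work: most production systems don't have one need, so they end up composing.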

What's Coming in Part 3

All of these frameworks need to run somewhere. In production, that means managed browser infrastructure: session pools, proxy rotation, anti-detection, CAPTCHA solving, geographic distribution. Part 3 covers the managed infra layer: Browserbase, Bright Data's Agent Browser, Steel, Hyperbrowser, and the rest of the "browsers-as-a-service" market that's appeared seemingly overnight.

That's where the money is. And where the vendor lock-in lives.

See you there.
