Browser Tools for AI Agents Part 4: Skip the Browser, Save 80% on Tokens
Right. Last orders, everyone. Final part. And I'm going to open with something that might sting a bit if you've been feeding raw HTML into your language models like I was six months ago.
You are spending roughly 80% of your token budget on HTML tags that your model cannot use. Not "doesn't use well." Cannot use. The `<div class="sidebar-widget-trending-topics-container-v3">` that wraps your actual content? That's tokens. The seventeen layers of nested `<span>` from whatever React component tree generated the page? Tokens. The inline SVG for the site's logo that gets base64-encoded into the DOM? You guessed it. Tokens. And you're paying for every single one.
This is the part of the series where we put the browser down. Quick reminder on what we're solving for: giving coding agents the right tools for a closed loop of research, build, and validate. The "research" part is where content extraction shines. If your agent just needs to read a page, not click buttons on it, you don't need a browser at all. (We'll cover native app validation for React Native, Capacitor, and Swift in a separate series.)
So the question nobody seems to ask early enough in their agent architecture: do I actually need a browser for this?
The Maths That'll Make You Wince
I ran Cloudflare's own documentation page through two pipelines. Raw HTML: 9,541 tokens. Cleaned markdown: 1,678 tokens. That's an 82% reduction. On a single page. Their blog post about the feature itself? 16,180 tokens as HTML, 3,150 as markdown. Same story. Eighty percent gone.
Now scale it.
An agent fetching 50 pages a day (not unusual for a research agent, a RAG pipeline, a news summariser) is looking at roughly 35 million tokens a month if you're passing raw HTML. At GPT-4o pricing that's real money. The markdown equivalent of those same 50 pages? A fraction. We're talking about saving $60-70 a month on a single agent's browsing habits. Scale to a fleet of agents across a team and you're into four figures annually, burned on `<div>` tags that encode zero semantic information.
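Here's a back-of-the-envelope version of that sum. The per-page token count and the GPT-4o input price in the comments are assumptions for illustration, not measurements:

```python
# Rough monthly cost of feeding pages to a model, HTML vs markdown.
# Assumptions (not measured): ~23,000 tokens for an average raw HTML
# page, the ~80% markdown reduction from this post, and a GPT-4o
# input price of $2.50 per million tokens.

PAGES_PER_DAY = 50
DAYS_PER_MONTH = 30
HTML_TOKENS_PER_PAGE = 23_000   # assumed average page weight
MARKDOWN_REDUCTION = 0.80       # the ~80% figure from the benchmarks
PRICE_PER_MILLION = 2.50        # assumed GPT-4o input price, USD

def monthly_cost(tokens_per_page: float) -> float:
    """Monthly spend for one agent at the assumed fetch rate."""
    tokens = PAGES_PER_DAY * DAYS_PER_MONTH * tokens_per_page
    return tokens / 1_000_000 * PRICE_PER_MILLION

html_cost = monthly_cost(HTML_TOKENS_PER_PAGE)
md_cost = monthly_cost(HTML_TOKENS_PER_PAGE * (1 - MARKDOWN_REDUCTION))

print(f"HTML:     ${html_cost:.2f}/month")
print(f"Markdown: ${md_cost:.2f}/month")
print(f"Saved:    ${html_cost - md_cost:.2f}/month")
```

With those assumptions the saving works out to about $69 a month per agent, which is how the $60-70 figure above falls out.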
| 📚 Geek Corner |
|---|
Why is the reduction so consistent at ~80%? Markdown's syntax is minimal by design. A heading is a hash. A table cell boundary is a single pipe character. Emphasis is a pair of asterisks. HTML expresses the same structure through paired tags laden with attributes, class names, and nesting, so the bulk of a modern page's bytes are scaffolding rather than content. Strip the scaffolding and you're left with roughly a fifth of the original, almost regardless of the site. |
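You can see the overhead gap even in a toy snippet, using character counts as a crude proxy for tokens (the snippet below is invented, and real pages skew further toward markup than this):

```python
# The same content expressed as HTML and as markdown. Character count
# is a rough stand-in for token count, but the ratio tells the story:
# most of the HTML is structural scaffolding, not content.

html = (
    '<div class="post-wrapper"><article class="post">'
    '<h1 class="post-title">Hello</h1>'
    '<div class="post-body"><p>Markdown syntax is <strong>minimal'
    '</strong> by design.</p></div>'
    '</article></div>'
)

markdown = "# Hello\n\nMarkdown syntax is **minimal** by design.\n"

reduction = 1 - len(markdown) / len(html)
print(f"HTML: {len(html)} chars, markdown: {len(markdown)} chars")
print(f"Reduction: {reduction:.0%}")
```

Even this tiny, lightly-classed example lands in the same neighbourhood; pile on real-world wrappers and data attributes and the ratio climbs towards the 80% mark.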
This isn't a micro-optimisation. This is the difference between an agent architecture that scales and one that bankrupts your API budget before you've even got to production.
The Hosted Lot: Let Someone Else Do the Parsing
Two services have emerged as the obvious choices when you want clean markdown without running your own extraction pipeline.
markdown.new is Cloudflare's entry. It converts any URL to markdown using their "Markdown for Agents" infrastructure, which sits at the CDN layer. When a Cloudflare-proxied site receives a request with Accept: text/markdown, the conversion happens at the edge before the response even leaves their network. No browser. No rendering. Just content negotiation at the HTTP level. For sites that aren't on Cloudflare (or haven't enabled the feature), they fall back to their Browser Rendering API, which spins up a headless browser on their infrastructure, not yours.
The free version lets you crawl up to 500 pages from a single domain with configurable depth up to 10 levels. The API is straightforward. Stick the URL in, get markdown out. The response even includes an x-markdown-tokens header telling you exactly how many tokens you're about to feed your model, which is a thoughtful touch for context window planning.
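A sketch of that negotiation path in Python. The Accept: text/markdown request header and the x-markdown-tokens response header are the documented bits; the helper names, the fallback behaviour, and the target URL are mine:

```python
# Content-negotiation sketch: ask a Cloudflare-proxied site for
# markdown directly. Only the two headers are from the docs; the
# rest is a plain urllib sketch, not an official client.
import urllib.request

def is_markdown(content_type: str) -> bool:
    """True if the server actually honoured the markdown request."""
    return content_type.split(";")[0].strip().lower() == "text/markdown"

def fetch_markdown(url: str):
    """Return (body, token-count header) for a markdown-negotiated GET."""
    req = urllib.request.Request(url, headers={"Accept": "text/markdown"})
    with urllib.request.urlopen(req) as resp:
        if not is_markdown(resp.headers.get("Content-Type", "")):
            raise ValueError("site did not serve markdown; fall back to an extractor")
        body = resp.read().decode("utf-8", errors="replace")
        return body, resp.headers.get("x-markdown-tokens")

if __name__ == "__main__":
    # Hypothetical target; any Cloudflare site with the feature enabled works.
    md, tokens = fetch_markdown("https://developers.cloudflare.com/")
    print(f"~{tokens} tokens of markdown received")
```

The Content-Type check matters: a site that isn't opted in will cheerfully return HTML, and you want to notice that before you feed it to a model.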
Currently available on Pro, Business, and Enterprise Cloudflare plans at no extra cost. Rate limits aren't explicitly documented, which either means they're generous or they haven't hit a scale problem yet. My money's on the former given it's Cloudflare.
Jina Reader takes a different approach. Prefix any URL with https://r.jina.ai/ and you get back LLM-friendly markdown. That's it. That's the API. The simplicity is almost offensive. Behind the scenes it's doing proper browser rendering (they built a system handling 10 million requests and 100 billion tokens daily at peak), with JavaScript execution, image captioning via vision models, and PDF extraction thrown in.
Free tier gives you 20 RPM with no API key, or 500 RPM with a free key that comes loaded with 10 million tokens. Paid tiers scale to 5,000 RPM with 500 concurrent requests. Average latency sits around 7.9 seconds for a standard page, 2.5 for search queries through their s.jina.ai endpoint.
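The whole integration, such as it is. The r.jina.ai/ prefix is the real API; the helper names are mine:

```python
# Jina Reader's entire API surface: prefix the target URL.
# jina_url() is trivially pure; the network call is optional.
import urllib.request

JINA_PREFIX = "https://r.jina.ai/"

def jina_url(target: str) -> str:
    """Build the Reader URL for a target page."""
    return JINA_PREFIX + target

def read_page(target: str) -> str:
    """Fetch LLM-friendly markdown for the target page."""
    with urllib.request.urlopen(jina_url(target)) as resp:
        return resp.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(read_page("https://example.com")[:500])
```

Because the prefix trick works anywhere a URL does, the same one-liner works in a browser address bar, a curl command, or an agent's tool definition.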
Quick note on Jina's corporate history since I've seen confusion about this: Jina AI was an independent company founded by Han Xiao, Nan Wang, and Bing He. They raised $39M over two rounds. Elastic completed their acquisition in October 2025, but the Reader product predates that by over a year. The product isn't "from Elastic" any more than Instagram is "from Meta" in terms of its DNA.
Feels like: markdown.new is the fast food drive-through. Your order's ready before you've finished asking. Jina Reader is the proper sit-down restaurant. Takes a bit longer, but they'll handle the weird dietary requirements (JavaScript-heavy pages, PDFs, structured data extraction) that the drive-through can't.
Bottom line: If your target sites are on Cloudflare and have enabled the feature, markdown.new is nearly instant and free. For everything else, Jina Reader is the default choice. I reach for Jina first because the URL-prefix API means I can test it in my browser's address bar, which satisfies the lazy-developer part of my brain that doesn't want to set up curl commands.
Self-Hosted: When You Want to Own the Pipeline
Sometimes you can't (or shouldn't) send your URLs to a third-party service. Compliance. Air-gapped environments. Cost at scale. Whatever the reason, there are solid self-hosted options, and the benchmarks actually tell us which one's best.
Trafilatura is the quiet champion. It's a Python library and CLI tool that consistently tops the extraction benchmarks, and when you weigh precision and recall together, nothing else comes close.
| Tool | F-Score | Precision | Recall |
|---|---|---|---|
| Trafilatura (standard) | 0.909 | 0.914 | 0.904 |
| ReadabiliPy | 0.874 | 0.877 | 0.870 |
| News-Please | 0.808 | 0.898 | 0.734 |
| Readability-lxml | 0.801 | 0.891 | 0.729 |
| Goose3 | 0.793 | 0.934 | 0.690 |
| Newspaper3k | 0.713 | 0.895 | 0.593 |
Look at that recall column. Goose3 has marginally better precision but misses 31% of the actual content. Newspaper3k misses over 40%. Trafilatura finds 90% of the content while keeping precision above 91%. The ScrapingHub benchmark on 640,000 pages confirmed it. The Bevendorff et al. 2023 evaluation confirmed it again. This thing just works.
It outputs to TXT, markdown, CSV, JSON, HTML, XML, even TEI if you're into that. It handles sitemaps, RSS feeds, parallel processing, and metadata extraction (author, dates, categories). The documentation calls it "robust and reasonably fast," which is the kind of British understatement I respect. No JavaScript rendering though. If the content isn't in the initial HTML response, Trafilatura won't find it.
pip install trafilatura and you're away. Five lines of Python to extract any article on the web. It's the kind of tool that makes you wonder why you ever bothered with anything more complex.
Readability is the engine behind Firefox's Reader View. You've used it without knowing. Click that little document icon in Firefox's address bar that strips a page down to just the article? That's Readability. Mozilla open-sourced the JavaScript library, and there's a Python port called readability-lxml.
It's heuristic-based, using hand-crafted rules to identify main content. The approach is well-tested (hundreds of millions of Firefox users are its QA team, effectively) but it can be conservative. It would rather miss a paragraph than include a sidebar, which means it sometimes clips content you actually wanted. F-score of 0.801 in the benchmarks, which is decent but noticeably behind Trafilatura.
The JavaScript version is the canonical one. The Python port is maintained but not always in lockstep. If you're in a Node.js environment, it's the natural choice. If you're in Python, Trafilatura is the better pick unless you specifically need Readability's conservative behaviour.
Defuddle is the new kid, and it's interesting. Built by the team behind Obsidian's Web Clipper (so these folks have thought deeply about content extraction), it positions itself as a more forgiving alternative to Readability. Where Readability's conservatism sometimes strips useful content, Defuddle uses a multi-pass detection system that recovers when initial passes return nothing.
Clever trick: it analyses a page's mobile CSS to identify elements that can be safely removed. Standardises footnotes, code blocks, and maths equations into consistent HTML structures before conversion. MIT-licensed, works in browsers, Node.js, and CLI. The project carries a "work in progress" warning, which I appreciate for its honesty. Not yet benchmarked in the same rigorous studies as Trafilatura and Readability, but the Hacker News thread was overwhelmingly positive, and the Obsidian pedigree means it's been battle-tested on the kind of weird, cluttered pages that trip up other extractors.
The Utilities Drawer
Two more tools that solve adjacent problems and are worth knowing about.
html2text is the purist's choice. It converts HTML to markdown. That's it. No content extraction. No "find the article" heuristics. No boilerplate removal. You give it HTML, it gives you the markdown equivalent of that entire HTML, ads and navigation and cookie banners and all.
This is useful when you've already done the extraction step (maybe with Readability or Defuddle) and just need the format conversion. Or when you're working with HTML you generated yourself and know is clean. Treating it as an extractor will give you exactly the markdown representation of every piece of junk on the page, which rather defeats the purpose.
newspaper4k is the successor to newspaper3k (which hadn't been updated since September 2020, which in Python library years is roughly the Cretaceous period). It's article-focused: downloads pages, extracts the main article content, and layers on NLP for summaries and keyword extraction. F-score of 0.949 with precision at 0.964 and recall at 0.934: precise about what it grabs, and it misses very little, provided the page actually looks like an article. Good for news articles specifically. Less good for documentation pages, forums, or anything that doesn't look like a news article.
The Decision Tree
Here's the bit I wish someone had given me before I spent three weeks running Playwright for pages that didn't need it.
Your agent needs web content. First question: does it need to interact with the page? Click buttons, fill forms, scroll infinite feeds, navigate SPAs? If yes, you need a browser. Full stop. Go back to Parts 1-3 of this series.
If no (and this is more often than you think), next question: is the target site on Cloudflare with Markdown for Agents enabled? If yes, just add Accept: text/markdown to your request headers. Done. Fastest possible path. No extraction library, no third-party service, no browser.
If the site isn't on Cloudflare, or you don't know: is it a one-off or are you building a pipeline? For quick, one-off grabs, prefix with r.jina.ai/ and move on with your life. For a proper pipeline where you want control, self-host Trafilatura. It handles the extraction and the format conversion in one step, and the benchmarks prove it's the best at both.
If you're in a Node.js environment and prefer JavaScript, use Defuddle for extraction, then pipe through html2text (or Defuddle's own markdown output) for conversion.
If you specifically need NLP features (article summaries, keyword extraction, author/date metadata) and your content is news articles, newspaper4k handles the whole chain.
And if you're at scale (thousands of pages per hour), consider Cloudflare's Browser Rendering API as your extraction backend. You get their infrastructure handling the headless browsers, and you just receive markdown. The crawl endpoint handles up to 500 pages per domain with async job tracking.
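The same tree, condensed into a function. The tool names come from this post; the flag names are invented for the sketch:

```python
# The Part 4 decision tree as code. Branch order mirrors the prose;
# the boolean flags are made-up names for this sketch.

def pick_tool(needs_interaction: bool,
              cloudflare_markdown: bool,
              one_off: bool,
              news_with_nlp: bool = False,
              node_environment: bool = False) -> str:
    if needs_interaction:
        return "browser (Parts 1-3)"
    if cloudflare_markdown:
        return "Accept: text/markdown header"
    if one_off:
        return "Jina Reader (r.jina.ai/ prefix)"
    if news_with_nlp:
        return "newspaper4k"
    if node_environment:
        return "Defuddle"
    return "Trafilatura"

print(pick_tool(needs_interaction=False, cloudflare_markdown=False, one_off=True))
```

The point of writing it out is how short it is: most of the branching happens before a browser ever enters the picture.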
| 📚 Geek Corner |
|---|
There's a subtlety in the extraction-vs-browser decision that catches people. Some sites look static but actually load content via JavaScript after the initial page load. A `<div id="content"></div>` that gets populated by a framework after DOMContentLoaded. Your extractor fetches the HTML, sees the empty div, returns nothing useful. This is where tools like Jina Reader have an edge over pure extractors like Trafilatura: Jina renders the page in a real browser before extracting. If you're seeing empty or truncated results from a self-hosted extractor, the page is probably JS-rendered, and you need either a hosted service with browser rendering or your own headless browser feeding HTML to your extractor. |
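A cheap pre-flight check for that empty-shell failure mode can be sketched with the standard-library parser. The heuristic and its threshold are arbitrary, not from any tool:

```python
# Crude check for the "empty shell" failure mode: if a page has almost
# no visible text outside its script tags, it's probably rendered
# client-side and needs a real browser. The threshold is arbitrary.
from html.parser import HTMLParser

class ShellSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_script = 0     # depth inside <script>/<style>
        self.text_chars = 0    # visible text seen so far

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_script += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.in_script:
            self.in_script -= 1

    def handle_data(self, data):
        if not self.in_script:
            self.text_chars += len(data.strip())

def looks_js_rendered(html: str, min_text: int = 200) -> bool:
    """True if the page has too little visible text to bother extracting."""
    sniffer = ShellSniffer()
    sniffer.feed(html)
    return sniffer.text_chars < min_text

shell = '<html><body><div id="app"></div><script>/* big bundle */</script></body></html>'
print(looks_js_rendered(shell))  # a bare shell: almost no visible text
```

Run this before your extractor and you can route JS-rendered pages straight to the browser path instead of retrying a tool that was never going to succeed.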
When Extraction Falls on Its Face
Content extraction is brilliant for the 80% of web content that's articles, documentation, blog posts, and static pages. But it has hard limits.
Single-page applications where the entire page is a JavaScript bundle that renders client-side. Your extractor will get a nearly empty HTML shell. Sites behind authentication or paywalls where you need to log in, handle cookies, maybe solve a CAPTCHA. That's browser territory. Anything requiring interaction: filling search forms, clicking "load more," navigating pagination that's handled by JavaScript rather than URL parameters.
Infinite scroll pages. Dynamically loaded content triggered by viewport intersection observers. WebSocket-driven real-time content. Canvas-rendered data visualisations.
For all of these, you need Parts 1-3 of this series. Playwright for the straightforward stuff. Patchright or Scrapling when stealth matters. The extraction tools in this post are for when the content already exists in the HTML and you just need to get it out clean and cheap.
Wrapping the Whole Series
Four parts. Hundreds of tools. Let me boil it down.
Part 1 was Playwright and Puppeteer. The workhorses. If you need a browser and don't have specific stealth requirements, Playwright is the answer. Cross-browser, auto-waiting, proper debugging tools, Microsoft backing. Puppeteer if you're Chrome-only and prefer the Google ecosystem.
Part 2 was the stealth layer. Patchright (Playwright with anti-detection patches) and Scrapling (Python-native with adaptive fingerprinting). For when sites actively try to detect and block automation. Cloudflare Turnstile, Akamai, PerimeterX, the whole bot-detection industry.
Part 3 covered the emerging alternatives. Stagehand for natural-language browser control ("click the login button" instead of page.click('#btn-login-v3-container > div:nth-child(2) > button')). Browser Use for visual-AI-driven navigation. The new breed that treats browser automation as a language problem rather than a DOM problem.
Part 4, this one, is about not using a browser at all. markdown.new, Jina Reader, Trafilatura, Defuddle. The realisation that for most content retrieval tasks, a browser is overkill and you're paying a literal tax on your tokens for the privilege.
The meta-decision tree across all four parts goes like this. You need web content. Can you get it without a browser? (Check Part 4.) Probably yes for static content. If you do need a browser, is the site trying to block you? No: use Playwright (Part 1). Yes: use Patchright or Scrapling (Part 2). Do you want to drive the browser with natural language instead of selectors? Use Stagehand or Browser Use (Part 3).
If I'm being honest about the distribution: maybe 60-70% of the web scraping tasks I see in agent architectures could use extraction instead of a browser. People reach for Playwright by default because it's what they know, and because "open a browser and get the page" is conceptually simple even when it's computationally expensive. The extraction tools in this post are less intuitive but wildly more efficient for the use cases they cover.
The whole series exists because I got fed up watching agents burn through token budgets and compute resources on problems that had simpler solutions. The browser is a Swiss Army knife. Sometimes you need the whole knife. But if all you need is the blade, stop paying for the corkscrew and the tiny scissors.
Now close the browser. Or rather, don't open one in the first place. That's the whole point.