Crawl4AI is an open-source Python framework built for one job: turning websites into clean, structured data that AI models can actually use. It takes raw HTML, strips the noise, and outputs Markdown or JSON that feeds directly into LLM pipelines, RAG systems, and downstream automation without the usual cleanup overhead.
With over 66,000 GitHub stars and an Apache 2.0 license, it’s one of the most widely adopted tools in the AI data collection space. Unlike managed scraping APIs that charge per request, Crawl4AI runs locally on your infrastructure. You control the extraction logic, the browser behavior, and the proxy configuration.
That last point matters more than most guides acknowledge. Crawl4AI works fine against a handful of pages without any proxy. But production-volume crawling hits the same wall every scraper hits: rate limiting, IP bans, and anti-bot systems that flag automated traffic before your first batch finishes. Crawl4AI has built-in proxy support and rotation strategies specifically because the tool doesn’t function at scale without them.
This guide covers installation, extraction strategies, and proxy configuration with working code throughout.
Installation and Setup
Crawl4AI requires Python 3.9 or higher. Install it with pip install -U crawl4ai, then run crawl4ai-setup to pull browser dependencies.
The setup command installs Playwright’s Chromium browser, which Crawl4AI uses under the hood for rendering JavaScript-heavy pages. If you run into browser issues, install the browser manually with python -m playwright install --with-deps chromium.
Run crawl4ai-doctor to verify your installation. It checks Python version, browser binaries, and core dependencies. All green means you’re ready.
Your First Crawl
The simplest Crawl4AI script:
AsyncWebCrawler launches a headless Chromium instance, loads the page, waits for JavaScript to render, and returns a result object. result.markdown contains the page content converted to Markdown. When a content filter such as BM25 is configured on the Markdown generator, the filtered output drops navigation, footers, ads, and other boilerplate that would waste tokens in an LLM context window.
The result object also exposes result.html for raw HTML, result.links for extracted links, result.media for images and video, and result.metadata for page-level information like title and description.
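A minimal sketch of that first crawl, assuming a recent Crawl4AI release. The summarize helper and the example.com URL are illustrative and not part of the library. The crawler import sits inside main() so the helper can be tried without a browser installed.

```python
def summarize(result):
    """Collect a few of the result fields described above into a dict."""
    links = result.links or {}
    metadata = result.metadata or {}
    return {
        "title": metadata.get("title"),
        "markdown_chars": len(str(result.markdown or "")),
        "internal_links": len(links.get("internal", [])),
        "external_links": len(links.get("external", [])),
    }


async def main(url="https://example.com"):
    # Imported here so summarize() stays usable without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        if not result.success:
            raise RuntimeError(result.error_message)
        return summarize(result)
```

To run it, call asyncio.run(main()) after importing asyncio.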
For quick one-off jobs, the crwl command-line tool works, for example crwl https://example.com -o markdown.
For production pipelines, stick with the Python API. That’s where you get extraction strategies, proxy rotation, and session management.
Extraction Strategies
Crawl4AI supports three extraction approaches, each suited to different situations.
CSS and XPath Schema Extraction
When you know the page structure, CSS and XPath selectors are the fastest and most reliable method. Define a schema mapping selectors to field names, and Crawl4AI returns structured JSON:
This works well for e-commerce sites, directories, and anything with a repeating layout. It’s also the cheapest extraction method since it doesn’t require an LLM API call.
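A sketch of schema-based extraction, assuming a product listing page. The selectors and field names in PRODUCT_SCHEMA are hypothetical and would need to match the real page. The import path for JsonCssExtractionStrategy may differ between Crawl4AI releases.

```python
import json

# Hypothetical selectors for an imagined product listing page.
PRODUCT_SCHEMA = {
    "name": "Products",
    "baseSelector": "div.product-card",  # one match per repeated item
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}


async def extract_products(url):
    # Imported here so the schema can be inspected without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(PRODUCT_SCHEMA),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
    if not result.success:
        return []
    # extracted_content is a JSON string with one object per baseSelector match.
    return json.loads(result.extracted_content)
```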
LLM-Based Extraction
For pages with inconsistent or complex layouts, Crawl4AI can hand the content to an LLM and extract structured data based on a natural-language prompt. It supports OpenAI, Anthropic, DeepSeek, Groq, and any provider compatible with the LiteLLM interface.
The LLM receives cleaned Markdown rather than raw HTML, which cuts token usage and improves accuracy. It costs more per page than CSS selectors, but it handles messy layouts that would otherwise need constant selector patching. If a site rearranges its product cards every other week, LLM extraction keeps working while your CSS schema breaks.
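A sketch of LLM-based extraction under a few assumptions: the provider string, the OPENAI_API_KEY variable, and the schema fields are placeholders, and the LLMConfig and LLMExtractionStrategy parameter names follow recent Crawl4AI releases and may differ in older ones.

```python
import json
import os

# JSON Schema for the fields we want back; the field names are illustrative.
PRODUCT_JSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
    },
    "required": ["name", "price"],
}

INSTRUCTION = "Extract every product on the page with its name and price."


async def extract_with_llm(url, provider="openai/gpt-4o-mini"):
    # Imported here so the schema can be inspected without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
    from crawl4ai.extraction_strategy import LLMExtractionStrategy

    strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider=provider,
            api_token=os.environ["OPENAI_API_KEY"],  # assumed variable name
        ),
        schema=PRODUCT_JSON_SCHEMA,
        extraction_type="schema",
        instruction=INSTRUCTION,
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
    return json.loads(result.extracted_content) if result.success else []
```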
Cosine Similarity and BM25 Filtering
Crawl4AI’s content filters can use BM25 scoring to identify the blocks on a page most relevant to a query. You can also apply cosine similarity filtering to extract only content semantically related to a specific query, which is useful when collecting AI training data where relevance matters more than completeness. Both run locally with no API calls, keeping them viable for high-volume crawls where per-page LLM costs would add up fast.
Configuring Proxies
This is where most Crawl4AI deployments either work or fall apart. A single-page test against a cooperative site needs no proxy. A production crawl hitting hundreds of pages across multiple domains will get blocked without one.
The reasons are the same ones covered in any web scraping proxy guide: sites enforce rate limits per IP, anti-bot systems flag non-human traffic patterns, and your crawler’s single IP becomes a liability at any meaningful volume.
Crawl4AI handles proxy configuration through ProxyConfig in CrawlerRunConfig, giving you per-crawl control over routing.
Basic Setup
The simplest configuration:
You can also pass the proxy as a string or dictionary:
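A sketch of both forms, assuming a recent Crawl4AI release where ProxyConfig is importable from the top-level package. The proxy address is a placeholder.

```python
PROXY_SERVER = "http://proxy.example.com:8080"  # placeholder address

# Dictionary form of the same setting.
PROXY_DICT = {"server": PROXY_SERVER}


async def crawl_through_proxy(url, use_dict=False):
    # Imported here so the settings can be inspected without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, ProxyConfig

    proxy = PROXY_DICT if use_dict else ProxyConfig(server=PROXY_SERVER)
    config = CrawlerRunConfig(proxy_config=proxy)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=url, config=config)
```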
Authenticated Proxies
Most commercial proxy services require credentials:
ProxyConfig.from_string() handles inline credential formats too:
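A sketch of an authenticated setup that keeps credentials out of source code. The environment variable names PROXY_SERVER, PROXY_USERNAME, and PROXY_PASSWORD are assumptions, not something Crawl4AI defines.

```python
import os


def proxy_settings_from_env(environ=os.environ):
    """Read the proxy server and optional credentials from the environment."""
    settings = {"server": environ["PROXY_SERVER"]}
    for key, var in (("username", "PROXY_USERNAME"), ("password", "PROXY_PASSWORD")):
        if environ.get(var):
            settings[key] = environ[var]
    return settings


def build_run_config(environ=os.environ):
    # Imported here so the helper above works without crawl4ai installed.
    from crawl4ai import CrawlerRunConfig, ProxyConfig

    # ProxyConfig.from_string("192.168.1.1:8080:user:pass") is the one-line
    # alternative when the credentials are already part of the proxy string.
    return CrawlerRunConfig(proxy_config=ProxyConfig(**proxy_settings_from_env(environ)))
```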
Supported Formats
ProxyConfig.from_string() accepts:
| Format | Example |
|---|---|
| HTTP | http://user:pass@proxy.example.com:8080 |
| HTTPS | https://proxy.example.com:8080 |
| SOCKS5 | socks5://proxy.example.com:1080 |
| IP:port | 192.168.1.1:8080 |
| IP:port:user:pass | 192.168.1.1:8080:user:pass |
One caveat: Playwright does not support SOCKS5 proxies with authentication. If your provider requires credentials, use HTTP. This is a Playwright limitation, not a Crawl4AI one.
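To show how these formats reduce to the same three parts, here is an illustrative parser in plain Python. It is a stand-in for explanation, not the code behind ProxyConfig.from_string(). Treating a bare IP:port as HTTP is an assumption.

```python
from urllib.parse import urlsplit


def parse_proxy(value):
    """Normalize a proxy string into server, username, and password."""
    if "://" in value:
        parts = urlsplit(value)
        if parts.hostname is None or parts.port is None:
            raise ValueError(f"proxy URL needs a host and port: {value!r}")
        server = f"{parts.scheme}://{parts.hostname}:{parts.port}"
        username, password = parts.username, parts.password
    else:
        fields = value.split(":")
        if len(fields) not in (2, 4):
            raise ValueError(f"unrecognized proxy format: {value!r}")
        server = f"http://{fields[0]}:{fields[1]}"  # assumed default scheme
        username, password = (fields[2], fields[3]) if len(fields) == 4 else (None, None)
    # Mirror the caveat above: reject SOCKS5 combined with credentials early.
    if server.startswith("socks5://") and username:
        raise ValueError("SOCKS5 with credentials is not supported by Playwright")
    return {"server": server, "username": username, "password": password}
```

Failing fast on SOCKS5 with credentials surfaces the limitation at configuration time instead of as a connection error in the middle of a crawl.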
Rotation
A single proxy IP gets flagged eventually, regardless of quality. Rotation distributes requests across multiple addresses so no individual IP draws enough attention to trigger blocks.
Crawl4AI ships with RoundRobinProxyStrategy:
For larger pools, load from environment variables:
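A sketch of rotation with the pool loaded from the environment. It assumes a comma-separated PROXIES variable, which is a naming choice made here, and a recent Crawl4AI release where RoundRobinProxyStrategy is importable from the top-level package.

```python
import os


def proxy_strings_from_env(var="PROXIES", environ=os.environ):
    """Split a comma-separated environment variable into proxy strings."""
    return [p.strip() for p in environ.get(var, "").split(",") if p.strip()]


async def crawl_with_rotation(urls):
    # Imported here so the helper above works without crawl4ai installed.
    from crawl4ai import (
        AsyncWebCrawler,
        CrawlerRunConfig,
        ProxyConfig,
        RoundRobinProxyStrategy,
    )

    proxies = [ProxyConfig.from_string(p) for p in proxy_strings_from_env()]
    config = CrawlerRunConfig(
        proxy_rotation_strategy=RoundRobinProxyStrategy(proxies),
    )
    async with AsyncWebCrawler() as crawler:
        # Each URL takes the next proxy in the list, wrapping around at the end.
        return await crawler.arun_many(urls, config=config)
```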
Choosing the Right Proxy Type
The type of proxy matters as much as whether you use one.
| Proxy Type | IP Source | Trust Level | Best For |
|---|---|---|---|
| Datacenter | Hosting providers | Low | Low-protection targets, speed-priority jobs |
| Residential | Real ISP connections | Medium-High | General-purpose scraping, most sites |
| Mobile (4G/5G) | Carrier networks via CGNAT | Highest | Sites with aggressive bot protection |
Datacenter proxies are cheap and fast, but their IPs are registered to hosting companies. Any competent anti-bot system (Cloudflare, DataDome, Akamai) flags them on sight.
Residential proxies use IPs assigned by real ISPs to home internet connections, giving them higher trust scores with target sites. They handle most scraping targets without issues and are the solid middle-ground choice.
Mobile proxies are the hardest for anti-bot systems to act against. They use real 4G and 5G carrier IPs shared among thousands of legitimate users through CGNAT (Carrier-Grade NAT). When a site sees traffic from a mobile IP, blocking it risks cutting off every real user sharing that address. That trade-off is what makes them effective against even the most aggressive protection layers, and the best option for Crawl4AI deployments hitting well-defended sites.
Anti-Bot Detection and Proxy Escalation
Crawl4AI v0.8.5 introduced a multi-tier fallback system that automatically escalates when requests get blocked.
After each crawl attempt, Crawl4AI inspects the response for known anti-bot signals: Cloudflare challenge pages, 403/429 status codes, firewall blocks from Imperva, Sucuri, and similar services. If blocking is detected, escalation begins.
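The kind of check described above can be pictured with a small heuristic. This is an illustration only, not Crawl4AI’s detection code, and the marker strings are assumptions that would need tuning against real block pages.

```python
BLOCK_STATUS_CODES = {403, 429}

# Illustrative markers only; real challenge pages vary by vendor and over time.
BLOCK_MARKERS = ("just a moment...", "attention required", "access denied")


def looks_blocked(status_code, html):
    """Return True when a response resembles an anti-bot block."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    # Challenge pages usually identify themselves near the top of the document.
    head = (html or "")[:5000].lower()
    return any(marker in head for marker in BLOCK_MARKERS)
```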
The first tier retries through your proxy list in order. The recommendation is to sort proxies cheapest-first: datacenter before residential, residential before mobile. Your most expensive IPs only fire when cheaper ones have already failed.
The second tier repeats the full proxy rotation for additional rounds, controlled by max_retries in CrawlerRunConfig. For a list of three proxies with max_retries=2, that’s one initial pass plus two repeat rounds, or 3 × 3 = 9 total attempts, before the system gives up or moves on.
The third tier is a fallback function: a custom async function you provide that receives the URL and returns raw HTML as a last resort. You might use it to hit an external scraping API, pull from cache, or try an alternative source entirely.
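The three tiers can be sketched as a plain async loop. This models the behavior described above and is not Crawl4AI’s code: fetch and fallback are hypothetical callables supplied by the caller, and fetch is assumed to return None when a request is blocked.

```python
async def fetch_with_escalation(url, proxies, fetch, max_retries=0, fallback=None):
    """Try each proxy in order, repeat for extra rounds, then use the fallback.

    Returns (html, attempts). html is None if every tier failed.
    """
    attempts = 0
    # Tier 1 is the first pass; tier 2 adds max_retries further passes.
    for _round in range(1 + max_retries):
        for proxy in proxies:  # list is assumed sorted cheapest-first
            attempts += 1
            html = await fetch(url, proxy)
            if html is not None:
                return html, attempts
    # Tier 3: last resort, for example a cache lookup or an external API.
    if fallback is not None:
        return await fallback(url), attempts
    return None, attempts
```

With three proxies and max_retries=2, a URL that is blocked everywhere costs nine proxy attempts before the fallback runs, matching the arithmetic above.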
Combined with Crawl4AI’s other anti-detection features (user-agent randomization, viewport variation, Shadow DOM flattening), the escalation system means fewer dead requests and more complete datasets. If the sites you’re targeting also deploy CAPTCHAs, you’ll need a CAPTCHA solving service on top of this. Proxies handle IP-based blocks. CAPTCHA solvers handle the challenge layer. Different problems, different tools.
Production Configuration
A few settings make a noticeable difference at scale.
Speed and Resource Usage
If you only need text content, skip everything else:
text_mode disables image loading. Stripping ads and CSS cuts additional bandwidth. The per-page savings are small, but they compound fast across thousands of pages.
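A sketch of a text-only browser setup, assuming the text_mode and light_mode options of BrowserConfig in recent releases. Which other resources get skipped depends on the Crawl4AI version, so treat this as a starting point.

```python
# Browser-level settings for text-only crawling.
TEXT_ONLY_BROWSER = {
    "headless": True,
    "text_mode": True,   # skip images and other heavy content
    "light_mode": True,  # turn off some background browser features
}


async def crawl_text_only(url):
    # Imported here so the settings can be inspected without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    async with AsyncWebCrawler(config=BrowserConfig(**TEXT_ONLY_BROWSER)) as crawler:
        result = await crawler.arun(url=url)
        return str(result.markdown) if result.success else ""
```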
Handling Dynamic Content
Sites that load content after the initial page render need special handling:
scan_full_page scrolls the entire page to trigger lazy-loaded content. Without it, you’ll miss anything below the fold. If you’ve ever looked at scraped output and wondered why half the data was missing, this was probably the reason.
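A sketch of a run configuration for lazy-loaded pages. The selector in wait_for is hypothetical, and the delay and timeout values are starting points to adjust per site.

```python
DYNAMIC_PAGE_SETTINGS = {
    "scan_full_page": True,            # scroll through the page to trigger lazy loading
    "scroll_delay": 0.5,               # seconds to pause between scroll steps
    "wait_for": "css:.product-card",   # hypothetical selector that signals content is ready
    "page_timeout": 60000,             # milliseconds
}


async def crawl_dynamic(url):
    # Imported here so the settings can be inspected without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    config = CrawlerRunConfig(**DYNAMIC_PAGE_SETTINGS)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url=url, config=config)
```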
Session and Identity Management
For sites that require login or maintain state, persistent context keeps cookies and session data between crawl runs:
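A sketch of a persistent browser profile, assuming the use_persistent_context and user_data_dir options of BrowserConfig. The profile_dir helper and the directory layout under the home folder are choices made for this example.

```python
from pathlib import Path


def profile_dir(name, root=None):
    """Return (and create) a browser profile directory for one site."""
    base = Path(root) if root else Path.home() / ".crawl4ai_profiles"
    path = base / name
    path.mkdir(parents=True, exist_ok=True)
    return path


async def crawl_logged_in(url, profile="example-site"):
    # Imported here so profile_dir() works without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    browser = BrowserConfig(
        headless=True,
        use_persistent_context=True,
        user_data_dir=str(profile_dir(profile)),  # cookies and storage live here
    )
    async with AsyncWebCrawler(config=browser) as crawler:
        return await crawler.arun(url=url, config=browser and None)
```

Keeping one directory per site keeps cookies for different targets from mixing.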
Pair this with identity settings that match your proxy’s geography:
A US-based mobile proxy reporting ja-JP as the browser locale is the kind of mismatch that anti-bot systems pick up on immediately. Small detail, easy to overlook, but it’s the difference between a crawler that works and one that gets flagged on the second request.
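One way to avoid that mismatch is to derive the browser identity from the proxy’s exit country. The mapping below is a hypothetical starting point: one timezone per country is a simplification, and locale and timezone_id are assumed to be CrawlerRunConfig options in your installed release.

```python
# Hypothetical mapping from proxy exit country to browser identity.
IDENTITY_BY_COUNTRY = {
    "US": {"locale": "en-US", "timezone_id": "America/New_York"},
    "DE": {"locale": "de-DE", "timezone_id": "Europe/Berlin"},
    "JP": {"locale": "ja-JP", "timezone_id": "Asia/Tokyo"},
}


def identity_for(country):
    """Return locale and timezone settings that match a proxy country code."""
    try:
        return dict(IDENTITY_BY_COUNTRY[country.upper()])
    except KeyError:
        raise ValueError(f"no identity profile for {country!r}") from None


def build_config(country):
    # Imported here so identity_for() works without crawl4ai installed.
    from crawl4ai import CrawlerRunConfig

    return CrawlerRunConfig(**identity_for(country))
```

Raising on an unknown country is deliberate: a missing profile should stop the crawl instead of silently falling back to a locale that contradicts the proxy.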
Crawl4AI vs Firecrawl
Both convert web content into LLM-ready formats, but the architecture is fundamentally different.
Crawl4AI is local-first. You run it on your own machine, manage browser instances and proxy configuration yourself, and pay nothing per request. The tradeoff is infrastructure overhead. Firecrawl is API-first, designed around a hosted service where you send URLs and get structured data back. You can self-host it, but the design centers on a centralized API with Docker.
If you need granular control over browser behavior, proxy rotation, and extraction logic, Crawl4AI is the better fit. If you want a managed pipeline without the infrastructure burden, Firecrawl is worth evaluating. For a broader look at the space, see the best AI web scrapers roundup.