Crawl4AI strips web pages down to the essentials. It is engineered for LLMs, AI agents, and data pipelines that require clean input at high speed. No fluff and no mandatory API keys: just the content you need.
The tool handles geolocation-aware crawling, extracts tables directly into DataFrames, manages a browser pool with warmed-up tabs, and captures network traffic. It also includes full Model Context Protocol (MCP) integration for AI tools.
The output is clean Markdown, optimized for RAG (Retrieval-Augmented Generation) and fine-tuning. The project reports crawls roughly six times faster than traditional scraping methods. Its content filtering strips irrelevant material before it reaches a model, so you don't pay for expensive model calls on data you don't need. With a highly active GitHub repository, it has quickly become a popular choice among developers.
# Install the package
pip install -U crawl4ai
# Or install the latest pre-release
pip install crawl4ai --pre
# Run the post-install setup
crawl4ai-setup
# Verify the installation
crawl4ai-doctor
If you encounter issues with browser dependencies, install Chromium manually:
python -m playwright install --with-deps chromium
Quick Python Test
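import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # The context manager starts the browser and closes it on exit
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        # Print the page content converted to Markdown
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())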
New CLI Option
# Basic crawl to Markdown
crwl https://www.nbcnews.com/business -o markdown
# Deep crawl using BFS (Breadth-First Search), max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
# LLM-powered extraction for a specific query
crwl https://www.example.com/products -q "Extract all product prices"
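To make the "--deep-crawl bfs --max-pages 10" option concrete, here is a minimal sketch of breadth-first traversal with a page cap. It is not Crawl4AI's implementation: the bfs_crawl function and the in-memory LINKS graph are hypothetical stand-ins for real link discovery, used only to show the visiting order.

```python
from collections import deque

# Hypothetical link graph standing in for real pages and their outgoing links.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/d"],
    "/b": ["/d", "/e"],
    "/c": [],
    "/d": ["/"],
    "/e": [],
}

def bfs_crawl(start, get_links, max_pages):
    """Visit pages level by level, stopping once max_pages have been visited."""
    visited = []
    seen = {start}          # pages already queued, so each is visited once
    queue = deque([start])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

if __name__ == "__main__":
    print(bfs_crawl("/", lambda p: LINKS.get(p, []), max_pages=4))
```

Breadth-first order means pages close to the start URL are visited before deeper ones, so a small page cap still covers the top of a site.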
Markdown Generation
Structured Data Extraction
Browser Control
Crawling & Scraping
Deployment
Additional Capabilities
Standard Installation
pip install crawl4ai
crawl4ai-setup
The setup script automates Playwright configuration. If it fails, run the following command:
playwright install
# or
python -m playwright install chromium
Sync Version (Legacy/Deprecated)
pip install "crawl4ai[sync]"
Development Install
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e . # Basic editable installation
Optional feature sets:
pip install -e ".[torch]"
pip install -e ".[transformer]"
pip install -e ".[cosine]"
pip install -e ".[all]"
Docker Deployment
The updated Docker configuration is streamlined, featuring a browser pool, an interactive playground, MCP integration, and multi-architecture support.
docker pull unclecode/crawl4ai:0.6.0-rN
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN
# Access the playground at http://localhost:11235/playground
Testing the Docker API:
import requests

# Submit a crawl job to the local Docker API
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Look up the task by its ID
result = requests.get(f"http://localhost:11235/task/{task_id}")
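A single GET on the task endpoint may arrive before the crawl has finished, so a client will usually poll. The following is a minimal polling sketch, not part of Crawl4AI. It assumes the task response is JSON with a "status" field that ends up as "completed" or "failed"; this article does not confirm that shape, so check the API documentation. The wait_for_task helper and its fetch argument are hypothetical names, and fetch is injected so the loop can be tried without a running server.

```python
import time

def wait_for_task(fetch, timeout=30.0, interval=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it reports a terminal status or the timeout expires.

    ASSUMPTION: fetch() returns a dict with a "status" key whose terminal
    values are "completed" or "failed". The real API may differ.
    """
    deadline = clock() + timeout
    while True:
        task = fetch()
        if task.get("status") in ("completed", "failed"):
            return task
        if clock() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)

if __name__ == "__main__":
    # Simulated responses standing in for GET /task/{task_id}
    responses = iter([
        {"status": "pending"},
        {"status": "pending"},
        {"status": "completed"},
    ])
    print(wait_for_task(lambda: next(responses), interval=0.01))
```

Against a live container, fetch could be something like lambda: requests.get(f"http://localhost:11235/task/{task_id}").json(), using the task_id returned by the POST request.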
Heuristic Markdown with Pruning
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(f"Raw Markdown length: {len(result.markdown.raw_markdown)}")
        print(f"Fit Markdown length: {len(result.markdown.fit_markdown)}")

if __name__ == "__main__":
    asyncio.run(main())
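The gap between the raw and fit Markdown lengths comes from dropping low-value blocks. As a rough illustration of what a fixed threshold such as 0.48 does, the sketch below scores each block and keeps only those at or above the cutoff. The scoring rule here (share of characters that are not link text) is a toy assumption and not the heuristic PruningContentFilter actually uses; link_free_ratio, prune, and the sample blocks are all hypothetical.

```python
def link_free_ratio(text_chars, link_chars):
    """Toy score: share of a block's characters that are not link text."""
    if text_chars == 0:
        return 0.0
    return (text_chars - link_chars) / text_chars

def prune(blocks, threshold=0.48, min_words=0):
    """Keep blocks whose score meets the fixed threshold and word minimum."""
    kept = []
    for block in blocks:
        text = block["text"]
        score = link_free_ratio(len(text), block["link_chars"])
        if score >= threshold and len(text.split()) >= min_words:
            kept.append(text)
    return kept

if __name__ == "__main__":
    # Made-up blocks: a navigation bar (mostly links) and a content paragraph
    blocks = [
        {"text": "Home | Docs | Blog | Login", "link_chars": 20},
        {"text": "This paragraph carries the main content of the page.", "link_chars": 0},
    ]
    print(prune(blocks))
```

Raising the threshold prunes more aggressively; lowering it keeps more of the page at the cost of more noise.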
JavaScript Execution & CSS Extraction (No LLM Required)
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "section_title", "selector": "h3.heading-50", "type": "text"},
            {"name": "section_description", "selector": ".charge-content", "type": "text"},
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {"name": "course_description", "selector": ".course-content-text", "type": "text"},
            {"name": "course_icon", "selector": ".image-92", "type": "attribute", "attribute": "src"}
        ]
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    browser_config = BrowserConfig(headless=False, verbose=True)
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        js_code=["""(async () => {
            const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
            for (let tab of tabs) {
                tab.scrollIntoView();
                tab.click();
                await new Promise(r => setTimeout(r, 500));
            }
        })();"""],
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config
        )
        items = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(items)} items")
        print(json.dumps(items[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
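The schema in the example above follows a simple shape: a "baseSelector" for the repeated container, plus a list of fields that each carry "name", "selector", and "type", with an extra "attribute" key when the type is "attribute". A small pre-flight check can catch typos before a browser is launched. The check_schema helper below is a hypothetical sketch based only on the shape shown in this example; it is not a Crawl4AI function and the library may accept more than it checks for.

```python
def check_schema(schema):
    """Return a list of problems found in a CSS extraction schema dict."""
    problems = []
    if not schema.get("baseSelector"):
        problems.append("missing baseSelector")
    for i, field in enumerate(schema.get("fields", [])):
        # Every field in the example carries these three keys
        for key in ("name", "selector", "type"):
            if key not in field:
                problems.append(f"field {i}: missing {key}")
        # Attribute fields must say which attribute to read
        if field.get("type") == "attribute" and "attribute" not in field:
            problems.append(f"field {i}: type 'attribute' needs an 'attribute' key")
    return problems

if __name__ == "__main__":
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {"name": "course_icon", "selector": ".image-92", "type": "attribute"},
        ],
    }
    for problem in check_schema(schema):
        print(problem)
```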
LLM-Powered Structured Extraction
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Model name")
    input_fee: str = Field(..., description="Input token fee")
    output_fee: str = Field(..., description="Output token fee")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all model names and their input/output token fees. Provide one JSON object per model."""
        ),
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
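Printing the extracted content is enough for a demo, but downstream code usually wants typed records. The sketch below assumes the extracted content is a JSON array of objects with the three fields from the schema above; LLM output can deviate from that, so incomplete rows are skipped rather than trusted. ModelFee and parse_fees are hypothetical names, a standard-library dataclass stands in for the Pydantic model, and the sample string is placeholder data, not real pricing.

```python
import json
from dataclasses import dataclass

@dataclass
class ModelFee:
    model_name: str
    input_fee: str
    output_fee: str

def parse_fees(extracted_content):
    """Parse a JSON array string into ModelFee records, skipping incomplete rows."""
    rows = json.loads(extracted_content)
    fees = []
    for row in rows:
        try:
            fees.append(ModelFee(row["model_name"], row["input_fee"], row["output_fee"]))
        except (KeyError, TypeError):
            continue  # row lacks a required field or is not an object
    return fees

if __name__ == "__main__":
    # Placeholder data for illustration only
    sample = (
        '[{"model_name": "model-a", "input_fee": "$1.00", "output_fee": "$2.00"},'
        ' {"model_name": "model-b"}]'
    )
    for fee in parse_fees(sample):
        print(fee)
```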