Crawl4AI: Fast LLM-Ready Web Scraping Without the Bloat

Published June 29 in Network Tools

Crawl4AI strips web pages down to the essentials. It is engineered for LLMs, AI agents, and data pipelines that require clean input at high speed. No fluff, no mandatory API keys—just the content you need.

The tool handles geolocation-aware crawling, extracts tables directly into DataFrames, manages a browser pool with warmed-up tabs, and captures network traffic. It also includes full Model Context Protocol (MCP) integration for AI tools.

The output is high-quality Markdown, optimized specifically for RAG (Retrieval-Augmented Generation) and fine-tuning. Performance tests reportedly show crawls finishing roughly six times faster than with traditional scraping methods. Content filters, such as the pruning and BM25 filters described later, strip irrelevant text before it reaches the model, so you do not pay for expensive model calls on noise. Backed by a highly active GitHub repository, it has quickly become a popular choice among developers.

Installation

# Install the package
pip install -U crawl4ai

# Or install the latest pre-release
pip install crawl4ai --pre

# Run the post-install setup
crawl4ai-setup

# Verify the installation
crawl4ai-doctor

If you encounter issues with browser dependencies, install Chromium manually:

python -m playwright install --with-deps chromium

Quick Python Test

import asyncio
from crawl4ai import AsyncWebCrawler  # explicit import instead of a wildcard

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

New CLI Option

# Basic crawl to Markdown
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl using BFS (Breadth-First Search), max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# LLM-powered extraction for a specific query
crwl https://www.example.com/products -q "Extract all product prices"
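
To picture what a breadth-first deep crawl with a page cap does, here is a minimal sketch. It is not Crawl4AI's implementation: `bfs_crawl`, the `get_links` callback, and the in-memory `SITE` link graph are hypothetical stand-ins, chosen so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(start: str,
              get_links: Callable[[str], Iterable[str]],
              max_pages: int = 10) -> list[str]:
    """Visit pages level by level and stop after max_pages visits."""
    visited: list[str] = []
    seen = {start}            # URLs already queued, to avoid revisiting
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for real pages.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/docs/api", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/install": [],
    "/docs/api": [],
    "/blog/post-1": [],
}

if __name__ == "__main__":
    print(bfs_crawl("/", lambda u: SITE.get(u, []), max_pages=4))
```

Because the queue is first-in, first-out, every page one click from the start is visited before any page two clicks away, which is why a small page cap tends to cover a site's top-level sections first.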

Core Features

Markdown Generation

  • Generates clean, structured Markdown with precise formatting.
  • Fit Markdown: Uses heuristics to strip noise, leaving only AI-ready text.
  • Automatically converts page links into numbered citations.
  • Employs BM25 filtering to isolate the core content signal.
  • Allows you to write and implement your own generation strategies.
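
As a rough illustration of what BM25 filtering means, the sketch below scores text chunks against a query with the standard Okapi BM25 formula and keeps the chunks above a cutoff. This is an assumption-laden sketch, not the library's code: `tokenize`, `bm25_scores`, `bm25_filter`, the `threshold` value, and the sample chunks are all made up for the example, and `k1=1.5`, `b=0.75` are just the conventional defaults.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, chunks: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one Okapi BM25 score per chunk for the given query."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def bm25_filter(query: str, chunks: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only the chunks whose BM25 score reaches the threshold."""
    return [c for c, s in zip(chunks, bm25_scores(query, chunks)) if s >= threshold]


CHUNKS = [
    "Subscribe to our newsletter and follow us on social media",
    "GPU prices fell sharply this quarter as supply improved",
    "Cookie settings privacy policy terms of use",
]

if __name__ == "__main__":
    print(bm25_filter("gpu prices", CHUNKS))
```

Chunks that share no terms with the query score zero, so navigation text and cookie banners drop out while the on-topic paragraph survives.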

Structured Data Extraction

  • Compatible with any LLM, whether open-source or proprietary.
  • Flexible chunking strategies: topic-based, regex, or sentence-level.
  • Uses cosine similarity to find content that matches your specific query.
  • Supports CSS selectors and XPath for high-speed, pattern-based extraction.
  • Define custom JSON schemas to extract repeating elements accurately.
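
The combination of sentence-level chunking and cosine similarity can be sketched in a few lines. Note the simplification: this sketch compares plain word-count vectors, whereas a real system would more likely compare embeddings. The helper names (`sentence_chunks`, `bag_of_words`, `cosine_similarity`, `best_chunk`) and the sample text are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_chunks(text: str) -> list[str]:
    """Split text after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_chunk(query: str, text: str) -> tuple[float, str]:
    """Return (score, sentence) for the sentence most similar to the query."""
    q = bag_of_words(query)
    scored = [(cosine_similarity(q, bag_of_words(c)), c)
              for c in sentence_chunks(text)]
    return max(scored, key=lambda pair: pair[0])


TEXT = ("Shipping takes three days. "
        "The basic plan costs ten dollars per month. "
        "Contact support for refunds.")

if __name__ == "__main__":
    score, sentence = best_chunk("plan costs per month", TEXT)
    print(f"{score:.3f} {sentence}")
```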

Browser Control

  • Use custom browser profiles to bypass bot detection.
  • Connect via Chrome DevTools Protocol (CDP) for remote extraction.
  • Support for persistent profiles with saved logins and cookies.
  • Advanced session management for multi-step crawling workflows.
  • Proxy support with authentication.
  • Fully adjustable headers, user agents, and cookies.
  • Supports Chromium, Firefox, and WebKit.
  • Viewport auto-adjustment ensures all page elements are captured.

Crawling & Scraping

  • Extracts images, audio, video, and responsive srcset attributes.
  • Executes JavaScript and waits for asynchronous content to load.
  • Captures screenshots during the crawl process.
  • Handles raw HTML and local files.
  • Identifies and extracts internal, external, and iframe links.
  • Offers custom hooks at every stage of the crawl.
  • Built-in caching to prevent redundant fetches.
  • Comprehensive metadata extraction.
  • Automated lazy-load handling and infinite scroll support.
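
The idea behind caching to prevent redundant fetches can be shown with a small in-memory sketch. Everything here is illustrative: `PageCache`, its `bypass` flag, and the `fake_fetch` function are hypothetical, and the real library's cache storage and key scheme may differ.

```python
import hashlib
from typing import Callable


class PageCache:
    """In-memory page cache keyed by a SHA-256 hash of the URL."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(url: str) -> str:
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def fetch(self, url: str, fetcher: Callable[[str], str],
              bypass: bool = False) -> str:
        """Return cached HTML when available; otherwise call fetcher.

        With bypass=True the cache is neither read nor written.
        """
        if bypass:
            return fetcher(url)
        key = self._key(url)
        if key not in self._store:
            self._store[key] = fetcher(url)
        return self._store[key]


if __name__ == "__main__":
    calls: list[str] = []

    def fake_fetch(url: str) -> str:
        # Stand-in for a real network request.
        calls.append(url)
        return f"<html>{url}</html>"

    cache = PageCache()
    cache.fetch("https://example.com", fake_fetch)
    cache.fetch("https://example.com", fake_fetch)   # served from the cache
    cache.fetch("https://example.com", fake_fetch, bypass=True)
    print(f"network fetches: {len(calls)}")
```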

Deployment

  • Dedicated Docker image with a FastAPI server included.
  • JWT authentication for secured API access.
  • One-click deployment with a token-based API gateway.
  • Architected for horizontal scaling.
  • Production-ready cloud configurations.
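
For readers unfamiliar with JWT-secured APIs, the sketch below shows the general mechanism: the server signs a small JSON payload with a secret, and later requests are accepted only if the signature checks out and the token has not expired. It uses HS256 with the standard library only. The function names, the secret, and the payload are hypothetical, and nothing here describes how the Crawl4AI server issues or validates its own tokens.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding and decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()


def sign_token(payload: dict, secret: str) -> str:
    """Build header.payload.signature using HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token: str, secret: str) -> Optional[dict]:
    """Return the payload if the signature is valid and unexpired, else None."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header_b64, payload_b64, signature_b64 = parts
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        return None
    payload = json.loads(_b64url_decode(payload_b64))
    if payload.get("exp", float("inf")) < time.time():
        return None
    return payload


if __name__ == "__main__":
    token = sign_token({"sub": "user@example.com", "exp": int(time.time()) + 60},
                       "demo-secret")
    print(verify_token(token, "demo-secret"))
    print(verify_token(token, "wrong-secret"))
```

A client would typically send such a token in an `Authorization: Bearer <token>` header. In production, a maintained JWT library is a safer choice than hand-rolled code like this.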

Additional Capabilities

  • Stealth mode to mimic human browsing behavior.
  • Tag-based targeting for specific content areas.
  • Detailed link analysis.
  • Robust error handling and logging.
  • CORS support and static file serving.
  • Clear, developer-focused documentation.

Installation Options

Standard Installation

pip install crawl4ai
crawl4ai-setup

The setup script automates Playwright configuration. If it fails, run one of the following commands:

playwright install
# or
python -m playwright install chromium

Sync Version (Legacy/Deprecated)

pip install "crawl4ai[sync]"        # Quotes keep shells such as zsh from expanding the brackets

Development Install

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .                    # Basic editable installation

Optional feature sets:

pip install -e ".[torch]"
pip install -e ".[transformer]"
pip install -e ".[cosine]"
pip install -e ".[all]"

Docker Deployment

The updated Docker configuration is streamlined, featuring a browser pool, an interactive playground, MCP integration, and multi-architecture support.

# "rN" is a placeholder: replace N with a published revision number
docker pull unclecode/crawl4ai:0.6.0-rN
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN
# Access the playground at http://localhost:11235/playground

Testing the Docker API:

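import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
response.raise_for_status()  # Fail early on an HTTP error
task_id = response.json()["task_id"]

# Fetch the task; repeat this request until the task has finished
result = requests.get(f"http://localhost:11235/task/{task_id}")
print(result.json())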
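
Since a crawl task takes time to finish, the task endpoint usually has to be polled. The helper below is a minimal sketch of such a loop. It takes the status lookup as a callback so it can be tried without a running server. The name `wait_for_task`, the `"status"` field, and the `"completed"` / `"failed"` values are assumptions about the response shape, not confirmed API details.

```python
import time
from typing import Callable


def wait_for_task(get_status: Callable[[], dict],
                  interval: float = 1.0,
                  timeout: float = 60.0) -> dict:
    """Call get_status repeatedly until the task finishes or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        body = get_status()
        # Assumed terminal states; adjust to the server's actual values.
        if body.get("status") in ("completed", "failed"):
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        time.sleep(interval)


if __name__ == "__main__":
    # Canned responses standing in for successive HTTP replies.
    replies = iter([
        {"status": "pending"},
        {"status": "processing"},
        {"status": "completed", "result": "ok"},
    ])
    print(wait_for_task(lambda: next(replies), interval=0.0, timeout=5.0))
```

Against a live container, the callback could be `lambda: requests.get(f"http://localhost:11235/task/{task_id}").json()`.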

Usage Snippets

Heuristic Markdown with Pruning

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(f"Raw Markdown length: {len(result.markdown.raw_markdown)}")
        print(f"Fit Markdown length: {len(result.markdown.fit_markdown)}")

if __name__ == "__main__":
    asyncio.run(main())

JavaScript Execution & CSS Extraction (No LLM Required)

import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "section_title", "selector": "h3.heading-50", "type": "text"},
            {"name": "section_description", "selector": ".charge-content", "type": "text"},
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {"name": "course_description", "selector": ".course-content-text", "type": "text"},
            {"name": "course_icon", "selector": ".image-92", "type": "attribute", "attribute": "src"}
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    browser_config = BrowserConfig(headless=False, verbose=True)
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        js_code=["""
            (async () => {
                const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
                for (let tab of tabs) {
                    tab.scrollIntoView();
                    tab.click();
                    await new Promise(r => setTimeout(r, 500));
                }
            })();
        """],
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config
        )
        items = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(items)} items")
        print(json.dumps(items[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
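
To see how a schema with a base selector and a list of fields turns repeated page elements into JSON-like records, consider this standard-library sketch. It deliberately swaps techniques: it uses the limited XPath support in `xml.etree.ElementTree` on a small well-formed snippet instead of CSS selectors on real HTML. `extract_items`, `PAGE`, and `SCHEMA` are hypothetical and only mirror the shape of the schema above.

```python
import xml.etree.ElementTree as ET


def extract_items(markup: str, schema: dict) -> list[dict]:
    """Apply each field selector inside every element matched by baseSelector."""
    root = ET.fromstring(markup.strip())
    items = []
    for base in root.findall(schema["baseSelector"]):
        item = {}
        for field in schema["fields"]:
            node = base.find(field["selector"])
            if node is None:
                continue  # Field missing in this element: skip it
            if field["type"] == "attribute":
                item[field["name"]] = node.get(field["attribute"])
            else:  # "text": concatenate all nested text
                item[field["name"]] = "".join(node.itertext()).strip()
        items.append(item)
    return items


PAGE = """
<section>
  <div class="course"><h3>Python Basics</h3><img src="/icons/python.png" /></div>
  <div class="course"><h3>Web Scraping</h3><img src="/icons/web.png" /></div>
</section>
"""

SCHEMA = {
    "name": "Courses",
    "baseSelector": ".//div[@class='course']",
    "fields": [
        {"name": "course_name", "selector": "h3", "type": "text"},
        {"name": "course_icon", "selector": "img",
         "type": "attribute", "attribute": "src"},
    ],
}

if __name__ == "__main__":
    for item in extract_items(PAGE, SCHEMA):
        print(item)
```

The base selector picks out each repeating container, and every field selector is evaluated relative to that container, so one schema yields one record per matched element.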

LLM-Powered Structured Extraction

import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Model name")
    input_fee: str = Field(..., description="Input token fee")
    output_fee: str = Field(..., description="Output token fee")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
            schema=OpenAIModelFee.model_json_schema(),  # Pydantic v2 name for the deprecated .schema()
            extraction_type="schema",
            instruction="""From the crawled content, extract all model names and their input/output token fees. Provide one JSON object per model."""
        ),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())