Firecrawl API: Converting Any Website Into Clean Markdown for LLMs

Published July 4 in Web Scraping Tools

Fireplexity is a high-speed AI search engine built on the Firecrawl web scraping API. It provides precise answers backed by live citations and current data. By leveraging Firecrawl to pull real-time web results and GPT-4o-mini to stream responses, the engine ensures every claim is sourced. It further enhances the user experience by integrating TradingView stock charts and suggesting relevant follow-up questions.

Firecrawl itself is an API service built for data extraction. You provide a URL, and it fetches the page and returns clean, structured markdown. It can also follow links to scrape every reachable subpage without requiring a sitemap, and the output is formatted for direct use as LLM input.

The service is available as a managed cloud platform as well as an open-source core for self-hosting.

API Key

To get started, sign up at Firecrawl to obtain your API key.

Core Capabilities

Scrape: Retrieve content from a single URL in various formats, including markdown, structured data (via LLM), screenshots, or raw HTML.
Crawl: Begin with a root URL and systematically extract content from all linked internal pages in LLM-friendly formats.
Map: Rapidly generate a comprehensive list of every URL associated with a specific website.
Search: Query the broader web and extract the full content of the top results.
Extract: Utilize AI to identify and pull structured data from a single page, a group of pages, or an entire domain.
LLM-Ready Formats: Native support for Markdown, structured JSON, screenshots, HTML, links, and metadata.
Complex Task Management: Firecrawl handles proxies, anti-bot protections, JavaScript rendering, parsing, and multi-page orchestration.
High Configurability: Users can exclude specific tags, crawl authenticated areas using custom headers, and define maximum crawl depths.
Media Support: Capability to process PDFs, DOCX files, and images.
Reliability: Built to successfully fetch data even from highly restricted or technically stubborn websites.
Browser Actions: Support for pre-extraction automation, such as clicking, scrolling, typing, and waiting.
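
The configurability options in the list above (excluded tags, custom headers, a maximum crawl depth) could be expressed as a crawl request body along the following lines. This is only a sketch: build_crawl_payload is a hypothetical helper, and the field names excludeTags, headers, and maxDepth are assumptions based on v1 API naming that should be checked against the current API reference.

```python
import json


def build_crawl_payload(url, max_depth=2, exclude_tags=None, headers=None, limit=10):
    """Assemble a JSON-serializable crawl request body (hypothetical helper)."""
    scrape_options = {"formats": ["markdown"]}
    if exclude_tags:
        scrape_options["excludeTags"] = list(exclude_tags)  # assumed field name
    if headers:
        scrape_options["headers"] = dict(headers)  # assumed field name
    return {
        "url": url,
        "limit": limit,
        "maxDepth": max_depth,  # assumed field name
        "scrapeOptions": scrape_options,
    }


payload = build_crawl_payload(
    "https://firecrawl.dev",
    max_depth=3,
    exclude_tags=["nav", "footer"],
    headers={"Cookie": "session=EXAMPLE"},  # placeholder value, not a real session
)
print(json.dumps(payload, indent=2))
```

Keeping the payload as a plain dictionary makes it easy to inspect before sending it with whichever HTTP client or SDK call you use.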

Installing Firecrawl

Python

pip install firecrawl-py

Scraping a URL

Use the scrape_url method to fetch data. Provide the target URL and specify your desired output formats.

Python Example

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

scrape_result = app.scrape_url('firecrawl.dev', formats=['markdown', 'html'])
print(scrape_result)

Response Format

The SDK returns the data object directly. A standard API call via cURL produces a payload structured as follows:

{
  "success": true,
  "data": {
    "markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
    "html": "<!DOCTYPE html><html lang=\"en\" class=\"light\" style=\"color-scheme: light;\"><body class=\"__variable_36bd41 __variable_d7dc5d font-inter ...",
    "metadata": {
      "title": "Home - Firecrawl",
      "description": "Firecrawl crawls and converts any website into clean markdown.",
      "language": "en",
      "keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "Turn any website into LLM-ready data.",
      "ogUrl": "https://www.firecrawl.dev/",
      "ogImage": "https://www.firecrawl.dev/og.png?123",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev",
      "statusCode": 200
    }
  }
}
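
When working with the raw payload rather than the SDK object, a small helper can pull out the fields most pipelines need. The sketch below assumes the response has already been parsed into a Python dictionary with the shape shown above; summarize_scrape is a hypothetical name, and the sample data is trimmed from the example response.

```python
def summarize_scrape(payload):
    """Return (title, source URL, markdown) from a raw scrape response dict."""
    if not payload.get("success"):
        raise ValueError("scrape did not succeed")
    data = payload["data"]
    meta = data.get("metadata", {})
    # Treat an HTTP error status on the target page as a failure.
    if meta.get("statusCode", 200) >= 400:
        raise ValueError(f"target page returned HTTP {meta['statusCode']}")
    return meta.get("title"), meta.get("sourceURL"), data.get("markdown", "")


sample = {
    "success": True,
    "data": {
        "markdown": "Launch Week I is here!",
        "metadata": {
            "title": "Home - Firecrawl",
            "sourceURL": "https://firecrawl.dev",
            "statusCode": 200,
        },
    },
}
title, source, markdown = summarize_scrape(sample)
print(title, source, len(markdown))
```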

Crawling a Website

Crawling initiates a process that captures a root URL and all accessible subpages. When you submit a crawl job, the API returns a unique ID to monitor the status of the task.

Python Example

from firecrawl import FirecrawlApp, ScrapeOptions

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

crawl_result = app.crawl_url(
    'https://firecrawl.dev',
    limit=10,
    scrape_options=ScrapeOptions(formats=['markdown', 'html']),
)
print(crawl_result)

When using cURL or the SDK’s asynchronous crawl function, the initial response includes a job ID:

{
  "success": true,
  "id": "123-456-789",
  "url": "https://api.firecrawl.dev/v1/crawl/123-456-789"
}

Checking Crawl Status

crawl_status = app.check_crawl_status("<crawl_id>")
print(crawl_status)
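
For long crawls you will usually check the status repeatedly rather than once. The following is a minimal polling sketch, not the SDK's own mechanism: wait_for_crawl is a hypothetical helper, it assumes the status call returns a dictionary like the raw JSON response, and it assumes a finished job reports a status other than "scraping" (such as "completed"). A stand-in function replaces app.check_crawl_status so the example runs offline.

```python
import time


def wait_for_crawl(check_status, crawl_id, interval=2.0, max_polls=30, sleep=time.sleep):
    """Poll check_status(crawl_id) until the job is no longer scraping."""
    for _ in range(max_polls):
        status = check_status(crawl_id)
        if status.get("status") != "scraping":
            return status
        sleep(interval)
    raise TimeoutError(f"crawl {crawl_id} still scraping after {max_polls} polls")


# Canned responses standing in for app.check_crawl_status (assumed shapes).
_responses = iter([
    {"status": "scraping", "total": 36, "completed": 10},
    {"status": "scraping", "total": 36, "completed": 30},
    {"status": "completed", "total": 36, "completed": 36},
])
final = wait_for_crawl(lambda _id: next(_responses), "123-456-789", sleep=lambda s: None)
print(final["status"], final["completed"])
```

Passing the status function and the sleep function as arguments keeps the loop testable; in real use you would pass app.check_crawl_status and leave sleep at its default.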

Response Processing

The response changes as the crawl progresses. When the crawl is still running or its results exceed 10MB, the response includes a next URL. Request that URL to fetch the next chunk of up to 10MB. If next is absent, there is no more crawl data to retrieve.

In-Progress Response

{
  "status": "scraping",
  "total": 36,
  "completed": 10,
  "creditsUsed": 10,
  "expiresAt": "2024-00-00T00:00:00.000Z",
  "next": "https://api.firecrawl.dev/v1/crawl/123-456-789?skip=10",
  "data": [
    {
      "markdown": "[Firecrawl Docs home page![light logo](https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/logo/light.svg)!...",
      "html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...",
      "metadata": {
        "title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
        "language": "en",
        "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
        "description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
        "ogLocaleAlternate": [],
        "statusCode": 200
      }
    }
  ]
}
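
Following the next links can be sketched as a simple loop. In this example collect_pages and fetch_json are hypothetical names: fetch_json stands for any function that requests a URL (with your Authorization header) and returns the parsed JSON. A dictionary of canned chunks replaces the network so the sketch runs as written.

```python
def collect_pages(first_response, fetch_json):
    """Follow `next` links and gather every entry from each `data` array."""
    pages = list(first_response.get("data", []))
    next_url = first_response.get("next")
    seen = set()
    while next_url and next_url not in seen:
        seen.add(next_url)  # guard against a repeated link causing a loop
        chunk = fetch_json(next_url)
        pages.extend(chunk.get("data", []))
        next_url = chunk.get("next")
    return pages


# Canned chunks keyed by URL, standing in for real API responses.
base = "https://api.firecrawl.dev/v1/crawl/123-456-789"
chunks = {
    f"{base}?skip=10": {"data": [{"markdown": "page 11"}], "next": f"{base}?skip=11"},
    f"{base}?skip=11": {"data": [{"markdown": "page 12"}]},
}
first = {"status": "completed", "data": [{"markdown": "page 1"}], "next": f"{base}?skip=10"}
pages = collect_pages(first, chunks.__getitem__)
print(len(pages))
```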

Extracting Structured Data

LLM-driven extraction allows you to pull structured data from any URL. You can define a specific Pydantic schema for the output or provide a natural language prompt for the AI to interpret.

Python Example with Schema

from firecrawl import JsonConfig, FirecrawlApp
from pydantic import BaseModel

app = FirecrawlApp(api_key="<YOUR_API_KEY>")

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

json_config = JsonConfig(schema=ExtractSchema)

llm_extraction_result = app.scrape_url(
    'https://firecrawl.dev',
    formats=["json"],
    json_options=json_config,
    only_main_content=False,
    timeout=120000
)

print(llm_extraction_result.json)

Extraction Output

{
  "success": true,
  "data": {
    "json": {
      "company_mission": "AI-powered web scraping and data extraction",
      "supports_sso": true,
      "is_open_source": true,
      "is_in_yc": true
    },
    "metadata": {
      "title": "Firecrawl",
      "description": "AI-powered web scraping and data extraction",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "AI-powered web scraping and data extraction",
      "ogUrl": "https://firecrawl.dev/",
      "ogImage": "https://firecrawl.dev/og.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev/"
    }
  }
}
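
If you consume the raw payload, it can be worth checking the json block against the expected fields before using it. The sketch below swaps in a standard-library dataclass for the Pydantic model so it runs without extra dependencies; ExtractResult and parse_extraction are hypothetical names, and with Pydantic installed the original ExtractSchema could perform the same validation.

```python
from dataclasses import dataclass, fields


@dataclass
class ExtractResult:
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool


def parse_extraction(payload):
    """Check the `json` block field by field and build an ExtractResult."""
    raw = payload["data"]["json"]
    for f in fields(ExtractResult):
        if f.name not in raw:
            raise KeyError(f"missing field: {f.name}")
        if not isinstance(raw[f.name], f.type):
            raise TypeError(f"{f.name} should be {f.type.__name__}")
    return ExtractResult(**{f.name: raw[f.name] for f in fields(ExtractResult)})


sample = {
    "success": True,
    "data": {
        "json": {
            "company_mission": "AI-powered web scraping and data extraction",
            "supports_sso": True,
            "is_open_source": True,
            "is_in_yc": True,
        }
    },
}
result = parse_extraction(sample)
print(result.company_mission)
```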

Schema-Free Extraction

Alternatively, you can skip the formal schema and provide a prompt. The LLM will then determine the most logical structure for the output.

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["json"],
      "jsonOptions": {
        "prompt": "Extract the company mission from the page."
      }
    }'
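
The same request can be assembled in Python with only the standard library. This sketch builds the request object without sending it, so it runs without a key; build_prompt_request is a hypothetical helper, while the endpoint, headers, and body mirror the cURL call above.

```python
import json
import urllib.request


def build_prompt_request(api_key, url, prompt):
    """Build (but do not send) the POST request shown in the cURL call."""
    body = {"url": url, "formats": ["json"], "jsonOptions": {"prompt": prompt}}
    return urllib.request.Request(
        "https://api.firecrawl.dev/v1/scrape",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_prompt_request(
    "YOUR_API_KEY",
    "https://docs.firecrawl.dev/",
    "Extract the company mission from the page.",
)
print(req.get_method(), req.full_url)
```

To actually send it, pass req to urllib.request.urlopen with a real API key and parse the response body with json.loads.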

Schema-Free Output

{
  "success": true,
  "data": {
    "json": {
      "company_mission": "AI-powered web scraping and data extraction"
    },
    "metadata": {
      "title": "Firecrawl",
      "description": "AI-powered web scraping and data extraction",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "AI-powered web scraping and data extraction",
      "ogUrl": "https://firecrawl.dev/",
      "ogImage": "https://firecrawl.dev/og.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev/"
    }
  }
}

Page Actions

Firecrawl can perform specific browser interactions before extracting data. This is useful for dealing with dynamic content, navigating menus, or any scenario requiring user input.

The following example demonstrates navigating to Google, searching for "Firecrawl," selecting the first result, and capturing a screenshot. It is important to include wait actions to ensure elements load properly between steps.

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Start on Google; the actions below run the search and open the first result.
scrape_result = app.scrape_url('google.com',
    formats=['markdown', 'html'],
    actions=[
        {"type": "wait", "milliseconds": 2000},
        {"type": "click", "selector": "textarea[title=\"Search\"]"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "write", "text": "firecrawl"},
        {"type": "wait", "milliseconds": 2000},
        {"type": "press", "key": "ENTER"},
        {"type": "wait", "milliseconds": 3000},
        {"type": "click", "selector": "h3"},
        {"type": "wait", "milliseconds": 3000},
        {"type": "scrape"},
        {"type": "screenshot"}
    ]
)
print(scrape_result)
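
Because a wait is needed between most steps, it can be convenient to generate the action list rather than write each wait by hand. The helper below is a hypothetical convenience, not part of the SDK: with_waits places a fixed wait before every step, using the same action dictionaries as the example above.

```python
def with_waits(steps, milliseconds=2000):
    """Return a new action list with a wait action before each step."""
    actions = []
    for step in steps:
        actions.append({"type": "wait", "milliseconds": milliseconds})
        actions.append(step)
    return actions


steps = [
    {"type": "click", "selector": 'textarea[title="Search"]'},
    {"type": "write", "text": "firecrawl"},
    {"type": "press", "key": "ENTER"},
    {"type": "click", "selector": "h3"},
    {"type": "screenshot"},
]
actions = with_waits(steps)
print(len(actions))
```

The resulting list can be passed as the actions argument. A uniform delay is a simplification; steps that trigger page navigation may need a longer wait than the others, as in the 3000 ms waits above.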

Action Output

{
  "success": true,
  "data": {
    "markdown": "Our first Launch Week is over! [See the recap 🚀](blog/firecrawl-launch-week-1-recap)...",
    "actions": {
      "screenshots": [
        "https://alttmdsdujxrfnakrkyi.supabase.co/storage/v1/object/public/media/screenshot-75ef2d87-31e0-4349-a478-fb432a29e241.png"
      ],
      "scrapes": [
        {
          "url": "https://www.firecrawl.dev/",
          "html": "<html><body><h1>Firecrawl</h1></body></html>"
        }
      ]
    },
    "metadata": {
      "title": "Home - Firecrawl",
      "description": "Firecrawl crawls and converts any website into clean markdown.",
      "language": "en",
      "keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "Turn any website into LLM-ready data.",
      "ogUrl": "https://www.firecrawl.dev/",
      "ogImage": "https://www.firecrawl.dev/og.png?123",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "http://google.com",
      "statusCode": 200
    }
  }
}

Open Source vs. Cloud

Firecrawl is open-source software released under the AGPL-3.0 license, allowing for self-hosting. However, the managed cloud version at firecrawl.dev provides several advanced features not included in the basic core.

Feature	Open Source	Cloud
Scrape	Yes	Yes
Crawl	Yes	Yes
LLM Extract	Yes	Yes
Map	Yes	Yes
LLM-Ready Formats	Yes	Yes
SDKs	Yes	Yes
Bot Bypass	No	Yes
Proxy Rotation	No	Yes
Proxy Dashboard	No	Yes
Actions	No	Yes
Enterprise Headless Browser	No	Yes
Headless Browser Scraping	Yes	Yes