name: crawl4ai description: "Comprehensive guide for Crawl4AI - open-source LLM-friendly web crawler and scraper with 62k+ GitHub stars. Covers CLI commands (crwl), Python API, Docker API, deep crawling strategies (BFS, DFS, best-first), content extraction (CSS selectors, LLM-based extraction), markdown/json output, anti-bot bypass, and advanced features for RAG pipelines, data extraction, and batch URL processing."
Crawl4AI - Open Source Web Crawler
Full CLI docs: docs.crawl4ai.com/core/cli/
Ready-to-use scripts: See the scripts/ folder
Detailed SDK reference: See the references/ folder
Crawl4AI is an open-source (Apache 2.0), LLM-friendly web crawler that transforms websites into clean Markdown or structured JSON. 62k+ GitHub stars.
Verify Availability
Assume Crawl4AI is already installed. Verify the interface you plan to use instead of repeating installation steps.
# CLI present on PATH
which crwl
# CLI responds (there is no top-level --version flag)
crwl crawl -h
# Python package version when using the pipx interpreter
~/.local/pipx/venvs/crawl4ai/bin/python - <<'PY'
from importlib.metadata import version
print(version('crawl4ai'))
PY
# Docker health and version
curl http://localhost:11235/health
docker ps --filter "name=crawl4ai" --format "{{.Image}} {{.Status}}"
Verified in session: CLI at /Users/Y_ALTAY1/.local/bin/crwl, Python package 0.8.6 in the pipx interpreter, Docker health endpoint reporting version 0.8.5.
Docker endpoints: Dashboard http://localhost:11235/dashboard · Playground http://localhost:11235/playground · API http://localhost:11235/crawl
Default Choice
Use the interface that matches the job:
| Situation | Default |
|---|---|
| Agent automation, custom scripts, extraction pipelines | Python SDK |
| Quick ad-hoc scrape from terminal | CLI |
| Shared service, remote access, dashboard, batch API | Docker API |
Recommended default: prefer the Python SDK when you are building repeatable automation or agent workflows. It gives more control than the CLI and avoids the operational overhead of Docker.
Python Import Caveat
If Crawl4AI was installed with pipx, plain python3 usually will not import crawl4ai because pipx keeps the package in its own virtual environment.
# CLI still works
crwl https://example.com
# But system python3 may not see the package
python3 -c "import crawl4ai"
# Use the pipx interpreter for SDK scripts
~/.local/pipx/venvs/crawl4ai/bin/python your_script.py
Use python3 normally only when crawl4ai is installed in that exact Python environment. For this skill, prefer the pipx interpreter path when using the SDK.
Ready-to-Use Scripts
The scripts/ folder contains ready-to-use Python scripts:
| Script | Usage |
|---|---|
| basic_crawler.py | Simple markdown extraction |
| batch_crawler.py | Batch process multiple URLs |
| extraction_pipeline.py | Data extraction with auto-schema generation |
# Run scripts
python scripts/basic_crawler.py https://example.com
python scripts/batch_crawler.py urls.txt
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
Docker API Features
The Docker API provides advanced capabilities beyond the CLI:
| Feature | Description |
|---|---|
| Batch processing | Crawl 100s of URLs in a single request |
| Async queue | Submit job, get task ID, check results later |
| Dashboard UI | Visual monitoring at http://localhost:11235/dashboard |
| Remote deployment | Run on server, access via API |
| Priority queuing | Set priority for important jobs first |
| Better scaling | Multiple containers, load balancing |
| Centralized caching | Shared cache across requests |
Docker API Endpoints
# Health check
curl http://localhost:11235/health
# Submit crawl job
curl -X POST http://localhost:11235/crawl \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "priority": 10}'
# Get task status (if async)
curl http://localhost:11235/crawl/{task_id}
# Batch crawl multiple URLs
curl -X POST http://localhost:11235/crawl \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://altin.doviz.com/gram-altin",
"https://kur.doviz.com/serbest-piyasa/amerikan-dolari"
]
}'
When to Use Docker vs CLI
| Use Case | Best Choice |
|---|---|
| Quick one-off scrape | CLI (crwl) |
| Production monitoring | Docker API |
| Batch URLs (100s) | Docker API |
| Local dev/testing | CLI |
| Schedule recurring jobs | Docker API |
| LLM extraction with schema | Docker API |
Quick CLI Reference
# Basic crawl
crwl https://example.com
# Output formats: all, json, markdown, markdown-fit (md-fit)
crwl https://example.com -o json
crwl https://example.com -o markdown-fit
# Output to file
crwl https://example.com -o json -O output.json
# Bypass cache (force fresh crawl)
crwl https://example.com -bc
# Deep crawl: bfs, dfs, best-first
crwl https://docs.site.com --deep-crawl bfs --max-pages 10
# LLM Q&A - ask questions about page content
crwl https://example.com -q "What is the main topic?"
# LLM extraction - extract structured data with AI
crwl https://example.com -j "extract all prices and dates"
# JSON schema extraction - extract with CSS selectors
crwl https://example.com -s schema.json -o json
# With config files
crwl https://example.com -B browser.yml -C crawler.yml
# Browser parameters inline
crwl https://example.com -b "headless=true,timeout=30000"
# Crawler parameters (user-agent, etc.)
crwl https://example.com -c "user_agent_mode=random"
# Anti-bot bypass for protected sites (Reddit, etc.)
crwl https://www.reddit.com/r/politics/hot/ -o markdown -c "user_agent_mode=random"
Python API
Use this as the default programmable interface. In this session it handled both normal pages and protected Reddit URLs/search pages when user_agent_mode="random" was set.
Basic Usage
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
print(result.markdown)
asyncio.run(main())
With Configuration
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
browser = BrowserConfig(headless=True, user_agent_mode="random")
run_cfg = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
wait_until="networkidle",
word_count_threshold=100,
)
async with AsyncWebCrawler(config=browser) as crawler:
result = await crawler.arun(url="https://example.com", config=run_cfg)
print(result.success)
print(result.metadata.get("title"))
asyncio.run(main())
Recommended Default Pattern
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
browser = BrowserConfig(headless=True, user_agent_mode="random")
run_cfg = CrawlerRunConfig(wait_until="networkidle")
async with AsyncWebCrawler(config=browser) as crawler:
result = await crawler.arun(
url="https://www.reddit.com/search/?q=gold%20price%20drop&sort=relevance",
config=run_cfg,
)
print(result.success)
print(result.metadata.get("title"))
print(result.markdown[:1000])
asyncio.run(main())
Use this pattern first, then add extraction, proxies, hooks, or deep crawling as needed.
Run SDK scripts with the pipx interpreter when python3 cannot import the package:
~/.local/pipx/venvs/crawl4ai/bin/python your_script.py
CrawlResult Properties
| Property | Description |
|---|---|
| url | Crawled URL |
| markdown | Clean markdown content |
| html | Raw HTML |
| success | Boolean success status |
| links | Internal/external links dict |
| metadata | Page metadata |
| extracted_content | Structured data (if extraction enabled) |
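A short sketch reading these properties from a single crawl (example.com is just a placeholder URL):
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            print(result.url)                                        # crawled URL
            print(result.metadata.get("title"))                      # page metadata
            print(len(result.links.get("internal", [])), "internal links")
            print(result.markdown[:200])                             # clean markdown

asyncio.run(main())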
Deep Crawling
For detailed deep crawling documentation, see references/deep-crawling.md.
Strategies
| Strategy | Description | Best For |
|---|---|---|
| BFSDeepCrawlStrategy | Breadth-first (level by level) | Comprehensive coverage |
| DFSDeepCrawlStrategy | Depth-first (follows one path) | Documentation sites |
| BestFirstCrawlingStrategy | Priority-based with scorers | Focused relevance |
See references/deep-crawling.md for streaming, filtering, crash recovery, and prefetch mode.
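A minimal BFS sketch, assuming the strategy classes import from crawl4ai.deep_crawling as in recent releases; see references/deep-crawling.md for exact parameters and streaming mode:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    run_cfg = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False),
    )
    async with AsyncWebCrawler() as crawler:
        # Non-streaming deep crawl returns a list of CrawlResult objects
        results = await crawler.arun(url="https://docs.example.com", config=run_cfg)
        for r in results:
            print(r.url, r.success)

asyncio.run(main())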
Extraction Strategies
For detailed extraction docs, see references/extraction.md.
| Strategy | Use Case |
|---|---|
| CSS-based | Fast, no AI needed - define JSON schema |
| LLM-based | Complex pages, AI extracts structured data |
# CSS extraction example
from crawl4ai import CrawlerRunConfig, JsonCssExtractionStrategy
schema = {"name": "Products", "baseSelector": ".product",
          "fields": [{"name": "title", "selector": "h3", "type": "text"}]}
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
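A minimal end-to-end sketch using the CSS schema above (the .product/h3 selectors and the shop URL are placeholders; extracted_content is returned as a JSON string):
import asyncio, json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

async def main():
    schema = {"name": "Products", "baseSelector": ".product",
              "fields": [{"name": "title", "selector": "h3", "type": "text"}]}
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        # Placeholder URL - point this at the real product listing page
        result = await crawler.arun(url="https://shop.example.com/products", config=config)
        if result.success and result.extracted_content:
            items = json.loads(result.extracted_content)  # JSON string -> list of dicts
            print(len(items), "items extracted")

asyncio.run(main())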
Advanced Features
For complete advanced features (proxy, session, hooks, anti-bot), see references/config.md.
Proxy Configuration
from crawl4ai.async_configs import ProxyConfig
config = CrawlerRunConfig(
proxy_config=[
ProxyConfig.DIRECT,
ProxyConfig(server="http://proxy:8080", username="user", password="pass"),
],
max_retries=2
)
Session Management
browser_config = BrowserConfig(
user_data_dir="/path/to/profile",
use_persistent_context=True
)
Page Interaction (JavaScript)
config = CrawlerRunConfig(
js_code=["(async () => { await document.querySelector('button').click(); })()"],
wait_until="networkidle"
)
Hooks
async def on_page_created(page, context, **kwargs):
await context.route("**/*.png", lambda r: r.abort())
return page
config = CrawlerRunConfig(hooks={"on_page_context_created": on_page_created})
Anti-Bot (v0.8.5+)
config = CrawlerRunConfig(
proxy_config=[ProxyConfig.DIRECT, ProxyConfig(server="http://residential-proxy:8080")],
max_retries=2,
flatten_shadow_dom=True
)
Docker API
import requests
# Single URL crawl
resp = requests.post("http://localhost:11235/crawl", json={
"urls": ["https://example.com"],
"priority": 10
})
result = resp.json()
# Batch multiple URLs (recommended for parallel crawling)
resp = requests.post("http://localhost:11235/crawl", json={
"urls": [
"https://www.reddit.com/r/politics/hot/",
"https://www.reddit.com/r/politics/new/",
"https://www.reddit.com/r/politics/top/?t=week"
],
"options": {
"output_format": "markdown",
"user_agent_mode": "random"
}
})
results = resp.json()["results"] # Array of crawl results
CLI vs Python SDK vs Docker
| Aspect | CLI | Python SDK | Docker API |
|---|---|---|---|
| Best for | Quick terminal use | Programmable automation | Shared service/API |
| Speed | Fastest for one-offs | Fast | Slightly slower (HTTP overhead) |
| Control | Lowest | Highest | Medium |
| Batch URLs | Limited | Full control in code | Yes (JSON array) |
| Web UI | No | No | Dashboard + Playground |
| REST API | No | No | Yes |
| Deep crawl | Simple flags | Full strategy objects | Request payload |
| Anti-bot tuning | Basic | Better | Better |
| Operational overhead | Lowest | Low | Highest |
When to Use Which
| Use Case | Choose |
|---|---|
| Quick one-off scrape | CLI |
| Scripted automation | Python SDK |
| Build custom extraction pipeline | Python SDK |
| Repeatable agent workflow | Python SDK |
| API for other services | Docker API |
| Dashboard / playground / remote service | Docker API |
| Batch parallel crawl from another app | Docker API |
| Fast terminal debugging | CLI |
Common Use Cases
| Use Case | Approach |
|---|---|
| RAG Pipeline | crwl <url> -o markdown-fit |
| News Aggregation | Batch crawl multiple sources |
| Documentation | Deep crawl with BFS |
| Site Mapping | Prefetch mode |
| Price Monitoring | Scheduled crawl + extraction |
| Lead Generation | LLM extraction for contact info |
Best Practices
- Start simple - test single page before deep crawl
- Use filters - target specific content
- Limit scope - use max_pages to control costs
- Handle blocks - add proxies for protected sites
- Monitor - use dashboard for production runs
Troubleshooting
| Issue | Solution |
|---|---|
| python3 cannot import crawl4ai | If installed via pipx, use ~/.local/pipx/venvs/crawl4ai/bin/python or install into a project venv |
| No output | Check browser accessibility, try CacheMode.BYPASS |
| Anti-bot blocks (Reddit, etc.) | Use user_agent_mode=random (CLI: -c "user_agent_mode=random", Docker: "user_agent_mode": "random") |
| Need a default interface | Use Python SDK for automation, CLI for quick manual runs, Docker API for service-style usage |
| Reddit shows "Prove you're human" | Add user_agent_mode=random - works reliably |
| Slow performance | Use prefetch mode, enable caching |
| JS not rendering | Increase timeout, check wait_until |