Crawl4AI is an open-source AI-powered web crawling and data extraction library for Python.
It uses large language models (LLMs) to intelligently extract structured data from web pages
with minimal code. Unlike traditional scraping frameworks that rely on CSS selectors or XPath,
Crawl4AI can understand page content semantically and extract data based on natural language
descriptions of what you want.
Key features include:
- LLM-based extraction: Define what data you want in plain English, and Crawl4AI uses LLMs to find and extract it from the page content. It supports multiple LLM providers, including OpenAI, Anthropic, and local models.
- Automatic crawling: Built-in crawler with support for JavaScript rendering, parallel crawling, and session management.
- Structured output: Returns data in structured formats (JSON, Pydantic models), making it easy to integrate into data pipelines.
- Markdown conversion: Can convert web pages to clean Markdown, useful for feeding content to LLMs.
- Chunking strategies: Multiple strategies for breaking down large pages into processable chunks for LLM extraction.
- Async support: Built on async Python for efficient concurrent crawling and extraction.
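
To make the chunking idea concrete, here is a minimal standard-library sketch of one common approach: splitting text into overlapping word windows so that each chunk fits an LLM's input budget. It is not Crawl4AI's own implementation; the function name `chunk_words` and its parameters are hypothetical and chosen only for illustration.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    This is an illustrative helper, not part of the Crawl4AI API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Small demonstration: ten "words", windows of four, overlapping by one.
demo = chunk_words("0 1 2 3 4 5 6 7 8 9", chunk_size=4, overlap=1)
print(demo)  # ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Word-based windows are only a rough proxy for token counts; a real pipeline would likely size chunks with the tokenizer of the chosen model.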
Crawl4AI is particularly useful for scraping unstructured content where writing traditional
CSS/XPath selectors would be tedious or fragile. It excels at content extraction, article
parsing, and data mining from diverse page layouts.
ralger is a small web scraping framework for R based on rvest and xml2.
Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.
It offers functions for retrieving pages, parsing HTML with CSS selectors, parsing tables automatically,
and extracting links, titles, images, and paragraphs.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy


async def main():
    # Basic crawling - get page as markdown
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # clean markdown content

    # AI-powered extraction with structured output
    strategy = LLMExtractionStrategy(
        instruction="Extract all product names and prices from this page",
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config,
        )
        print(result.extracted_content)  # structured JSON output


asyncio.run(main())
```
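
Since the extraction result comes back as JSON, a downstream step usually has to turn that string into typed records. The sketch below shows one way to do this with only the standard library. The record shape (a list of objects with `name` and `price` keys), the `Product` dataclass, and the `parse_products` helper are all assumptions for illustration; the real shape depends on the instruction given to the LLM and on the model's response.

```python
import json
from dataclasses import dataclass


@dataclass
class Product:
    # Hypothetical record type; field names are assumed, not defined by Crawl4AI.
    name: str
    price: str


def parse_products(extracted_content: str) -> list[Product]:
    """Parse a JSON string of product records, skipping malformed entries."""
    data = json.loads(extracted_content)
    # Accept either a single object or a list of objects.
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return []

    products = []
    for item in data:
        # Keep only entries that carry both assumed fields.
        if isinstance(item, dict) and "name" in item and "price" in item:
            products.append(Product(name=str(item["name"]), price=str(item["price"])))
    return products


# Stand-in for result.extracted_content; the second entry lacks a price and is skipped.
sample = '[{"name": "Widget", "price": "$9.99"}, {"name": "Gadget"}]'
print(parse_products(sample))
```

Skipping incomplete entries rather than raising is a design choice suited to LLM output, which may omit fields; a stricter pipeline could log or reject such records instead.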
```r
library("ralger")
url <- "http://www.shanghairanking.com/rankings/arwu/2021"
# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a",     # the a tag
  attr = "class"  # getting the class attribute
)
head(attributes, 10) # NA values are a tags without a class attribute
#> [1] "navbar-brand logo" "nav-link" NA
#> [4] NA NA "nav-link"
#> [7] NA "nav-link" NA
#> [10] NA
# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
#> # A tibble: 6 × 4
#>   Rank Title                                      `Lifetime Gross` Year
#>
#> 1    1 Avatar                                     $2,847,397,339   2009
#> 2    2 Avengers: Endgame                          $2,797,501,328   2019
#> 3    3 Titanic                                    $2,201,647,264   1997
#> 4    4 Star Wars: Episode VII - The Force Awakens $2,069,521,700   2015
#> 5    5 Avengers: Infinity War                     $2,048,359,754   2018
#> 6    6 Spider-Man: No Way Home                    $1,901,216,740   2021
```