Firecrawl is an AI-powered web scraping API that converts web pages into clean Markdown or
structured data, optimized for use with large language models (LLMs) and retrieval-augmented
generation (RAG) pipelines. It handles JavaScript rendering, anti-bot bypass, and content
extraction automatically.
Firecrawl offers multiple modes:
- Scrape: Convert a single URL into clean Markdown, HTML, or structured data. Handles JavaScript
  rendering and anti-bot protections automatically.
- Crawl: Crawl an entire website starting from a URL, with configurable depth, URL patterns,
  and page limits. Returns all pages as clean Markdown.
- Map: Quickly discover all URLs on a website without fully scraping each page. Useful for
  sitemap generation and crawl planning.
- Extract: Use LLMs to extract specific structured data from pages based on a schema definition.
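To make the Crawl options concrete, the sketch below shows one way a depth limit, URL patterns, and a page limit could interact when deciding which discovered links to visit. It is plain Python with no network access. The `in_scope` and `plan_crawl` helpers, the glob-style patterns, and the toy link graph are all hypothetical illustrations, not Firecrawl's implementation or API.

```python
from collections import deque
from fnmatch import fnmatch
from urllib.parse import urlparse


def in_scope(url, root, include=(), exclude=()):
    """Return True if url is on root's host and passes the path patterns."""
    parsed, root_parsed = urlparse(url), urlparse(root)
    if parsed.netloc != root_parsed.netloc:
        return False  # stay on the starting site
    path = parsed.path or "/"
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    # An empty include list means "everything not excluded".
    return not include or any(fnmatch(path, pat) for pat in include)


def plan_crawl(root, links, max_depth=2, limit=100, include=(), exclude=()):
    """Breadth-first walk over a link graph, honoring depth and page limits."""
    seen, order = {root}, []
    queue = deque([(root, 0)])
    while queue and len(order) < limit:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen and in_scope(nxt, root, include, exclude):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Toy link graph standing in for pages discovered during a crawl (made up).
links = {
    "https://example.com": [
        "https://example.com/blog/a",
        "https://example.com/admin/login",
        "https://other.org/x",
    ],
    "https://example.com/blog/a": ["https://example.com/blog/b"],
    "https://example.com/blog/b": ["https://example.com/blog/c"],
}

pages = plan_crawl("https://example.com", links, max_depth=2, exclude=["/admin/*"])
print(pages)
```

In this toy graph the admin page is dropped by the exclude pattern, the off-site link is dropped by the host check, and `/blog/c` is never reached because it sits beyond the depth limit.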
Key features:
- Clean Markdown output ideal for LLM context windows
- Automatic JavaScript rendering with headless browsers
- Built-in anti-bot bypass for protected websites
- Structured extraction with JSON schemas
- Batch crawling with webhook notifications
- Python and JavaScript SDKs
Firecrawl is a commercial API service backed by Y Combinator; it requires an API key and
has a free tier. It has become one of the most popular tools for feeding web content
into AI applications and is widely used in the LLM/RAG ecosystem.
Note: although the primary offering is a hosted API, the core is open source and can be self-hosted.
ralger is a small web scraping framework for R based on rvest and xml2.
Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.
It offers functions for retrieving pages, parsing HTML with CSS selectors, parsing tables
automatically, and extracting links, titles, images, and paragraphs.
```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# Scrape a single page - get clean markdown
result = app.scrape_url("https://example.com/blog/article")
print(result["markdown"])  # clean markdown content

# Extract structured data with a schema
result = app.scrape_url(
    "https://example.com/product/123",
    params={
        "formats": ["extract"],
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "description": {"type": "string"},
                },
            }
        },
    },
)
print(result["extract"])  # {"name": "...", "price": 29.99, ...}

# Crawl an entire website
crawl_result = app.crawl_url(
    "https://example.com",
    params={"limit": 100, "scrapeOptions": {"formats": ["markdown"]}},
)
for page in crawl_result["data"]:
    print(page["metadata"]["title"], page["markdown"][:100])

# Map all URLs on a site
map_result = app.map_url("https://example.com")
print(f"Found {len(map_result['links'])} URLs")
```
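Since the extract step relies on an LLM, it can be worth checking the returned dictionary against the schema before using it downstream. The following is a minimal sketch of such a check for flat schemas like the one in the example above. `check_against_schema` is a hypothetical helper, not part of the Firecrawl SDK, and both the `required` list and the sample result are made up for illustration.

```python
# Maps JSON Schema type names to the Python types that satisfy them.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_against_schema(data, schema):
    """Return a list of problems in a flat dict; an empty list means it conforms."""
    problems = []
    required = schema.get("required", [])
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            if key in required:
                problems.append(f"missing required field: {key}")
            continue
        value = data[key]
        type_ok = isinstance(value, JSON_TYPES[spec["type"]])
        # bool is a subclass of int in Python, so only accept it for "boolean".
        if isinstance(value, bool) and spec["type"] != "boolean":
            type_ok = False
        if not type_ok:
            problems.append(
                f"{key}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
    },
    "required": ["name", "price"],  # assumption: not in the original schema
}

# Hypothetical extraction result where the price came back as a string.
sample = {"name": "Widget", "price": "29.99"}
problems = check_against_schema(sample, schema)
print(problems)
```

For nested schemas or full JSON Schema semantics, a dedicated validation library would be a better fit than this hand-rolled check.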
```r
library("ralger")
url <- "http://www.shanghairanking.com/rankings/arwu/2021"
# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a",      # the a tag
  attr = "class"   # getting the class attribute
)
head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA

# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
```