ralger is a small web scraping framework for R built on rvest and xml2.
Its goal is to simplify basic web scraping through a convenient, easy-to-use API.
It offers functions for retrieving pages, parsing HTML with CSS selectors, automatic table parsing, and
automatic extraction of links, titles, images, and paragraphs.
Katana is a next-generation web crawling and spidering framework written in Go by
ProjectDiscovery. It is designed for fast, comprehensive endpoint and asset discovery
and is widely used in the security research and bug bounty communities.
Katana offers multiple crawling modes:
- Standard mode
  Fast HTTP-based crawling without a browser. Parses HTML, JavaScript files, and other
  resources to discover endpoints and links.
- Headless mode
  Uses a headless Chrome browser for crawling JavaScript-rendered pages and single-page
  applications (SPAs).
- Passive mode
  Discovers URLs from external sources (Wayback Machine, CommonCrawl, etc.) without
  actively visiting the target.
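A rough sketch of how these modes are selected on the command line (the flag names here are assumptions based on recent katana releases; check `katana -h` for the exact options available in your version):

```shell
# Standard mode (default): fast HTTP-based crawling
katana -u https://example.com

# Headless mode: render JavaScript-heavy pages in headless Chrome
katana -u https://example.com -headless

# Passive mode: gather URLs from external sources without touching the target
katana -u https://example.com -passive
```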
Key features include:
- Scope control
  Configurable crawl scope with regex patterns for including/excluding URLs, domains,
  and file extensions.
- JavaScript parsing
  Extracts endpoints from JavaScript files, inline scripts, and AJAX requests even in
  standard (non-headless) mode.
- Customizable output
  Filter and format output with field selection, JSON output, and custom templates.
- Rate limiting
  Built-in rate limiting and concurrency control to avoid overwhelming targets.
- Proxy support
  HTTP and SOCKS5 proxy support with rotation.
- Form filling
  Can detect and auto-fill forms to discover endpoints behind form submissions.
While Katana was designed for security research and reconnaissance, its fast crawling
capabilities and JavaScript parsing make it equally useful for web scraping discovery
and sitemap generation.
```r
library("ralger")
url <- "http://www.shanghairanking.com/rankings/arwu/2021"
# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)
head(attributes, 10) # NA values are a tags without a class attribute
#> [1] "navbar-brand logo" "nav-link" NA
#> [4] NA NA "nav-link"
#> [7] NA "nav-link" NA
#> [10] NA
# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")
head(data)
#> # A tibble: 6 × 4
#> Rank Title `Lifetime Gross` Year
#>
#> 1 1 Avatar $2,847,397,339 2009
#> 2 2 Avengers: Endgame $2,797,501,328 2019
#> 3 3 Titanic $2,201,647,264 1997
#> 4 4 Star Wars: Episode VII - The Force Awakens $2,069,521,700 2015
#> 5 5 Avengers: Infinity War $2,048,359,754 2018
#> 6 6 Spider-Man: No Way Home $1,901,216,740 2021
```
```go
package main

import (
	"math"

	"github.com/projectdiscovery/katana/pkg/engine/standard"
	"github.com/projectdiscovery/katana/pkg/output"
	"github.com/projectdiscovery/katana/pkg/types"
)

func main() {
	// Configure crawl options.
	options := &types.Options{
		MaxDepth:     3,           // maximum crawl depth
		FieldScope:   "rdn",       // restrict crawling to the root domain name
		BodyReadSize: math.MaxInt, // read full response bodies
		Timeout:      10,          // request timeout in seconds
		Concurrency:  10,
		Parallelism:  10,
		Delay:        0,
		RateLimit:    150, // maximum requests per second
		Strategy:     "depth-first",
		OnResult: func(result output.Result) {
			// Process each discovered URL.
			println(result.Request.URL)
		},
	}

	// Create and run the crawler (error handling elided for brevity).
	crawlerOptions, _ := types.NewCrawlerOptions(options)
	defer crawlerOptions.Close()

	crawler, _ := standard.New(crawlerOptions)
	defer crawler.Close()

	// Start crawling.
	_ = crawler.Crawl("https://example.com")
}
```