Skip to content

node-crawlervskatana

MIT 30 6 6,790
15.3 thousand (month) Sep 10 2012 2.0.2(2025-05-28 09:36:01 ago)
16,499 6 18 MIT
Nov 07 2022 v1.5.0(2026-03-10 14:52:47 ago)

node-crawler is a popular web scraping library for Node.js that allows you to easily navigate and extract data from websites. It has a simple API and supports concurrency, making it efficient for scraping large numbers of pages.

Features:

  • Server-side DOM & automatic jQuery insertion with Cheerio (default) or JSDOM,
  • Configurable pool size and retries,
  • Control rate limit,
  • Priority queue of requests,
  • forceUTF8 mode to let crawler deal for you with charset detection and conversion,
  • Compatible with 4.x or newer version.
  • Http2 support
  • Proxy support

Katana is a next-generation web crawling and spidering framework written in Go by ProjectDiscovery. It is designed for fast, comprehensive endpoint and asset discovery and is widely used in the security research and bug bounty communities.

Katana offers multiple crawling modes:

  • Standard mode Fast HTTP-based crawling without a browser. Parses HTML, JavaScript files, and other resources to discover endpoints and links.
  • Headless mode Uses a headless Chrome browser for crawling JavaScript-rendered pages and single-page applications (SPAs).
  • Passive mode Discovers URLs from external sources (Wayback Machine, CommonCrawl, etc.) without actively visiting the target.

Key features include:

  • Scope control Configurable crawl scope with regex patterns for including/excluding URLs, domains, and file extensions.
  • JavaScript parsing Extracts endpoints from JavaScript files, inline scripts, and AJAX requests even in standard (non-headless) mode.
  • Customizable output Filter and format output with field selection, JSON output, and custom templates.
  • Rate limiting Built-in rate limiting and concurrency control to avoid overwhelming targets.
  • Proxy support HTTP and SOCKS5 proxy support with rotation.
  • Form filling Can detect and auto-fill forms to discover endpoints behind form submissions.

While Katana was designed for security research and reconnaissance, its fast crawling capabilities and JavaScript parsing make it equally useful for web scraping discovery and sitemap generation.

Highlights


fastpopularlarge-scale

Example Use


```javascript const Crawler = require('crawler'); const c = new Crawler({ maxConnections: 10, // This will be called for each crawled page callback: (error, res, done) => { if (error) { console.log(error); } else { const $ = res.$; // $ is Cheerio by default //a lean implementation of core jQuery designed specifically for the server console.log($('title').text()); } done(); } }); // Queue just one URL, with default callback c.queue('http://www.amazon.com'); // Queue a list of URLs c.queue(['http://www.google.com/','http://www.yahoo.com']); // Queue URLs with custom callbacks & parameters c.queue([{ uri: 'http://parishackers.org/', jQuery: false, // The global callback won't be called callback: (error, res, done) => { if (error) { console.log(error); } else { console.log('Grabbed', res.body.length, 'bytes'); } done(); } }]); // Queue some HTML code directly without grabbing (mostly for tests) c.queue([{ html: '

This is a test

' }]); ```
```go package main import ( "context" "math" "github.com/projectdiscovery/katana/pkg/engine/standard" "github.com/projectdiscovery/katana/pkg/output" "github.com/projectdiscovery/katana/pkg/types" ) func main() { // Configure crawl options options := &types.Options{ MaxDepth: 3, FieldScope: "rdn", // restrict to root domain BodyReadSize: math.MaxInt, Timeout: 10, Concurrency: 10, Parallelism: 10, Delay: 0, RateLimit: 150, Strategy: "depth-first", OnResult: func(result output.Result) { // Process each discovered URL println(result.Request.URL) }, } // Create and run the crawler crawlerOptions, _ := types.NewCrawlerOptions(options) defer crawlerOptions.Close() crawler, _ := standard.New(crawlerOptions) defer crawler.Close() // Start crawling _ = crawler.Crawl("https://example.com") } ```

Alternatives / Similar


Was this page helpful?