Crawlee vs Kimurai
Crawlee is a modern web scraping and browser automation framework for JavaScript and TypeScript, built by Apify. It is the successor to the Apify SDK and provides a unified interface for building reliable web scrapers and crawlers that can scale from simple scripts to large-scale data extraction projects.
Crawlee supports multiple crawling strategies through different crawler classes:
- CheerioCrawler: fast, lightweight HTML scraping using Cheerio (no browser needed). Best for static pages.
- PlaywrightCrawler: uses Playwright for full browser automation. Handles JavaScript-rendered pages, SPAs, and complex interactions.
- PuppeteerCrawler: similar to PlaywrightCrawler but uses Puppeteer as the browser automation backend.
- HttpCrawler: a minimal crawler for raw HTTP requests without HTML parsing.
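As a rough illustration, a minimal CheerioCrawler setup looks something like the sketch below (assuming the crawlee package is installed; the URL and selector are placeholders, and the snippet is not executed here since it makes live requests):

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

// https://example.com and the title selector below are placeholders.
const crawler = new CheerioCrawler({
    maxConcurrency: 10,    // run up to 10 requests in parallel
    maxRequestRetries: 3,  // retry failed requests automatically
    async requestHandler({ request, $, enqueueLinks }) {
        // $ is a Cheerio handle over the fetched, parsed HTML.
        await Dataset.pushData({ url: request.url, title: $('title').text() });
        await enqueueLinks(); // follow links discovered on the page
    },
});

await crawler.run(['https://example.com']);
```

Swapping CheerioCrawler for PlaywrightCrawler keeps the same overall shape, but the handler receives a live `page` object instead of a Cheerio handle.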
Key features include:
- Automatic request queue management with configurable concurrency and rate limiting
- Built-in proxy rotation with session management
- Persistent request queue and dataset storage (local or cloud via Apify)
- Automatic retry and error handling with configurable strategies
- TypeScript-first design with full type safety
- Middleware-like request/response hooks (preNavigationHooks, postNavigationHooks)
- Output pipelines for storing extracted data
- Easy deployment to Apify cloud platform
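The queue, concurrency, and retry behavior in the feature list can be sketched in plain TypeScript (a simplified illustration, not Crawlee's actual internals; `crawlAll` and the fake fetcher are hypothetical names):

```typescript
// Simplified sketch of a request queue with bounded concurrency and
// per-request retries (an illustration, not Crawlee's real implementation).
type Fetcher = (url: string) => Promise<string>;

async function crawlAll(
  urls: string[],
  fetcher: Fetcher,
  opts: { concurrency: number; retries: number },
): Promise<Map<string, string>> {
  const queue = [...urls];
  const results = new Map<string, string>();

  // Each worker pulls URLs off the shared queue until it is empty.
  async function worker(): Promise<void> {
    for (let url = queue.shift(); url !== undefined; url = queue.shift()) {
      for (let attempt = 0; attempt <= opts.retries; attempt++) {
        try {
          results.set(url, await fetcher(url));
          break; // success: stop retrying this URL
        } catch {
          // fall through and retry until attempts are exhausted
        }
      }
    }
  }

  // A fixed pool of workers enforces the concurrency limit.
  await Promise.all(Array.from({ length: opts.concurrency }, () => worker()));
  return results;
}

// Demo with a fake fetcher whose second URL fails on the first attempt.
let failures = 0;
const fakeFetch: Fetcher = async (url) => {
  if (url.endsWith('/2') && failures++ === 0) throw new Error('timeout');
  return `body of ${url}`;
};

const pages = await crawlAll(
  ['https://a.test/1', 'https://a.test/2', 'https://a.test/3'],
  fakeFetch,
  { concurrency: 2, retries: 2 },
);
```

In the demo, the flaky URL still ends up in the results because the worker retries it, which is the behavior Crawlee gives you out of the box.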
Crawlee is widely regarded as the most feature-complete web scraping framework in the JavaScript/TypeScript ecosystem, comparable to Python's Scrapy but with native browser automation support.
Kimurai is a modern web scraping framework for Ruby, inspired by Python's Scrapy. It provides a structured approach to building web scrapers with built-in support for multiple browser engines, session management, and data pipelines.
Key features include:
- Multiple engine support: Mechanize for simple HTTP requests, Selenium with headless Chrome/Firefox for JavaScript-rendered pages, or Poltergeist (PhantomJS) for lightweight rendering, depending on the scraping needs.
- Scrapy-like architecture: follows the spider pattern. Define a spider class with start URLs and parsing methods, and the framework handles crawling, scheduling, and data collection.
- Built-in data pipelines: save scraped data to JSON, CSV, or custom formats with configurable output pipelines.
- Session management: maintains browser sessions with automatic cookie handling and configurable delays between requests.
- Request scheduling: built-in request queue with configurable concurrency, delays, and retry logic.
- CLI tools: command-line tools for generating new spiders, running individual spiders, and managing scraping projects.
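The spider pattern described above can be sketched in plain Ruby (a simplified illustration, not Kimurai's actual classes; BaseSpider, DemoSpider, and the fake fetcher are hypothetical names):

```ruby
# Plain-Ruby sketch of the spider pattern: a spider declares start URLs
# and parse methods, and a small scheduler queue drives the crawl.
class BaseSpider
  class << self
    attr_accessor :start_urls
  end

  attr_reader :items

  def initialize(fetcher)
    @fetcher = fetcher # callable: url -> response body
    @queue   = self.class.start_urls.map { |url| [:parse, url] }
    @items   = []
  end

  # Parse methods call this to schedule follow-up requests.
  def request_to(method_name, url:)
    @queue << [method_name, url]
  end

  def run
    while (job = @queue.shift)
      method_name, url = job
      send(method_name, @fetcher.call(url), url: url)
    end
    @items
  end
end

# Example spider: stores each fetched page as an item and follows one
# hypothetical detail link from the start page.
class DemoSpider < BaseSpider
  self.start_urls = ["https://example.test/"]

  def parse(response, url:)
    @items << { url: url, body: response }
    request_to(:parse_detail, url: "#{url}detail")
  end

  def parse_detail(response, url:)
    @items << { url: url, body: response }
  end
end

fake_fetcher = ->(url) { "<html>#{url}</html>" }
items = DemoSpider.new(fake_fetcher).run
```

Real Kimurai spiders follow this same shape, with the framework supplying the scheduling, the browser engine behind the fetch, and Nokogiri-parsed responses.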
Kimurai is the closest Ruby equivalent to Scrapy. It's well-suited for structured scraping projects that need organization, multiple spiders, and data pipeline processing.
Note: Kimurai has not seen active development in recent years, but it remains useful for Ruby scraping projects and is still the most complete Ruby scraping framework available.