Skip to content

crawleevsphp-spider

Apache-2.0 175 26 22,720
341.9 thousand (month) Apr 22 2022 3.16.0(2026-04-09 07:36:53 ago)
1,341 3 1 MIT
Mar 16 2013 53 (month) v0.7.6(2025-12-04 15:08:06 ago)

Crawlee is a modern web scraping and browser automation framework for JavaScript and TypeScript, built by Apify. It is the successor to the Apify SDK and provides a unified interface for building reliable web scrapers and crawlers that can scale from simple scripts to large-scale data extraction projects.

Crawlee supports multiple crawling strategies through different crawler classes:

  • CheerioCrawler For fast, lightweight HTML scraping using Cheerio (no browser needed). Best for static pages.
  • PlaywrightCrawler Uses Playwright for full browser automation. Handles JavaScript-rendered pages, SPAs, and complex interactions.
  • PuppeteerCrawler Similar to PlaywrightCrawler but uses Puppeteer as the browser automation backend.
  • HttpCrawler Minimal crawler for raw HTTP requests without HTML parsing.

Key features include:

  • Automatic request queue management with configurable concurrency and rate limiting
  • Built-in proxy rotation with session management
  • Persistent request queue and dataset storage (local or cloud via Apify)
  • Automatic retry and error handling with configurable strategies
  • TypeScript-first design with full type safety
  • Middleware-like request/response hooks (preNavigationHooks, postNavigationHooks)
  • Output pipelines for storing extracted data
  • Easy deployment to Apify cloud platform

Crawlee is considered the most feature-complete web scraping framework in the JavaScript/TypeScript ecosystem, comparable to Python's Scrapy but with native browser automation support.

php-spider is a PHP library for web crawling and scraping. It allows developers to easily navigate and extract data from websites by simulating a web browser's behavior.

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as Domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • supports Basic, Digest and NTLM HTTP authentication. See example.
  • comes with a useful set of persistence handlers (memory, file)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy

This Spider does not support Javascript.

Highlights


populartypescriptextendiblemiddlewaresoutput-pipelineslarge-scaleproxy

Example Use


```javascript import { PlaywrightCrawler, Dataset } from 'crawlee'; // Create a crawler with Playwright for JS rendering const crawler = new PlaywrightCrawler({ // Limit concurrency to avoid overwhelming the target maxConcurrency: 5, // This function is called for each URL async requestHandler({ request, page, enqueueLinks }) { const title = await page.title(); // Extract data from the page const products = await page.$$eval('.product', (els) => els.map((el) => ({ name: el.querySelector('.name')?.textContent, price: el.querySelector('.price')?.textContent, })) ); // Store extracted data await Dataset.pushData({ url: request.url, title, products, }); // Follow links to crawl more pages await enqueueLinks({ globs: ['https://example.com/products/**'], }); }, }); // Start crawling await crawler.run(['https://example.com/products']); ```
```php use Example\StatsHandler; use VDB\Spider\Discoverer\XPathExpressionDiscoverer; use Symfony\Contracts\EventDispatcher\Event; use VDB\Spider\Event\SpiderEvents; use VDB\Spider\Spider; require_once('example_complex_bootstrap.php'); // Create Spider $spider = new Spider('http://dmoztools.net'); // Add a URI discoverer. Without it, the spider does nothing. In this case, we want tags from a certain
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a")); // Set some sane options for this example. In this case, we only get the first 10 items from the start page. $spider->getDiscovererSet()->maxDepth = 1; $spider->getQueueManager()->maxQueueSize = 10; // Let's add something to enable us to stop the script $spider->getDispatcher()->addListener( SpiderEvents::SPIDER_CRAWL_USER_STOPPED, function (Event $event) { echo "\nCrawl aborted by user.\n"; exit(); } ); // Add a listener to collect stats to the Spider and the QueueMananger. // There are more components that dispatch events you can use. $statsHandler = new StatsHandler(); $spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler); $spider->getDispatcher()->addSubscriber($statsHandler); // Execute crawl $spider->crawl(); // Build a report echo "\n ENQUEUED: " . count($statsHandler->getQueued()); echo "\n SKIPPED: " . count($statsHandler->getFiltered()); echo "\n FAILED: " . count($statsHandler->getFailed()); echo "\n PERSISTED: " . count($statsHandler->getPersisted()); // Finally we could do some processing on the downloaded resources // In this example, we will echo the title of all resources echo "\n\nDOWNLOADED RESOURCES: "; foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) { echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text(); } ```

Alternatives / Similar


Was this page helpful?