Gerapy: MIT license, latest version 0.9.13 (8 months ago)
php-spider: MIT license, latest version v0.7.2 (3 months ago)

Gerapy is a distributed crawler management framework based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js.

It provides a web-based interface for running Scrapy scraping tasks, with support for scheduling and distributed crawling and a built-in dashboard for monitoring and managing those tasks. Gerapy is also designed to be extensible, so users can create custom plugins and integrations.

Overall, Gerapy is a useful tool for those looking to automate web scraping tasks and extract data from websites.

php-spider is a PHP library for web crawling and scraping. It allows developers to easily navigate and extract data from websites by simulating a web browser's behavior.

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as Domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • supports Basic, Digest and NTLM HTTP authentication
  • comes with a useful set of persistence handlers (memory, file)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy
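
To make the first two features concrete, here is a minimal, language-agnostic sketch in Python of how a crawl frontier can switch between breadth-first and depth-first traversal while honouring a depth limit and a queue size limit. It is not php-spider's implementation: the function `crawl_order`, its parameters, and the in-memory `links` map are hypothetical, and treating the queue limit as a cap on the total number of URIs ever enqueued is an assumption.

```python
from collections import deque


def crawl_order(links, start, strategy="bfs", max_depth=1, max_queue_size=10):
    """Return the order in which URIs would be visited.

    links: dict mapping a URI to the URIs discovered on that page
           (a hypothetical in-memory stand-in for real HTTP fetches).
    """
    frontier = deque([(start, 0)])
    seen = {start}  # every URI ever enqueued; counts toward max_queue_size
    visited = []

    while frontier:
        # Breadth-first takes the oldest entry (FIFO),
        # depth-first takes the newest entry (LIFO).
        if strategy == "bfs":
            uri, depth = frontier.popleft()
        else:
            uri, depth = frontier.pop()
        visited.append(uri)

        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page

        for child in links.get(uri, []):
            if child in seen:
                continue  # already enqueued once
            if len(seen) >= max_queue_size:
                break  # queue size limit reached
            seen.add(child)
            frontier.append((child, depth + 1))

    return visited


# A tiny hypothetical site: a links to b and c, b links to d, c links to e.
site = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
bfs_order = crawl_order(site, "a", strategy="bfs", max_depth=2)
dfs_order = crawl_order(site, "a", strategy="dfs", max_depth=2)
```

With `max_depth=1` the sketch visits only the start page and the pages it links to, which mirrors the intent of the `maxDepth = 1` setting in the example further down.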

This spider does not support JavaScript.

Example Use

use Example\StatsHandler;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use Symfony\Contracts\EventDispatcher\Event;
use VDB\Spider\Event\SpiderEvents;
use VDB\Spider\Spider;


// Create Spider
$spider = new Spider('');

// Add a URI discoverer. Without it, the spider does nothing. In this case, we want <a> tags from a certain <div>
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));

// Set some sane options for this example. In this case, we only get the first 10 items from the start page.
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

// Let's add something to enable us to stop the script
$spider->getDispatcher()->addListener(
    SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
    function (Event $event) {
        echo "\nCrawl aborted by user.\n";
        exit();
    }
);

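// Add a listener to collect stats to the Spider and the QueueManager.
// There are more components that dispatch events you can use.
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);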

// Execute crawl
$spider->crawl();

// Build a report
echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED:    " . count($statsHandler->getPersisted());

// Finally we could do some processing on the downloaded resources
// In this example, we will echo the title of all resources
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXPath('//title')->text();
}

Alternatives / Similar