geziyorvsphp-spider
Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.
Features:
- JS Rendering
- 5.000+ Requests/Sec
- Caching (Memory/Disk/LevelDB)
- Automatic Data Exporting (JSON, CSV, or custom)
- Metrics (Prometheus, Expvar, or custom)
- Limit Concurrency (Global/Per Domain)
- Request Delays (Constant/Randomized)
- Cookies, Middlewares, robots.txt
- Automatic response decoding to UTF-8
- Proxy management (Single, Round-Robin, Custom)
php-spider is a PHP library for web crawling and scraping. It allows developers to easily navigate and extract data from websites by simulating a web browser's behavior.
- supports two traversal algorithms: breadth-first and depth-first
- supports crawl depth limiting, queue size limiting and max downloads limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as Domain limiting
- supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
- supports custom request handling logic
- supports Basic, Digest and NTLM HTTP authentication. See example.
- comes with a useful set of persistence handlers (memory, file)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy
This Spider does not support Javascript.
Example Use
// This example extracts all quotes from quotes.toscrape.com and exports to JSON file.
func main() {
geziyor.NewGeziyor(&geziyor.Options{
StartURLs: []string{"http://quotes.toscrape.com/"},
ParseFunc: quotesParse,
Exporters: []export.Exporter{&export.JSON{}},
}).Start()
}
func quotesParse(g *geziyor.Geziyor, r *client.Response) {
r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
g.Exports <- map[string]interface{}{
"text": s.Find("span.text").Text(),
"author": s.Find("small.author").Text(),
}
})
if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
g.Get(r.JoinURL(href), quotesParse)
}
}
use Example\StatsHandler;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use Symfony\Contracts\EventDispatcher\Event;
use VDB\Spider\Event\SpiderEvents;
use VDB\Spider\Spider;
require_once('example_complex_bootstrap.php');
// Create Spider
$spider = new Spider('http://dmoztools.net');
// Add a URI discoverer. Without it, the spider does nothing. In this case, we want <a> tags from a certain <div>
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));
// Set some sane options for this example. In this case, we only get the first 10 items from the start page.
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;
// Let's add something to enable us to stop the script
$spider->getDispatcher()->addListener(
SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
function (Event $event) {
echo "\nCrawl aborted by user.\n";
exit();
}
);
// Add a listener to collect stats to the Spider and the QueueMananger.
// There are more components that dispatch events you can use.
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);
// Execute crawl
$spider->crawl();
// Build a report
echo "\n ENQUEUED: " . count($statsHandler->getQueued());
echo "\n SKIPPED: " . count($statsHandler->getFiltered());
echo "\n FAILED: " . count($statsHandler->getFailed());
echo "\n PERSISTED: " . count($statsHandler->getPersisted());
// Finally we could do some processing on the downloaded resources
// In this example, we will echo the title of all resources
echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}