
ralger vs php-spider

MIT 3 1 165
327 (month) Dec 22 2019 2.3.0(2021-03-18 00:10:00 ago)
1,341 3 1 MIT
Mar 16 2013 53 (month) v0.7.6(2025-12-04 15:08:06 ago)

ralger is a small web scraping framework for R based on rvest and xml2.

Its goal is to simplify basic web scraping, and it provides a convenient, easy-to-use API.

It offers functions for retrieving pages, parsing HTML with CSS selectors, parsing tables automatically, and extracting links, titles, images and paragraphs.

php-spider is a PHP library for web crawling and scraping. It allows developers to easily navigate and extract data from websites by simulating a web browser's behavior.

  • supports two traversal algorithms: breadth-first and depth-first
  • supports crawl depth limiting, queue size limiting and max downloads limiting
  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
  • comes with a useful set of URI filters, such as Domain limiting
  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
  • supports custom request handling logic
  • supports Basic, Digest and NTLM HTTP authentication. See example.
  • comes with a useful set of persistence handlers (memory, file)
  • supports custom persistence handlers
  • collects statistics about the crawl for reporting
  • dispatches useful events, allowing developers to add even more custom behavior
  • supports a politeness policy
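To make the traversal and limiting options above concrete, here is a minimal, language-neutral sketch in Python. It does not use php-spider's API: the `LINKS` graph, the `same_host_filter` helper and the `crawl` function with its parameters are hypothetical names invented for illustration, and the "download" step is simulated by a lookup in an in-memory link map. It shows how a single frontier can give breadth-first or depth-first order, and where a depth limit, a download limit and a prefetch URI filter would plausibly apply.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": each URI maps to the URIs it links to.
LINKS = {
    "http://a.test/": ["http://a.test/1", "http://a.test/2", "http://b.test/"],
    "http://a.test/1": ["http://a.test/1a"],
    "http://a.test/2": ["http://a.test/2a"],
    "http://a.test/1a": [],
    "http://a.test/2a": [],
    "http://b.test/": [],
}


def same_host_filter(seed):
    """Build a prefetch filter that accepts only URIs on the seed's host."""
    host = urlparse(seed).netloc
    return lambda uri: urlparse(uri).netloc == host


def crawl(seed, strategy="bfs", max_depth=2, max_downloads=10, uri_filter=None):
    """Return URIs in the order they would be fetched."""
    if strategy not in ("bfs", "dfs"):
        raise ValueError("strategy must be 'bfs' or 'dfs'")
    frontier = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_downloads:
        # A queue (pop from the left) gives breadth-first order,
        # a stack (pop from the right) gives depth-first order.
        uri, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        visited.append(uri)  # stand-in for actually downloading the page
        if depth >= max_depth:
            continue  # depth limit: do not discover links below this level
        children = [
            child
            for child in LINKS.get(uri, [])
            if child not in seen and (uri_filter is None or uri_filter(child))
        ]
        if strategy == "dfs":
            children.reverse()  # so the first link on the page is explored first
        for child in children:
            seen.add(child)
            frontier.append((child, depth + 1))
    return visited


seed = "http://a.test/"
on_host = same_host_filter(seed)
print(crawl(seed, "bfs", uri_filter=on_host))
print(crawl(seed, "dfs", uri_filter=on_host))
```

With the same-host filter in place, the `http://b.test/` link is never queued. The breadth-first run visits both depth-1 pages before any depth-2 page, while the depth-first run follows the first link down to its leaf before returning to the second.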

This Spider does not support JavaScript.

Example Use


```r
library("ralger")

url <- "http://www.shanghairanking.com/rankings/arwu/2021"

# retrieve HTML and select elements using CSS selectors:
best_uni <- scrap(link = url, node = "a span", clean = TRUE)
head(best_uni, 5)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"

# ralger can also parse HTML attributes
attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the a tag
  attr = "class" # getting the class attribute
)

head(attributes, 10) # NA values are a tags without a class attribute
#>  [1] "navbar-brand logo" "nav-link"          NA
#>  [4] NA                  NA                  "nav-link"
#>  [7] NA                  "nav-link"          NA
#> [10] NA

# ralger can automatically scrape tables:
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#> 1     1 Avatar                                     $2,847,397,339    2009
#> 2     2 Avengers: Endgame                          $2,797,501,328    2019
#> 3     3 Titanic                                    $2,201,647,264    1997
#> 4     4 Star Wars: Episode VII - The Force Awakens $2,069,521,700    2015
#> 5     5 Avengers: Infinity War                     $2,048,359,754    2018
#> 6     6 Spider-Man: No Way Home                    $1,901,216,740    2021
```
```php
use Example\StatsHandler;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use Symfony\Contracts\EventDispatcher\Event;
use VDB\Spider\Event\SpiderEvents;
use VDB\Spider\Spider;

require_once('example_complex_bootstrap.php');

// Create Spider
$spider = new Spider('http://dmoztools.net');

// Add a URI discoverer. Without it, the spider does nothing.
// In this case, we want <a> tags from a certain <div>
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));

// Set some sane options for this example.
// In this case, we only get the first 10 items from the start page.
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

// Let's add something to enable us to stop the script
$spider->getDispatcher()->addListener(
    SpiderEvents::SPIDER_CRAWL_USER_STOPPED,
    function (Event $event) {
        echo "\nCrawl aborted by user.\n";
        exit();
    }
);

// Add a listener to collect stats to the Spider and the QueueManager.
// There are more components that dispatch events you can use.
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);

// Execute crawl
$spider->crawl();

// Build a report
echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED: " . count($statsHandler->getPersisted());

// Finally we could do some processing on the downloaded resources
// In this example, we will echo the title of all resources
echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXpath('//title')->text();
}
```


