crwlr-crawler vs phpscraper

crwlr-crawler: MIT · 1 · 2 · 311 · 18 (month) · Apr 18 2022 · v1.9.3 (10 days ago)

phpscraper: 506 · 2 · 22 · GPL-3.0-or-later · May 04 2020 · 104 (month) · 3.0.0 (3 months ago)

crwlr-crawler provides a framework and many ready-to-use building blocks, called steps, that you can combine to build your own crawlers and scrapers.

Some features:

- Crawler politeness (respecting robots.txt, throttling, ...)
- Load URLs using
  - a (PSR-18) HTTP client (the default is Guzzle)
  - or a headless browser (Chrome) to get the source after JavaScript execution
- Get absolute links from HTML documents
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website
- Use cookies (or don't)
- Use any HTTP method (GET, POST, ...) and send any headers or body
- Iterate over paginated list pages
- Extract data from:
  - HTML and also XML (using CSS selectors or XPath queries)
  - JSON (using dot notation)
  - CSV (map columns)
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
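To make the "steps as building blocks" and "PHP Generators" points more concrete, here is a minimal plain-PHP sketch of a lazy pipeline. It does not use crwlr-crawler itself: the functions `urls()` and `mapStep()` are hypothetical names chosen for illustration, and the example only shows the general idea of passing items from one step to the next without holding them all in memory.

```php
<?php
// Plain-PHP illustration only. urls() and mapStep() are hypothetical
// helpers and are not part of crwlr-crawler.

/**
 * Yields one absolute URL at a time instead of building a full array.
 */
function urls(array $paths, string $base): Generator
{
    foreach ($paths as $path) {
        yield rtrim($base, '/') . '/' . ltrim($path, '/');
    }
}

/**
 * A "step": lazily consumes its input and yields transformed items.
 */
function mapStep(iterable $input, callable $fn): Generator
{
    foreach ($input as $item) {
        yield $fn($item);
    }
}

// Chain two steps; nothing runs until the foreach pulls items through.
$pipeline = mapStep(
    urls(['/a', 'b', '/c'], 'https://example.com/'),
    fn (string $url): array => ['url' => $url, 'length' => strlen($url)]
);

foreach ($pipeline as $row) {
    echo $row['url'], ' (', $row['length'], ")\n";
}
```

Because each step is a generator, only one item is in flight at a time, which is the property the feature list refers to when it mentions low memory usage.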

PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. Instead, you can just go to a website and get the relevant information for your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

- Direct access to basic page features such as meta data, links, images, headings, content, and keywords.
- File downloading.
- Processing of RSS feeds, sitemaps, and other feeds.
- Processing of CSV, XML, and JSON files.
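As a rough idea of what "direct access to basic page features" replaces, the following sketch collects a title, headings, and links by hand with PHP's built-in DOM extension. It is not PHPScraper's implementation; the HTML string and the `$page` array layout are made up for illustration.

```php
<?php
// Illustration with PHP's built-in DOM extension only; this is not
// PHPScraper's implementation. The HTML below is a made-up sample.
$html = <<<HTML
<html><head><title>Sample page</title></head>
<body>
  <h1>Main heading</h1>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body></html>
HTML;

$dom = new DOMDocument();
// Suppress warnings that libxml raises for imperfect real-world markup.
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the kind of basic page data named in the feature list.
$page = [
    'title'    => trim($xpath->evaluate('string(//title)')),
    'headings' => [],
    'links'    => [],
];
foreach ($xpath->query('//h1|//h2|//h3') as $node) {
    $page['headings'][] = trim($node->textContent);
}
foreach ($xpath->query('//a[@href]') as $node) {
    $page['links'][] = $node->getAttribute('href');
}

print_r($page);
```

A scraper framework wraps this kind of boilerplate behind ready-made accessors, so the calling code does not have to deal with selectors or data conversion directly.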

Example Use

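```php
<?php
require_once 'vendor/autoload.php';

// Sketch based on crwlr-crawler's step-based API; verify class and
// method names against the docs for your installed version.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// create a crawler that identifies itself with a bot user agent
$crawler = HttpCrawler::make()->withBotUserAgent('webscraping.fyi');

// load the start page, follow the links found on it,
// and parse the title of every linked page:
$crawler
    ->input('https://example.com')
    ->addStep(Http::get())
    ->addStep(Html::getLinks())
    ->addStep(Http::get())
    ->addStep(Html::root()->extract(['title' => 'title']));

// results are yielded one by one via a generator:
foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```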

```php
// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;
// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');

// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"

// or pre-defined variables covering basic page data:
$web->links;  // for all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline;  // basic page outline
$web->cleanOutlineWithParagraphs;  // page outline including paragraphs
```

Alternatives / Similar

- colly (22,459)
- pholcus (7,554)
- geziyor (2,488)
- dataflowkit (651)
- scrapy (51,636)
- rvest (1,485)
- ferret (5,663)
- gocrawl (2,029)
- node-crawler (6,650)
- scrapyd (2,884)
- panther (2,907)
- autoscraper (6,060)
- spidr (795)
- wombat (1,308)
- scrapydweb (3,048)
- ralger (153)
- gracy (240)
- gerapy (3,257)
- ruia (1,737)
- roach (1,337)
- photon (10,672)
- phpscraper (506)
- php-spider (1,328)
- ayakashi (200)
- dude (413)