Skip to content

phpscrapervscrwlr-crawler

GPL-3.0-or-later 21 2 472
127 (month) May 04 2020 2.0.0(8 months ago)
291 2 1 MIT
v1.6.1(3 days ago) Apr 18 2022 13 (month)

PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. Instead, you can just go to a website and get the relevant information for your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

  • Direct access to page basic features like: Meta data, Links, Images, Headings, Content, Keywords etc.
  • File downloading.
  • RSS, Sitemap and other feed processing.
  • CSV, XML and JSON file processing.

This library provides kind of a framework and a lot of ready to use, so-called steps, that you can use as building blocks, to build your own crawlers and scrapers with.

Some features: - Crawler Politeness innocent (respecting robots.txt, throttling,...) - Load URLs using - a (PSR-18) HTTP client (default is of course Guzzle) - or a headless browser (chrome) to get source after Javascript execution - Get absolute links from HTML documents link - Get sitemaps from robots.txt and get all URLs from those sitemaps - Crawl (load) all pages of a website spider - Use cookies (or don't) cookie - Use any HTTP methods (GET, POST,...) and send any headers or body - Iterate over paginated list pages repeat - Extract data from: - HTML and also XML (using CSS selectors or XPath queries) - JSON (using dot notation) - CSV (map columns) - Extract schema.org structured data in JSON-LD format from HTML documents - Keep memory usage low by using PHP Generators muscle - Cache HTTP responses during development, so you don't have to load pages again and again after every code change - Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)

Example Use


// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;
// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');

// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"

// or pre-defined variables covering basic page data:
$web->links;  // for all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline;  // basic page outline
$web->cleanOutlineWithParagraphs;  // basic page outline
<?php
require_once 'vendor/autoload.php';

use Crwlr\Crawler;

$crawler = new Crawler();
$crawler->get('https://example.com', ['User-Agent' => 'webscraping.fyi']);


// more links can be followed:
$crawler->followLinks();

// and current page can be parsed:
$response = $crawler->response();
$title = $crawler->filter('title')->text();
echo $response->getContent();
```

Alternatives / Similar