

crwlr/crawler: MIT license, latest release v1.7.2 (2 months ago)
PHPScraper: GPL-3.0-or-later license, latest release 3.0.0 (a month ago)

This library provides a framework and many ready-to-use, so-called steps that you can use as building blocks to build your own crawlers and scrapers.

Some features:

  • Crawler politeness (respecting robots.txt, throttling, ...)
  • Load URLs using
    • a (PSR-18) HTTP client (default is of course Guzzle)
    • or a headless browser (Chrome) to get the source after JavaScript execution
  • Get absolute links from HTML documents
  • Get sitemaps from robots.txt and get all URLs from those sitemaps
  • Crawl (load) all pages of a website
  • Use cookies (or don't)
  • Use any HTTP methods (GET, POST, ...) and send any headers or body
  • Iterate over paginated list pages
  • Extract data from:
    • HTML and also XML (using CSS selectors or XPath queries)
    • JSON (using dot notation)
    • CSV (map columns)
  • Extract structured data in JSON-LD format from HTML documents
  • Keep memory usage low by using PHP Generators
  • Cache HTTP responses during development, so you don't have to load pages again and again after every code change
  • Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
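To illustrate what the "JSON (using dot notation)" extraction style means, here is a small, self-contained sketch of dot-notation lookup in plain PHP. `dotGet` is a hypothetical helper written for this example, not the library's actual implementation or API:

```php
<?php
// Standalone sketch: walk a decoded JSON structure along a
// dot-separated path, e.g. "product.price.amount".
function dotGet(array $data, string $path)
{
    foreach (explode('.', $path) as $key) {
        if (!is_array($data) || !array_key_exists($key, $data)) {
            return null; // path does not exist
        }
        $data = $data[$key];
    }
    return $data;
}

$json = '{"product":{"name":"Widget","price":{"amount":9.99,"currency":"EUR"}}}';
$decoded = json_decode($json, true);

echo dotGet($decoded, 'product.price.amount'); // 9.99
```

The same idea generalizes to arrays of objects; the library layers its extraction steps on top of lookups like this.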

PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. Instead, you can just go to a website and get the relevant information for your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.


  • Direct access to basic page features like: meta data, links, images, headings, content, keywords, etc.
  • File downloading.
  • RSS, Sitemap and other feed processing.
  • CSV, XML and JSON file processing.
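The CSV processing mentioned above essentially maps rows onto header columns. A standalone sketch of that idea in plain PHP follows; `csvToRecords` is a hypothetical helper for illustration, not PHPScraper's API:

```php
<?php
// Standalone sketch: turn CSV text with a header row into an
// array of associative records keyed by column name.
function csvToRecords(string $csv): array
{
    $lines  = array_map('str_getcsv', array_filter(explode("\n", trim($csv))));
    $header = array_shift($lines); // first row holds the column names
    return array_map(fn ($row) => array_combine($header, $row), $lines);
}

$csv = "name,price\nWidget,9.99\nGadget,19.50";
$records = csvToRecords($csv);

echo $records[1]['price']; // 19.50
```

In practice the library fetches the file for you and exposes the parsed result, but the column-mapping step looks much like this.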

Example Use

require_once 'vendor/autoload.php';

use Crwlr\Crawler;

$crawler = new Crawler();
$crawler->get('', ['User-Agent' => '']);

// more links can be followed:

// and current page can be parsed:
$response = $crawler->response();
$title = $crawler->filter('title')->text();
echo $response->getContent();
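For comparison with the step-based API described in the intro, a minimal pipeline might look like the sketch below. The class and step names (`HttpCrawler`, `Http::get()`, `Html::getLinks()`, `BotUserAgent`) follow the crwlr documentation, but treat the exact signatures as assumptions and check them against your installed version:

```php
<?php

require_once 'vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

$crawler = new MyCrawler();

// Steps are the building blocks: load a page, then extract absolute links.
$crawler->input('https://example.com')
    ->addStep(Http::get())
    ->addStep(Html::getLinks());

// run() returns a Generator, which keeps memory usage low.
foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```

This sketch requires the composer package and network access, so it is not runnable standalone.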

// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;
// go to URL ('https://example.com' is a placeholder)
$web->go('https://example.com');
// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"

// or pre-defined variables covering basic page data:
$web->links;  // for all links
$web->outline;  // basic page outline
$web->cleanOutlineWithParagraphs;  // cleaned page outline including paragraphs

Alternatives / Similar
