panther vs crwlr-crawler

panther: MIT · 200 · 4 · 2,977 · 8.8 thousand (month) · Jul 17 2018 · v2.2.0 (6 months ago)

crwlr-crawler: 356 · 2 · 2 · MIT · Apr 18 2022 · 21 (month) · v3.2.3 (6 months ago)

Panther is a convenient standalone library to scrape websites and to run end-to-end tests using real browsers.

Panther is super powerful. It leverages the W3C's WebDriver protocol to drive native web browsers such as Google Chrome and Firefox.

Panther is very easy to use because it implements Symfony's popular BrowserKit and DomCrawler APIs and contains all the features you need to test your apps. It will feel familiar if you have ever written a functional test for a Symfony app, as the API is exactly the same. Keep in mind that Panther can be used in any PHP project, as it is a standalone library.

Panther automatically finds your local installation of Chrome or Firefox and launches it, so you don't need to install anything else on your computer. A Selenium server is not needed!

In test mode, Panther automatically starts your application using the PHP built-in web-server. You can focus on writing your tests or web-scraping scenario and Panther will take care of everything else.

Features:

- executes the JavaScript code contained in webpages
- supports everything that Chrome (or Firefox) implements
- allows taking screenshots
- can wait for asynchronously loaded elements to show up
- lets you run your own JS code or XPath queries in the context of the loaded page
- supports custom Selenium server installations
- supports remote browser testing services including SauceLabs and BrowserStack

crwlr-crawler provides a kind of framework along with many ready-to-use building blocks, called steps, that you can combine to build your own crawlers and scrapers.
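To make the idea of chaining steps more concrete, here is a minimal sketch in plain PHP. It is not crwlr-crawler's implementation: the functions `loadStep`, `extractTitleStep` and `runPipeline` are hypothetical names invented for this illustration, and the "loading" step fabricates a body instead of making an HTTP request. It only shows how generator-based steps can pass items along one at a time.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical step: yields each input URL together with a fake body.
 * A real loading step would perform an HTTP request here instead.
 */
function loadStep(iterable $urls): Generator
{
    foreach ($urls as $url) {
        yield ['url' => $url, 'body' => "<title>Page at {$url}</title>"];
    }
}

/** Hypothetical step: extracts the <title> text from each loaded page. */
function extractTitleStep(iterable $pages): Generator
{
    foreach ($pages as $page) {
        if (preg_match('~<title>(.*?)</title>~s', $page['body'], $m) === 1) {
            yield ['url' => $page['url'], 'title' => $m[1]];
        }
    }
}

/** Chains the steps; items are only produced when the result is iterated. */
function runPipeline(iterable $input, callable ...$steps): Generator
{
    foreach ($steps as $step) {
        $input = $step($input);
    }
    yield from $input;
}

$results = runPipeline(
    ['https://example.com/a', 'https://example.com/b'],
    'loadStep',
    'extractTitleStep'
);

foreach ($results as $result) {
    echo $result['url'], ' => ', $result['title'], PHP_EOL;
}
```

Because every step is a generator, only one item is held in memory at a time, no matter how many URLs are fed in.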

Some features:

- Crawler politeness (respecting robots.txt, throttling, ...)
- Load URLs using
  - a (PSR-18) HTTP client (default is of course Guzzle)
  - or a headless browser (Chrome) to get source after JavaScript execution
- Get absolute links from HTML documents
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website
- Use cookies (or don't)
- Use any HTTP methods (GET, POST, ...) and send any headers or body
- Iterate over paginated list pages
- Extract data from:
  - HTML and also XML (using CSS selectors or XPath queries)
  - JSON (using dot notation)
  - CSV (map columns)
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
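As an illustration of what extracting JSON data "using dot notation" means, the following sketch resolves paths such as `author.name` against decoded JSON. The helper `getByDotNotation` is a hypothetical function written for this example, not part of crwlr-crawler, and the sample JSON is made up.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: resolves a dot-notation path such as "author.name"
 * or "tags.1" against decoded JSON data. Returns $default when any
 * segment of the path is missing.
 */
function getByDotNotation(array $data, string $path, mixed $default = null): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return $default;
        }
        $current = $current[$segment];
    }

    return $current;
}

// Made-up sample document for the illustration
$json = '{"title": "Hello", "author": {"name": "Jane"}, "tags": ["php", "scraping"]}';
$data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

echo getByDotNotation($data, 'author.name'), PHP_EOL; // nested object key
echo getByDotNotation($data, 'tags.1'), PHP_EOL;      // numeric segment addresses a list index
var_dump(getByDotNotation($data, 'author.email'));    // missing key falls back to the default
```

Each path segment descends one level, so nested objects and list indexes are addressed with the same syntax.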

Example Use

```php
<?php

use Symfony\Component\Panther\Client;

require __DIR__.'/vendor/autoload.php'; // Composer's autoloader

$client = Client::createChromeClient();
// Or, if you care about the open web and prefer to use Firefox
// (use one or the other; creating both would launch two browsers):
// $client = Client::createFirefoxClient();

$client->request('GET', 'https://api-platform.com'); // Yes, this website is 100% written in JavaScript
$client->clickLink('Get started');

// Wait for an element to be present in the DOM (even if hidden)
$crawler = $client->waitFor('#installing-the-framework');
// Alternatively, wait for an element to be visible
$crawler = $client->waitForVisibility('#installing-the-framework');

echo $crawler->filter('#installing-the-framework')->text();
$client->takeScreenshot('screen.png'); // Yeah, screenshot!
```

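```php
<?php
require_once 'vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Sketch assuming the step-based API of crwlr/crawler v1 or later
$crawler = HttpCrawler::make()->withBotUserAgent('webscraping.fyi');

$crawler->input('https://example.com')
    ->addStep(Http::get())                                  // load the page
    // more links can be followed by adding:
    // ->addStep(Html::getLinks())->addStep(Http::get())
    ->addStep(Html::root()->extract(['title' => 'title'])); // parse the current page

// run() returns a generator that yields one result per extracted item
foreach ($crawler->run() as $result) {
    echo $result->get('title'), PHP_EOL;
}
```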

Alternatives / Similar

- colly (23,747)
- pholcus (7,580)
- playwright (69,451)
- geziyor (2,667)
- selenium (31,604)
- dataflowkit (676)
- playwright (12,131)
- undetected-chromedriver (10,683)
- scrapy (54,211)
- rvest (1,498)
- requestium (1,834)
- chromedp (11,412)
- gocrawl (2,039)
- ferret (5,716)
- scrapyd (2,980)
- node-crawler (6,733)
- selenium-driverless (718)
- autoscraper (6,638)
- gracy (247)
- spidr (813)
- scrapydweb (3,218)
- gerapy (3,365)
- wombat (1,316)
- splash (4,122)
- ruia (1,754)
- photon (11,149)
- ralger (156)
- roach (1,384)
- dude (428)
- ayakashi (213)
- phpscraper (554)
- php-spider (1,335)
- crwlr-crawler (356)