
Scrapy vs Panther

Scrapy: BSD · 652 · 30 · 50,703 · 1.6 million (month) · Jul 26 2019 · 2.11.1 (a month ago)
Panther: MIT · 195 · 4 · 2,878 · 8.9 thousand (month) · Jul 17 2018 · v2.1.1 (4 months ago)

Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.

Scrapy provides:

  • A built-in way to follow links and extract data from multiple pages (crawling).
  • Built-in handling of common web scraping tasks such as logging in, managing cookies, and following redirects.

Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.
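As a rough illustration of how this concurrency is usually tuned, the fragment below shows a few entries from a project's settings.py. The setting names are Scrapy's own; the values are illustrative assumptions, not recommendations.

```python
# settings.py (sketch): values are illustrative assumptions, not recommendations.

# Upper bound on requests Scrapy keeps in flight at once across all sites.
CONCURRENT_REQUESTS = 32

# Upper bound on simultaneous requests to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Minimum number of seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.25
```

Lower per-domain limits and a non-zero delay trade crawl speed for a lighter load on the target site.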

It also comes with built-in mechanisms for handling common web scraping problems, such as:

  • HTTP errors
  • broken links
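A minimal sketch of how these mechanisms might be configured in settings.py, assuming the built-in retry and HTTP-error middlewares are left enabled. The setting names are Scrapy's; the chosen values are assumptions for illustration.

```python
# settings.py (sketch): values are illustrative assumptions.

# Retry requests that fail with a transient error.
RETRY_ENABLED = True
RETRY_TIMES = 3  # number of retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# By default, non-2xx responses are filtered out before they reach spider
# callbacks. Listing 404 here lets a callback see "not found" pages, for
# example to record broken links.
HTTPERROR_ALLOWED_CODES = [404]
```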

Scrapy also provides these features:

  • Support for storing scraped data in various formats, such as CSV, JSON, and XML.
  • Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
  • Built-in support for handling common crawling problems, such as request deduplication and URL filtering.
  • Ability to easily extend its functionality using middlewares.
  • Ability to easily extend output processing using pipelines.
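The last two points can be sketched as follows. Scrapy middlewares and pipelines are ordinary classes that implement hook methods such as process_request and process_item. The class names, the "title" field, the header value, and the myproject.extensions module path below are hypothetical and chosen only for illustration.

```python
class DefaultHeaderMiddleware:
    """Downloader middleware: Scrapy calls process_request for each outgoing request."""

    def process_request(self, request, spider):
        # setdefault keeps any header the spider already set on the request.
        request.headers.setdefault("Accept-Language", "en")
        return None  # None tells Scrapy to continue processing the request


class CleanTitlePipeline:
    """Item pipeline: Scrapy calls process_item for each scraped item."""

    def process_item(self, item, spider):
        # Collapse runs of whitespace in the (hypothetical) "title" field.
        title = item.get("title")
        if isinstance(title, str):
            item["title"] = " ".join(title.split())
        return item  # returning the item passes it on to the next pipeline


# settings.py, assuming both classes live in myproject/extensions.py.
# The integers set the order in which enabled components run.
DOWNLOADER_MIDDLEWARES = {"myproject.extensions.DefaultHeaderMiddleware": 543}
ITEM_PIPELINES = {"myproject.extensions.CleanTitlePipeline": 300}
```

Because both classes are plain Python objects, they can be exercised with stand-in request and item objects without running a crawl.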

Panther is a convenient standalone library to scrape websites and to run end-to-end tests using real browsers.

Panther is super powerful. It leverages the W3C's WebDriver protocol to drive native web browsers such as Google Chrome and Firefox.

Panther is very easy to use, because it implements Symfony's popular BrowserKit and DomCrawler APIs, and contains all the features you need to test your apps. It will feel familiar if you have ever created a functional test for a Symfony app, as the API is exactly the same! Keep in mind that Panther can be used in every PHP project, as it is a standalone library.

Panther automatically finds your local installation of Chrome or Firefox and launches it, so you don't need to install anything else on your computer: a Selenium server is not needed!

In test mode, Panther automatically starts your application using the PHP built-in web server. You can focus on writing your tests or web-scraping scenario and Panther will take care of everything else.

Features:

  • executes the JavaScript code contained in webpages
  • supports everything that Chrome (or Firefox) implements
  • allows taking screenshots
  • can wait for asynchronously loaded elements to show up
  • lets you run your own JS code or XPath queries in the context of the loaded page
  • supports custom Selenium server installations
  • supports remote browser testing services including SauceLabs and BrowserStack

Highlights


popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale

Example Use


<?php

use Symfony\Component\Panther\Client;

require __DIR__.'/vendor/autoload.php'; // Composer's autoloader

$client = Client::createChromeClient();
// Or, if you care about the open web and prefer to use Firefox:
// $client = Client::createFirefoxClient();

$client->request('GET', 'https://api-platform.com'); // Yes, this website is 100% written in JavaScript
$client->clickLink('Get started');

// Wait for an element to be present in the DOM (even if hidden)
$crawler = $client->waitFor('#installing-the-framework');
// Alternatively, wait for an element to be visible
$crawler = $client->waitForVisibility('#installing-the-framework');

echo $crawler->filter('#installing-the-framework')->text();
$client->takeScreenshot('screen.png'); // Yeah, screenshot!

Alternatives / Similar