phpscraper vs photon

phpscraper: GPL-3.0-or-later | 22 | 2 | 498 | 105 (month) | May 04 2020 | 3.0.0 (a month ago)
photon: 10,575 | 3 | 52 | MIT | Aug 24 2018 | 335 (month) | 1.1.9 (5 years ago)

PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. Instead, you can just go to a website and get the relevant information for your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

  • Direct access to basic page features such as meta data, links, images, headings, content, keywords, etc.
  • File downloading.
  • RSS, Sitemap and other feed processing.
  • CSV, XML and JSON file processing.

Photon is a Python library for web scraping, designed to be lightweight and fast. It can extract the following data while crawling:

  • URLs (in-scope & out-of-scope)
  • URLs with parameters (example.com/gallery.php?id=2)
  • Intel (emails, social media accounts, Amazon buckets, etc.)
  • Files (pdf, png, xml etc.)
  • Secret keys (auth/API keys & hashes)
  • JavaScript files & Endpoints present in them
  • Strings matching custom regex pattern
  • Subdomains & DNS related data

The extracted information is saved in an organized manner or can be exported as JSON.
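
To make a few of the categories above concrete, here is a minimal sketch of how links might be sorted into in-scope, out-of-scope, and parameterized URLs, and how emails might be picked up with a regex. It uses only the Python standard library and is not Photon's own implementation; `classify_page`, `HREF_RE`, and `EMAIL_RE` are hypothetical names, and matching `href` attributes with a regex is a simplification of real HTML parsing.

```python
import re
from urllib.parse import urljoin, urlparse

# Simplified patterns for illustration only (assumptions, not Photon's rules)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def classify_page(base_url, html):
    """Sort the links found in html by scope and collect email addresses."""
    base_host = urlparse(base_url).netloc
    result = {
        "in_scope": set(),
        "out_of_scope": set(),
        "with_params": set(),
        "emails": set(),
    }
    for href in HREF_RE.findall(html):
        url = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc == base_host:
            result["in_scope"].add(url)
        else:
            result["out_of_scope"].add(url)
        if parsed.query:
            result["with_params"].add(url)
    result["emails"].update(EMAIL_RE.findall(html))
    return result


sample = """
<a href="/gallery.php?id=2">Gallery</a>
<a href="https://other.example.org/about">About</a>
<a href="mailto:team@example.com">Mail</a>
"""
report = classify_page("https://example.com/", sample)
for category, values in report.items():
    print(category, sorted(values))
```

In this sketch a link counts as in-scope only when its host matches the start URL exactly; a real crawler would likely need a looser rule to treat subdomains as in-scope.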

Example Use


// PHPScraper (PHP); assumes the package was installed with Composer
require __DIR__ . '/vendor/autoload.php';

// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;
// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');

// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"

// or pre-defined variables covering basic page data:
$web->links;  // all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline;  // basic page outline
$web->cleanOutlineWithParagraphs;  // page outline including paragraphs

# Photon (Python)
# Method names are reproduced as shown on this page; check them against
# the photon package you have installed before relying on them.
from photon import Photon

# Create a new Photon instance
ph = Photon()

# Extract data from a specific element of the website
url = "https://www.example.com"
selector = "div.main"
data = ph.get_data(url, selector)

# Print the extracted data
print(data)

# Extract data from multiple websites asynchronously
urls = ["https://www.example1.com", "https://www.example2.com"]
data = ph.get_data_async(urls)
print(data)
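
The page does not show how the asynchronous call works internally. As a generic illustration of fetching several URLs concurrently, the following sketch uses only Python's standard `asyncio` module (Python 3.9+ for `asyncio.to_thread`). `fetch_all` and `fake_fetch` are hypothetical helpers, not part of Photon; `fake_fetch` stands in for a real HTTP request so the example runs without network access.

```python
import asyncio


async def fetch_all(urls, fetch):
    """Run the blocking fetch(url) for every URL concurrently in worker threads."""
    tasks = [asyncio.to_thread(fetch, url) for url in urls]
    # return_exceptions=True keeps one failing URL from cancelling the rest
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return dict(zip(urls, results))


def fake_fetch(url):
    # Hypothetical stand-in for a real download; returns a dummy page body
    return f"<html>{url}</html>"


urls = ["https://www.example1.com", "https://www.example2.com"]
pages = asyncio.run(fetch_all(urls, fake_fetch))
for url, body in pages.items():
    print(url, len(body))
```

Passing the fetch function in as a parameter keeps the concurrency logic separate from the HTTP client, so the stand-in can be swapped for a real request function later.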
