photon vs phpscraper

GPL-3.0 61 3 12,807

1.4 thousand (month) Aug 24 2018 1.1.9(2018-10-21 03:39:17 ago)

583 2 28 GPL-3.0-or-later

May 04 2020 104 (month) 3.0.0(2024-04-09 15:34:59 ago)

Photon is a fast, lightweight web crawler written in Python and designed for OSINT (open-source intelligence) work. Photon can extract the following data while crawling:

URLs (in-scope & out-of-scope)
URLs with parameters (example.com/gallery.php?id=2)
Intel (emails, social media accounts, amazon buckets etc.)
Files (pdf, png, xml etc.)
Secret keys (auth/API keys & hashes)
JavaScript files & Endpoints present in them
Strings matching custom regex pattern
Subdomains & DNS related data

The extracted information is saved in an organized manner and can also be exported as JSON.
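To make the first two categories above concrete, here is a minimal sketch of how crawled URLs could be sorted into in-scope, out-of-scope, and parameterized groups and then exported as JSON. It uses only the Python standard library and is not Photon's own code; the function name `classify_urls`, the bucket names, and the sample URLs are assumptions made for illustration.

```python
import json
from urllib.parse import urlparse, parse_qs


def classify_urls(urls, scope_host):
    """Group URLs into in-scope, out-of-scope, and parameterized buckets.

    Hypothetical helper for illustration; not part of Photon.
    """
    result = {"in_scope": [], "out_of_scope": [], "with_parameters": []}
    for url in urls:
        parsed = urlparse(url)
        host = parsed.hostname or ""
        # Treat the scope host and any of its subdomains as in-scope.
        if host == scope_host or host.endswith("." + scope_host):
            result["in_scope"].append(url)
        else:
            result["out_of_scope"].append(url)
        # A URL "with parameters" has at least one query-string key.
        if parse_qs(parsed.query):
            result["with_parameters"].append(url)
    return result


# Sample input standing in for URLs found during a crawl.
found = [
    "https://example.com/about",
    "https://example.com/gallery.php?id=2",
    "https://cdn.example.org/lib.js",
]
report = classify_urls(found, "example.com")

# Export the grouped result as JSON.
print(json.dumps(report, indent=2))
```

A real crawler would also deduplicate URLs and resolve relative links before classifying them; those steps are left out to keep the sketch short.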

PHPScraper is a universal web utility for PHP. Its main goal is to get things done without getting distracted by selectors or by preparing and converting data structures: you go to a website and get the relevant information for your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

Direct access to basic page features such as meta data, links, images, headings, content, and keywords.
File downloading.
RSS, Sitemap and other feed processing.
CSV, XML and JSON file processing.
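As a language-neutral illustration of what "direct access" to links and headings involves, the following sketch collects both from an HTML string using Python's standard `html.parser` module. It is not PHPScraper code and does not reflect how PHPScraper is implemented; the class name `PageSummary` and the sample markup are assumptions.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageSummary(HTMLParser):
    """Collect link targets and heading texts from an HTML document.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag in HEADING_TAGS:
            # Start a new heading and accumulate its text in handle_data.
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


html = '<h1>Title</h1><p>See <a href="/docs">docs</a>.</p><h2>Sub <em>heading</em></h2>'
page = PageSummary()
page.feed(html)
page.close()

print(page.links)
print(page.headings)
```

A scraper library wraps this kind of traversal behind ready-made properties, which is the convenience the feature list above describes.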

Example Use

```python
# Note: Photon is usually run from the command line. The class and method
# names below are kept as given in this example and are not verified
# against the installed package.
from photon import Photon

# Create a new Photon instance
ph = Photon()

# Extract data from a specific element of the website
url = "https://www.example.com"
selector = "div.main"
data = ph.get_data(url, selector)

# Print the extracted data
print(data)

# Extract data from multiple websites asynchronously
urls = ["https://www.example1.com", "https://www.example2.com"]
data = ph.get_data_async(urls)
```

```php
// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;

// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');

// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text(); // "Content by ID"

// or pre-defined variables covering basic page data:
$web->links; // for all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline; // basic page outline
$web->cleanOutlineWithParagraphs; // page outline including paragraphs
```

Alternatives / Similar

colly (25,231)
katana (16,499)
pholcus (7,594)
geziyor (2,772)
html2text (2,140)
dataflowkit (711)
trafilatura (5,650)
scrapy (61,276)
readability (2,894)
crawl4ai (63,373)
newspaper (15,018)
rvest (1,517)
scrapling (36,206)
crawlee (22,720)
extruct (961)
mechanize (4,440)
sumy (3,670)
scrapegraphai (23,278)
gofeed (2,824)
ferret (5,964)
gocrawl (2,053)
scrapyd (3,087)
botasaurus (4,321)
node-crawler (6,790)
panther (3,062)
goutte (9,215)
gracy (248)
spidr (835)
kimurai (1,098)
scrapydweb (3,400)
wombat (1,360)
autoscraper (7,136)
roach (1,454)
gerapy (3,495)
ruia (1,743)
ralger (165)
ayakashi (217)
extractnet (297)
phpscraper (583)
dude (425)
php-spider (1,341)
crwlr-crawler (369)
firecrawl