photonvsreadability

GPL-3.0 51 3 11,149

415 (month) Aug 24 2018 1.1.9(6 years ago)

2,724 5 34 Apache-2.0

Jun 30 2011 270.7 thousand (month) 0.8.1(4 years ago)

Photon is a Python library for web scraping. It is designed to be lightweight and fast, and can be used to extract data from websites and web pages. Photon can extract the following data while crawling:

URLs (in-scope & out-of-scope)
URLs with parameters (example.com/gallery.php?id=2)
Intel (emails, social media accounts, amazon buckets etc.)
Files (pdf, png, xml etc.)
Secret keys (auth/API keys & hashes)
JavaScript files & Endpoints present in them
Strings matching custom regex pattern
Subdomains & DNS related data

The extracted information is saved in an organized manner or can be exported as json.

python-readability is a python package that allows developers to extract the main content of a web page, removing any unnecessary or unwanted elements, such as ads, navigation, and sidebars.

It is based on the algorithm used by the popular web-based service, Readability, and it uses the beautifulsoup4 package to parse the HTML and extract the main content.

Readability is similar to Newspaper in terms that it's extracting HTML data

Example Use

from photon import Photon

#Create a new Photon instance
ph = Photon()

#Extract data from a specific element of the website
url = "https://www.example.com"
selector = "div.main"
data = ph.get_data(url, selector)

#Print the extracted data
print(data)


#Extract data from multiple websites asynchronously
urls = ["https://www.example1.com", "https://www.example2.com"]
data = ph.get_data_async(urls)

import requests
from readability import document

response = requests.get('http://example.com')
doc = document(response.content)
doc.title()
'example domain'

doc.summary()
"""<html><body><div><body id="readabilitybody">\n<div>\n    <h1>example domain</h1>\n
<p>this domain is established to be used for illustrative examples in documents. you may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">more information...</a></p>\n</div>
\n</body>\n</div></body></html>"""