newspapervsphoton
newspaper is a Python package that allows developers to easily extract text, images, and videos from articles on the web.
It is designed to be fast, easy to use, and compatible with a wide variety of websites. It uses advanced algorithms to extract relevant information and metadata from articles, and it also supports several languages.
newspaper includes a http client or can ingest pre-scraped HTML documents.
Photon is a Python library for web scraping. It is designed to be lightweight and fast, and can be used to extract data from websites and web pages. Photon can extract the following data while crawling:
- URLs (in-scope & out-of-scope)
- URLs with parameters (example.com/gallery.php?id=2)
- Intel (emails, social media accounts, amazon buckets etc.)
- Files (pdf, png, xml etc.)
- Secret keys (auth/API keys & hashes)
- JavaScript files & Endpoints present in them
- Strings matching custom regex pattern
- Subdomains & DNS related data
The extracted information is saved in an organized manner or can be exported as json.
Example Use
from newspaper import Article
# Create a new article object
article = Article('https://www.example.com/article')
# Download the article
article.download()
# Parse the article
article.parse()
# Print the article text
print(article.text)
# Print the article title
print(article.title)
# Print the article authors
print(article.authors)
# Print the article publication date
print(article.publish_date)
from photon import Photon
#Create a new Photon instance
ph = Photon()
#Extract data from a specific element of the website
url = "https://www.example.com"
selector = "div.main"
data = ph.get_data(url, selector)
#Print the extracted data
print(data)
#Extract data from multiple websites asynchronously
urls = ["https://www.example1.com", "https://www.example2.com"]
data = ph.get_data_async(urls)