newspaper vs extruct
newspaper is a Python package that allows developers to easily extract text, images, and videos from articles on the web.
It is designed to be fast, easy to use, and compatible with a wide variety of websites. It uses advanced algorithms to extract relevant information and metadata from articles, and it also supports several languages.
newspaper ships with its own HTTP client for downloading pages, and it can also parse HTML documents that have already been scraped.
extruct is a library for extracting embedded metadata from HTML markup.
Currently, extruct supports:
- W3C's HTML Microdata
- embedded JSON-LD
- Microformat via mf2py
- Facebook's Open Graph
- (experimental) RDFa via rdflib
- Dublin Core Metadata (DC-HTML-2003)
extruct is a convenient parser for websites marked up with the schema.org vocabulary, which includes many modern sites, and it offers an easy way to extract common details such as product information or company contact details.
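To make "embedded JSON-LD" concrete, here is a minimal standard-library sketch of what such markup looks like and how it can be pulled out of a page. It is not how extruct works internally; the `JsonLdCollector` class and the `SAMPLE_HTML` snippet are hypothetical and exist only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # JSON-LD is embedded in script tags with this specific MIME type.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Hypothetical page fragment, written by hand for this sketch.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body></body></html>
"""

collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.items)
```

A dedicated library is still the better choice in practice, since real pages mix several metadata syntaxes and often contain malformed markup that a sketch like this does not handle.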
Example Use
from newspaper import Article
# Create a new article object
article = Article('https://www.example.com/article')
# Download the article
article.download()
# Parse the article
article.parse()
# Print the article text
print(article.text)
# Print the article title
print(article.title)
# Print the article authors
print(article.authors)
# Print the article publication date
print(article.publish_date)
# retrieve HTML content
import httpx
response = httpx.get('https://webscraping.fyi/lib/python/extruct')
import extruct
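# extract every supported metadata format; base_url is passed as a plain string
all_data = extruct.extract(response.text, base_url=str(response.url))
# or we can extract a specific metadata format by importing individual extractors:
from extruct.w3cmicrodata import MicrodataExtractor
extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)
from extruct.jsonld import JsonLdExtractor
extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)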
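The dictionary returned by `extruct.extract` (`all_data` in the example) is keyed by syntax name, with a list of extracted items under each key. As a rough sketch of how product information could be picked out of it, the helper below filters the `json-ld` items by their `@type`. The `items_of_type` function and the `sample` dictionary are hypothetical: the sample is a hand-written stand-in for real output, not data from an actual page.

```python
def items_of_type(extracted, wanted_type, syntax="json-ld"):
    """Return the items under one syntax key whose "@type" matches wanted_type."""
    matches = []
    for item in extracted.get(syntax, []):
        item_type = item.get("@type")
        # "@type" may be a single string or a list of strings.
        types = item_type if isinstance(item_type, list) else [item_type]
        if wanted_type in types:
            matches.append(item)
    return matches


# Hand-written stand-in for the dict returned by extruct.extract(); illustrative only.
sample = {
    "json-ld": [
        {"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"},
        {"@context": "https://schema.org", "@type": "Organization", "name": "Example Co"},
    ],
    "opengraph": [],
}

products = items_of_type(sample, "Product")
print([p["name"] for p in products])
```

This sketch only looks at top-level items. Pages that nest their entities under an `@graph` key, or that publish product data as microdata instead of JSON-LD, would need extra handling.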