extruct vs readability
extruct is a library for extracting embedded metadata from HTML markup.
Currently, extruct supports:
- W3C's HTML Microdata
- embedded JSON-LD
- Microformat via mf2py
- Facebook's Open Graph
- (experimental) RDFa via rdflib
- Dublin Core Metadata (DC-HTML-2003)
Extruct is a convenient parser for websites marked up with the schema.org vocabulary (which includes many modern websites), and an easy way to extract commonly needed details such as product information or company contact details.
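To make "embedded metadata" concrete, here is a minimal standard-library sketch of what two of the formats listed above (JSON-LD and Open Graph) look like in a page and how they could be pulled out. This is not extruct's implementation; the class name MetadataSketch and the sample HTML are hypothetical, and extruct itself handles many more formats and edge cases.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page carrying JSON-LD and Open Graph metadata.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example Product">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Product"}
</script>
</head><body></body></html>
"""


class MetadataSketch(HTMLParser):
    """Illustrative only: collects JSON-LD blocks and og:* meta tags."""

    def __init__(self):
        super().__init__()
        self.json_ld = []        # parsed JSON-LD objects, in document order
        self.open_graph = {}     # og:* property -> content
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            # Start buffering the script body; it is parsed on the end tag.
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.open_graph[attrs["property"]] = attrs.get("content")

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.json_ld.append(json.loads("".join(self._buffer)))


parser = MetadataSketch()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.json_ld)
print(parser.open_graph)
```

In practice extruct does this work for you and returns the results grouped by format, so a hand-rolled parser like this is only useful for understanding what the library is looking for.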
python-readability is a Python package that allows developers to extract the main content of a web page, removing unnecessary or unwanted elements such as ads, navigation, and sidebars.
It is based on the algorithm used by the popular web-based service Readability, and it uses the lxml package to parse the HTML and extract the main content.
Readability is similar to Newspaper in that both extract the main content from HTML pages.
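The idea of "main content" extraction can be illustrated with a toy heuristic: credit each paragraph's text to its nearest enclosing container and keep the container with the most paragraph text. This is a deliberately simplified sketch using only the standard library, not the algorithm python-readability actually implements; the names ParagraphScorer and main_text and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Elements treated as candidate content containers in this toy sketch.
CONTAINERS = {"div", "article", "section", "main", "body"}


class ParagraphScorer(HTMLParser):
    """Toy scorer: assigns each <p> to its nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.stack = []        # ids of currently open containers, innermost last
        self.paragraphs = {}   # container id -> list of paragraph strings
        self.next_id = 0
        self.in_p = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append(self.next_id)
            self.paragraphs[self.next_id] = []
            self.next_id += 1
        elif tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in CONTAINERS and self.stack:
            self.stack.pop()
        elif tag == "p" and self.in_p:
            self.in_p = False
            # Collapse whitespace and store the paragraph on the innermost container.
            text = " ".join("".join(self.buffer).split())
            if text and self.stack:
                self.paragraphs[self.stack[-1]].append(text)


def main_text(html):
    """Return the paragraphs of the container holding the most paragraph text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.paragraphs:
        return ""
    best = max(scorer.paragraphs.values(), key=lambda ps: sum(len(p) for p in ps))
    return "\n".join(best)


SAMPLE_PAGE = """
<html><body>
<div class="nav"><p>Home</p><p>About</p></div>
<div class="content"><p>Extruct pulls structured metadata out of a page.</p>
<p>Readability keeps only the article body.</p></div>
<div class="ads"><p>Buy now!</p></div>
</body></html>
"""

print(main_text(SAMPLE_PAGE))
```

Real extractors weigh many more signals than text length, so this sketch only shows why navigation and ad blocks tend to lose out to the article body.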
Example Use
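# extruct: retrieve the HTML content
import httpx
import extruct
from extruct.jsonld import JsonLdExtractor
from extruct.w3cmicrodata import MicrodataExtractor

response = httpx.get('https://webscraping.fyi/lib/python/extruct')
# extract every supported metadata format at once;
# pass base_url as a plain string (httpx returns a URL object)
all_data = extruct.extract(response.text, base_url=str(response.url))
# or extract a specific metadata format by importing an individual extractor:
extractor = MicrodataExtractor()
microdata = extractor.extract(response.text)
extractor = JsonLdExtractor()
jsonld = extractor.extract(response.text)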
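The dictionary returned by extruct.extract is keyed by syntax name (for example "json-ld" or "microdata"), with a list of extracted items under each key. A small helper for picking out JSON-LD items of one schema.org type might look like the sketch below. The helper name jsonld_items_of_type and the sample data are hypothetical, and the sample only mimics the shape of a real result.

```python
def jsonld_items_of_type(extracted, wanted_type):
    """Return JSON-LD items whose @type equals or includes wanted_type."""
    matches = []
    for item in extracted.get("json-ld", []):
        if not isinstance(item, dict):
            continue
        item_type = item.get("@type")
        # @type may be a single string or a list of strings.
        types = item_type if isinstance(item_type, list) else [item_type]
        if wanted_type in types:
            matches.append(item)
    return matches


# Hypothetical data shaped like an extruct.extract() result.
sample_result = {
    "json-ld": [
        {"@context": "https://schema.org", "@type": "Organization", "name": "Example Co"},
        {"@context": "https://schema.org", "@type": ["Product", "Thing"], "name": "Example Widget"},
    ],
    "microdata": [],
    "opengraph": [],
}

products = jsonld_items_of_type(sample_result, "Product")
print([p["name"] for p in products])
```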
# python-readability: extract the title and main content
import requests
from readability import Document

response = requests.get('http://example.com')
doc = Document(response.content)
doc.title()
'Example Domain'
doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n domain in examples without prior coordination or asking for permission.</p>
\n <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
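Because doc.summary() returns cleaned HTML rather than plain text, a common follow-up step is stripping the remaining tags. The sketch below does this with the standard library only. The names TextCollector and summary_to_text are hypothetical, and the input is a shortened, hand-written stand-in for the summary output shown above.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def summary_to_text(summary_html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    collector = TextCollector()
    collector.feed(summary_html)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


# Shortened stand-in for the HTML returned by doc.summary().
summary = (
    '<html><body><div><body id="readabilityBody">\n<div>\n <h1>Example Domain</h1>\n'
    '<p>More information...</p>\n</div>\n</body>\n</div></body></html>'
)
print(summary_to_text(summary))
```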