extruct vs extractnet

python html-extractor

BSD-3-Clause 56 12 961

273.1 thousand (month) Oct 27 2015 0.18.0(2024-11-08 14:59:22 ago)

297 9 9 MIT

Dec 11 2020 131 (month) 2.0.7(2022-11-06 07:33:14 ago)

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

W3C's HTML Microdata
embedded JSON-LD
Microformat via mf2py
Facebook's Open Graph
(experimental) RDFa via rdflib
Dublin Core Metadata (DC-HTML-2003)
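To give a feel for what "embedded JSON-LD" means in practice, here is a minimal standard-library sketch that collects the contents of `<script type="application/ld+json">` tags. It is not how extruct is implemented; the `JsonLdScriptParser` class, the `extract_jsonld` helper and the sample HTML are hypothetical names and data made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collects parsed JSON from <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # script content may arrive in several chunks, so buffer it
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that do not contain valid JSON


def extract_jsonld(html):
    """Return a list with one parsed object per JSON-LD script block."""
    parser = JsonLdScriptParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hand-written sample page (an assumption, not taken from a real site)
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><h1>Example Widget</h1></body></html>
"""

print(extract_jsonld(SAMPLE_HTML))
```

A library such as extruct covers the other formats in the list as well and handles many edge cases this sketch ignores.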

extruct is well suited to websites marked up with schema.org vocabulary, which many modern websites are, and is an easy way to extract common details such as product information and company contact details.
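As a rough sketch of picking product details out of extracted metadata: `extruct.extract()` returns a dict keyed by syntax name (for example `"json-ld"` and `"microdata"`), and JSON-LD items carry an `@type` field. The `find_jsonld_items` helper and the `sample` dict below are hypothetical, hand-written stand-ins, so the example runs without fetching a page.

```python
def find_jsonld_items(extracted, schema_type):
    """Return JSON-LD items whose @type matches schema_type."""
    matches = []
    for item in extracted.get("json-ld", []):
        if not isinstance(item, dict):
            continue
        item_type = item.get("@type")
        # @type may be a single string or a list of strings
        types = item_type if isinstance(item_type, list) else [item_type]
        if schema_type in types:
            matches.append(item)
    return matches


# Hand-written stand-in for the dict returned by extruct.extract()
sample = {
    "json-ld": [
        {
            "@context": "https://schema.org",
            "@type": "Organization",
            "name": "Example Corp",
            "telephone": "555-555-5555",
        },
        {
            "@context": "https://schema.org",
            "@type": "Product",
            "name": "Example Widget",
            "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"},
        },
    ],
    "microdata": [],
}

for product in find_jsonld_items(sample, "Product"):
    print(product["name"], product["offers"]["price"])
```

Real pages often nest items or use other syntaxes, so a production version would need to check more than the top-level JSON-LD list.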

ExtractNet is an automated web data extraction tool using machine learning to parse HTML and text data.

This tool can be used in web scraping to automatically extract details from scraped HTML documents. While it's not as accurate as structured extraction with HTML parsing tools such as CSS selectors or XPath, it can still extract a lot of details.
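For contrast, structured extraction with hand-written XPath rules might look like the sketch below. It uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and assumes well-formed markup; real-world HTML usually needs a dedicated HTML parser. The snippet and the `parse_product` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed snippet (an assumption for this sketch)
SNIPPET = """
<div class="product">
  <h1>Example Widget</h1>
  <span class="price">19.99</span>
</div>
"""


def parse_product(markup):
    """Pull the name and price out of a product block with fixed XPath rules."""
    root = ET.fromstring(markup.strip())
    return {
        "name": root.findtext("./h1"),
        "price": root.findtext("./span[@class='price']"),
    }


print(parse_product(SNIPPET))
```

Rules like these are precise but have to be written per site, which is the trade-off against an automated extractor.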

Example Use

```python
# retrieve HTML content
import httpx
import extruct
from extruct.jsonld import JsonLdExtractor
from extruct.w3cmicrodata import MicrodataExtractor

response = httpx.get('https://webscraping.fyi/lib/python/extruct')

# extract every supported metadata format at once
# (httpx returns a URL object, so convert it to a string for base_url)
all_data = extruct.extract(response.text, base_url=str(response.url))

# or extract a specific metadata format by importing an individual extractor:
microdata = MicrodataExtractor().extract(response.text)
jsonld = JsonLdExtractor().extract(response.text)
```

```python
import requests
from extractnet import Extractor

raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)
# results is a dict of extracted fields, for example:
# {'phone_number': '555-555-5555', 'email': 'example@example.com'}
```

