Skip to content

photonvsextruct

GPL-3.0 51 3 11,149
415 (month) Aug 24 2018 1.1.9(6 years ago)
884 12 52 BSD-3-Clause
Oct 27 2015 698.9 thousand (month) 0.18.0(9 months ago)

Photon is a Python library for web scraping. It is designed to be lightweight and fast, and can be used to extract data from websites and web pages. Photon can extract the following data while crawling:

  • URLs (in-scope & out-of-scope)
  • URLs with parameters (example.com/gallery.php?id=2)
  • Intel (emails, social media accounts, amazon buckets etc.)
  • Files (pdf, png, xml etc.)
  • Secret keys (auth/API keys & hashes)
  • JavaScript files & Endpoints present in them
  • Strings matching custom regex pattern
  • Subdomains & DNS related data

The extracted information is saved in an organized manner or can be exported as json.

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

  • W3C's HTML Microdata
  • embedded JSON-LD
  • Microformat via mf2py
  • Facebook's Open Graph
  • (experimental) RDFa via rdflib
  • Dublin Core Metadata (DC-HTML-2003)

Extruct is a brilliant data parser for schema.org marked up websites (many modern websites) and is an easy way to extract popular details like product information, company contact details etc.

Example Use


from photon import Photon

#Create a new Photon instance
ph = Photon()

#Extract data from a specific element of the website
url = "https://www.example.com"
selector = "div.main"
data = ph.get_data(url, selector)

#Print the extracted data
print(data)


#Extract data from multiple websites asynchronously
urls = ["https://www.example1.com", "https://www.example2.com"]
data = ph.get_data_async(urls)
# retrieve HTML content
import httpx

response = httpx.get('https://webscraping.fyi/lib/python/extruct')

import extruct

all_data = extruct.extract(response.text, response.url)

# or we can extract specific metadata format by importing individuals extractors:


extractor = extruct.MicrodataExtractor()
microdata = extractor.extract(response.text)

extractor = extruct.JsonLdExtractor()
jsonld = extractor.extract(response.text) 

Alternatives / Similar


Was this page helpful?