Skip to content

photonvshtml2text

MIT 52 3 10,458
305 (month) Aug 24 2018 1.1.9(5 years ago)
1,595 8 85 GNU GPL 3
2024.2.26(a month ago) Dec 14 2008 1.5 million (month)

Photon is a Python library for web scraping. It is designed to be lightweight and fast, and can be used to extract data from websites and web pages. Photon can extract the following data while crawling:

  • URLs (in-scope & out-of-scope)
  • URLs with parameters (example.com/gallery.php?id=2)
  • Intel (emails, social media accounts, amazon buckets etc.)
  • Files (pdf, png, xml etc.)
  • Secret keys (auth/API keys & hashes)
  • JavaScript files & Endpoints present in them
  • Strings matching custom regex pattern
  • Subdomains & DNS related data

The extracted information is saved in an organized manner or can be exported as json.

html2text is a Python library that allows developers to convert HTML code into plain text. It is designed to be easy to use, and it provides several options to customize the output.

The package uses the python's built-in html.parser to parse the HTML and then convert it to plain text.

html2text also comes with a CLI tool that can convert HTML files to text:

Usage: html2text [filename [encoding]]

Option  Description
--version   Show program's version number and exit
-h, --help  Show this help message and exit
--ignore-links  Don't include any formatting for links
--escape-all    Escape all special characters. Output is less readable, but avoids corner case formatting issues.
--reference-links   Use reference links instead of links to create markdown
--mark-code Mark preformatted and code blocks with [code]...[/code]

Example Use


from photon import Photon

#Create a new Photon instance
ph = Photon()

#Extract data from a specific element of the website
url = "https://www.example.com"
selector = "div.main"
data = ph.get_data(url, selector)

#Print the extracted data
print(data)


#Extract data from multiple websites asynchronously
urls = ["https://www.example1.com", "https://www.example2.com"]
data = ph.get_data_async(urls)
import html2text

h = html2text.HTML2Text()

# Ignore converting links from HTML
h.ignore_links = True
print h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!")
"Hello, world!"

print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))

"Hello, world!"

# Don't Ignore links anymore, I like links
h.ignore_links = False
print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
"Hello, [world](https://www.google.com/earth/)!"

Alternatives / Similar