Skip to content

html5libvsrequests-html

MIT 83 14 1,104
15.0 million (month) Jul 30 2007 1.1(4 years ago)
13,615 2 222 MIT
Feb 25 2018 1.1 million (month) 0.10.0(5 years ago)

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

As html5lib is implemented in pure-python it is significantly slower than alternatives powered by lxml (like parsel or beautifulsoup). However, html5lib implements a more true html5 parsing which can represent HTML tree more correctly than alternatives.

requests-html is a Python package that allows you to easily make HTTP requests and parse the HTML content of web pages. It is built on top of the popular requests package and uses the html parser from the lxml library, which makes it fast and efficient. This package is designed to provide a simple and convenient API for web scraping, and it supports features such as JavaScript rendering, CSS selectors, and form submissions.

It also offers a lot of functionalities such as cookie, session, and proxy support, which makes it an easy-to-use package for web scraping and web automation tasks.

In short requests-html offers:

  • Full JavaScript support!
  • CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection–pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.
  • Async Support

Example Use


import html5lib
from html5lib import parse

html_doc = "<html><head><title>My Title</title></head><body></body></html>"
parsed = parse(html_doc)
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.example.com')

# print the HTML content of the page
print(r.html.html)

# use CSS selectors to find specific elements on the page
title = r.html.find('title', first=True)
print(title.text)

Alternatives / Similar


Was this page helpful?