html5lib vs parsel

MIT License 83 14 1,092
18.9 million (month) Jul 30 2007 1.1 (3 years ago)
1,067 8 36 BSD
1.9.0 (22 days ago) Jul 26 2019 1.5 million (month)

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

Because html5lib is implemented in pure Python, it is significantly slower than lxml-based alternatives (such as parsel, or beautifulsoup when used with its lxml backend). However, html5lib follows the HTML5 parsing algorithm more closely, so it can represent the HTML tree more accurately than these alternatives.
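To illustrate what following the HTML5 parsing rules means in practice, here is a minimal sketch. The sample markup is made up for illustration; it relies on the HTML5 rule that a `<tr>` placed directly inside a `<table>` gets an implied `<tbody>`, just as it would in a browser.

```python
import html5lib

# Hypothetical input: a table row written without an explicit <tbody>.
broken = "<table><tr><td>cell</td></tr></table>"

# namespaceHTMLElements=False keeps tag names plain ("table" rather than
# "{http://www.w3.org/1999/xhtml}table"), which makes path lookups simpler.
doc = html5lib.parse(broken, namespaceHTMLElements=False)

# The HTML5 algorithm inserts the missing <html>, <head>, <body> and <tbody>.
tbody = doc.find("body/table/tbody")
has_tbody = tbody is not None
cell_text = doc.find("body/table/tbody/tr/td").text

print(has_tbody, cell_text)
```

By default `html5lib.parse` returns an `xml.etree.ElementTree` element, so the usual `find` and `findall` path lookups work on the result.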

parsel is a library for parsing HTML and XML using selectors, similar to beautifulsoup. It is built on top of the lxml library and extracts data from HTML and XML documents using selectors, much like the CSS selectors used in web development. It is a lightweight library designed specifically for web scraping and parsing, so it can be faster and more efficient than beautifulsoup in some use cases.

Some of the key features of parsel include:

  • CSS selector & XPath selector support:
    The two most common HTML path languages are both supported in parsel. This allows selecting attributes, tags and text, as well as complex matching rules that use regular expressions or XPath functions.
  • Modifying data:
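    parsel allows you to remove matched elements from a document. Other changes, such as editing the contents of an element or adding new elements, go through the underlying lxml element that each selector exposes as root.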
  • Support for both HTML and XML:
    parsel supports both HTML and XML documents and you can use the same selectors for both formats.

It is easy to use and less verbose than beautifulsoup, which makes it popular among developers who work on web scraping projects and parse data from large volumes of web pages.

Highlights


css-selectors, xpath-selectors

Example Use


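from html5lib import parse

html_doc = "<html><head><title>My Title</title></head><body></body></html>"
# parse() returns an xml.etree element by default; request a DOM tree
# so that getElementsByTagName() is available:
parsed = parse(html_doc, treebuilder="dom")
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)
# My Title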

from parsel import Selector

# this is our HTML page:
html = """
<head>
  <title>Hello World!</title>
</head>
<body>
  <div id="product">
    <h1>Product Title</h1>
    <p>paragraph 1</p>
    <p>paragraph2</p>
    <span class="price">$10</span>
  </div>
</body>
"""

selector = Selector(text=html)

# we can use CSS selectors:
print(selector.css("#product .price::text").get())
# $10

# or XPath (note /text(), otherwise the whole <span> element is returned):
print(selector.xpath('//span[@class="price"]/text()').get())
# $10

# or get all matching elements:
print(selector.css("#product p::text").getall())
# ['paragraph 1', 'paragraph2']

# parsel also comes with utility methods like regular expression parsing:
print(selector.xpath('//span[@class="price"]').re(r"\d+"))
# ['10']

Alternatives / Similar