
selectolax vs parsel

MIT license 14 1 933
74.3 thousand (month) Mar 01 2018 0.3.20(11 days ago)
1,053 8 36 BSD
1.8.1(10 months ago) Jul 26 2019 1.0 million (month)

selectolax is a fast and lightweight library for parsing HTML documents in Python. It is often used as a much faster alternative to the popular BeautifulSoup library, although its API is different, so it is not a drop-in replacement.

selectolax is written in Cython and wraps the Modest and Lexbor HTML engines, both implemented in C, to quickly parse and navigate HTML documents. It provides a simple and intuitive API for working with the document's structure through CSS selectors.

To use selectolax, first install it with pip by running pip install selectolax. Once it is installed, pass an HTML document to the HTMLParser class from selectolax.parser to parse it; the root attribute of the resulting object is the top-level node of the document. For example:

from selectolax.parser import HTMLParser

html_string = "<html><body>Hello, World!</body></html>"
root = HTMLParser(html_string).root
print(root.tag) # html
HTMLParser accepts the document as either str or bytes. To parse a file, read its contents first and pass them in. selectolax targets HTML only; for XML documents a library such as lxml or parsel is a better fit.

Once you have a selectolax object, you can use the css() method to search for elements in the document using CSS selectors. It returns a list of matching nodes. For example:

body = root.css("body")[0]
print(body.text()) # Hello, World!

Like BeautifulSoup's find and find_all methods, selectolax offers the css_first() method, which returns the first matching element (or None if nothing matches), and the css() method, which returns a list of all matching elements.

parsel is a library for extracting data from HTML and XML documents using CSS and XPath selectors. It is built on top of the lxml library (CSS selectors are translated to XPath by cssselect) and is the selector engine used by the Scrapy framework. It is a lightweight library designed specifically for web scraping, so it can be faster and more efficient than BeautifulSoup in some use cases.

Some of the key features of parsel include:

  • CSS selector & XPath selector support:
    parsel supports the two most common HTML query languages. This allows selecting attributes, tags and text, as well as complex matching rules that use regular expressions or XPath functions.
  • Modifying data:
    parsel can remove matched elements from a document with the drop() method. For other changes, such as editing or adding elements, the underlying lxml element is available through the root attribute.
  • Support for both HTML and XML:
    parsel supports both HTML and XML documents, and the same selector API works for both formats.

It is easy to use and less verbose than BeautifulSoup, which makes it popular among developers who work on web scraping projects and parse data from large volumes of web pages.

Highlights


css-selectors, xpath-selectors

Example Use


from selectolax.parser import HTMLParser

html_string = "<html><body>Hello, World!</body></html>"
root = HTMLParser(html_string).root
print(root.tag) # html

# use CSS selectors (css() returns a list of matching nodes):
body = root.css("body")[0]
print(body.text()) # Hello, World!

# find the first matching element (None if nothing matches):
body = root.css_first("body")
print(body.text()) # Hello, World!

# or all matching elements:
html_string = "<html><body><p>paragraph1</p><p>paragraph2</p></body></html>"
root = HTMLParser(html_string).root
for el in root.css("p"):
    print(el.text())
# will print:
# paragraph1
# paragraph2

from parsel import Selector

# this is our HTML page:
html = """
<head>
  <title>Hello World!</title>
</head>
<body>
  <div id="product">
    <h1>Product Title</h1>
    <p>paragraph 1</p>
    <p>paragraph2</p>
    <span class="price">$10</span>
  </div>
</body>
"""

selector = Selector(html)

# we can use CSS selectors:
print(selector.css("#product .price::text").get())
# $10

# or XPath (note /text() to select the text rather than the whole element):
print(selector.xpath('//span[@class="price"]/text()').get())
# $10

# or get all matching elements:
print(selector.css("#product p::text").getall())
# ['paragraph 1', 'paragraph2']

# parsel also comes with utility methods like regular expression parsing:
print(selector.xpath('//span[@class="price"]').re(r"\d+"))
# ['10']

Alternatives / Similar