Skip to content

selectolaxvshtml5-php

MIT license 15 1 937
79.7 thousand (month) Mar 01 2018 0.3.20(19 days ago)
1,431 5 36 MIT
2.8.1(9 months ago) Jun 01 2013 147.7 thousand (month)

selectolax is a fast and lightweight library for parsing HTML and XML documents in Python. It is designed to be a drop-in replacement for the popular BeautifulSoup library, with significantly faster performance.

selectolax uses a Cython-based parser to quickly parse and navigate through HTML and XML documents. It provides a simple and intuitive API for working with the document's structure, similar to BeautifulSoup.

To use selectolax, you first need to install it via pip by running pip install selectolax``. Once it is installed, you can use theselectolax.html.fromstring()function to parse an HTML document and create a selectolax object. For example:

from selectolax.parser import HTMLParser

html_string = "<html><body>Hello, World!</body></html>"
root = HTMLParser(html_string).root
print(root.tag) # html
You can also useselectolax.html.fromstring()with file-like objects, bytes or file paths, as well asselectolax.xml.fromstring()`` for parsing XML documents.

Once you have a selectolax object, you can use the select() method to search for elements in the document using CSS selectors, similar to BeautifulSoup. For example:

body = root.select("body")[0]
print(body.text()) # "Hello, World!"

Like BeautifulSoups find and find_all methods selectolax also supports searching using the search()`` method, which returns the first matching element, and thesearch_all()`` method, which returns all matching elements.

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.

HTML5 provides the following features:

  • An HTML5 serializer
  • Support for PHP namespaces
  • Composer support
  • Event-based (SAX-like) parser
  • A DOM tree builder
  • Interoperability with QueryPath
  • Runs on PHP 5.3.0 or newer

Note that html5-php is a low-level HTML parser and does not feature any query features like CSS selectors.

Example Use


from selectolax.parser import HTMLParser

html_string = "<html><body>Hello, World!</body></html>"
root = HTMLParser(html_string).root
print(root.tag) # html

# use css selectors:
body = root.select("body")[0]
print(body.text()) # "Hello, World!"

# find first matching element:
body = root.search("body")
print(body.text()) # "Hello, World!"

# or all matching elements:
html_string = "<html><body><p>paragraph1</p><p>paragraph2</p></body></html>"
root = HTMLParser(html_string).root
for el in root.search_all("p"):
  print(el.text()) 
# will print:
# paragraph 1
# paragraph 2
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use Masterminds\HTML5;

// An example HTML document:
$html = <<< 'HERE'
  <html>
  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>
HERE;

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
print $html5->saveHTML($dom);

// Or save it to a file:
$html5->save($dom, 'out.html');

Alternatives / Similar