html5-phpvscssselect
HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.
HTML5 provides the following features:
- An HTML5 serializer
- Support for PHP namespaces
- Composer support
- Event-based (SAX-like) parser
- A DOM tree builder
- Interoperability with QueryPath
- Runs on PHP 5.3.0 or newer
Note that html5-php is a low-level HTML parser and does not feature any query features like CSS selectors.
cssselect is a BSD-licensed Python library to parse CSS3 selectors and translate them to XPath 1.0 expressions.
XPath 1.0 expressions can be used in lxml or another XPath engine to find the matching elements in an XML or HTML document.
cssselect is used by other popular Python packages like parsel
and scrapy
but can also be used on it's own to generate
valid XPath 1.0 expressions for parsing HTML and XML documents in other tools.
Note that because XPath selectors are more powerful than CSS selectors this translation is only possible one way. Converting XPath to CSS selectors is impractical and not supported by cssselect.
Example Use
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";
use Masterminds\HTML5;
// An example HTML document:
$html = <<< 'HERE'
<html>
<head>
<title>TEST</title>
</head>
<body id='foo'>
<h1>Hello World</h1>
<p>This is a test of the HTML5 parser.</p>
</body>
</html>
HERE;
// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);
// Render it as HTML5:
print $html5->saveHTML($dom);
// Or save it to a file:
$html5->save($dom, 'out.html');
from cssselect import GenericTranslator, SelectorError
translator = GenericTranslator()
try:
expression = translator.css_to_xpath('div.content')
print(expression)
'descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]'
except SelectorError as e:
print(f'Invalid selector {e}')