html5-phpvshtml5lib
HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.
HTML5 provides the following features:
- An HTML5 serializer
- Support for PHP namespaces
- Composer support
- Event-based (SAX-like) parser
- A DOM tree builder
- Interoperability with QueryPath
- Runs on PHP 5.3.0 or newer
Note that html5-php is a low-level HTML parser and does not feature any query features like CSS selectors.
html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
As html5lib is implemented in pure-python it is significantly slower than alternatives powered by lxml
(like parsel
or beautifulsoup
).
However, html5lib implements a more true html5 parsing which can represent HTML tree more correctly than alternatives.
Example Use
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";
use Masterminds\HTML5;
// An example HTML document:
$html = <<< 'HERE'
<html>
<head>
<title>TEST</title>
</head>
<body id='foo'>
<h1>Hello World</h1>
<p>This is a test of the HTML5 parser.</p>
</body>
</html>
HERE;
// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);
// Render it as HTML5:
print $html5->saveHTML($dom);
// Or save it to a file:
$html5->save($dom, 'out.html');
import html5lib
from html5lib import parse
html_doc = "<html><head><title>My Title</title></head><body></body></html>"
parsed = parse(html_doc)
title = parsed.getElementsByTagName("title")[0]
print(title.childNodes[0].nodeValue)