chopper vs cssselect
Chopper is a tool that extracts elements from HTML while preserving their ancestors and the CSS rules that apply to them.
Compared to other HTML parsers, Chopper is designed to retain the original HTML tree and eliminate only the elements that do not match the parsing rules. This means we can parse HTML elements and keep their structure, which is useful for machine learning or other tasks where the data structure is needed as well as the data values.
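The idea of keeping matched elements together with their ancestors can be sketched with the standard library alone. This is not Chopper's implementation or API: `prune` and its `keep` predicate are hypothetical names used only for illustration, the input must be well-formed XML, and the sketch ignores CSS rules entirely.

```python
import xml.etree.ElementTree as ET


def prune(root, keep):
    """Drop every element that neither matches `keep` nor contains a match.

    Matching elements are kept whole (including their descendants), and
    their ancestors are kept so the original nesting survives.
    """
    def visit(el):
        if keep(el):
            return True  # keep this element and everything below it
        kept_any = False
        for child in list(el):  # copy the child list, since we mutate `el`
            if visit(child):
                kept_any = True
            else:
                el.remove(child)
        return kept_any

    visit(root)
    return root


# Hypothetical sample document, compact so whitespace does not matter.
doc = (
    '<html><head><title>Test</title></head><body>'
    '<div id="header"/>'
    '<div id="main"><div class="iwantthis">HELLO WORLD</div></div>'
    '<div id="footer"/>'
    '</body></html>'
)

root = prune(ET.fromstring(doc), lambda el: el.get("class") == "iwantthis")
result = ET.tostring(root, encoding="unicode")
print(result)
```

Only the `iwantthis` div and the chain of elements leading to it remain; siblings such as the header and footer are removed.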
cssselect is a BSD-licensed Python library to parse CSS3 selectors and translate them to XPath 1.0 expressions.
XPath 1.0 expressions can be used in lxml or another XPath engine to find the matching elements in an XML or HTML document.
cssselect is used by other popular Python packages like parsel and scrapy, but it can also be used on its own to generate valid XPath 1.0 expressions for parsing HTML and XML documents in other tools.
Note that because XPath selectors are more powerful than CSS selectors, this translation is only possible one way. Converting XPath to CSS selectors is impractical and is not supported by cssselect.
Example Use
from chopper.extractor import Extractor

HTML = """
<html>
<head>
<title>Test</title>
</head>
<body>
<div id="header"></div>
<div id="main">
<div class="iwantthis">
HELLO WORLD
<a href="/nope">Do not want</a>
</div>
</div>
<div id="footer"></div>
</body>
</html>
"""
CSS = """
div { border: 1px solid black; }
div#main { color: blue; }
div.iwantthis { background-color: red; }
a { color: green; }
div#footer { border-top: 2px solid red; }
"""
extractor = Extractor.keep('//div[@class="iwantthis"]').discard('//a')
html, css = extractor.extract(HTML, CSS)
# will result in:
html
"""
<html>
<body>
<div id="main">
<div class="iwantthis">
HELLO WORLD
</div>
</div>
</body>
</html>"""
css
"""
div{border:1px solid black;}
div#main{color:blue;}
div.iwantthis{background-color:red;}
"""
# cssselect: translate a CSS selector into an XPath 1.0 expression
from cssselect import GenericTranslator, SelectorError

translator = GenericTranslator()
try:
    expression = translator.css_to_xpath('div.content')
    print(expression)
    # prints: descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]
except SelectorError as e:
    print(f'Invalid selector: {e}')