chopper vs simple-html-dom

chopper: MIT · 1 · 3 · 23 · 1.7 thousand (month) · Jul 24 2014 · 0.6.0 (2023-04-26 10:16:25)

simple-html-dom: - · - · - · MIT · Nov 09 2019 · 1.0 thousand (month) · 2.0-RC2 (2019-11-09 15:42:50)

Chopper is a tool to extract elements from HTML by preserving ancestors and CSS rules.

Unlike most HTML parsers, Chopper is designed to retain the original HTML tree while removing the elements that do not match the parsing rules. This means we can extract HTML elements and keep their structure, which is useful for machine learning and other tasks where the structure of the data matters as much as its values.
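To make the "keep the matches and their ancestors" idea concrete, here is a minimal sketch using only Python's standard library. It is not Chopper's implementation: the `prune` helper and the sample markup are hypothetical, and it assumes well-formed markup that `xml.etree.ElementTree` can parse.

```python
import xml.etree.ElementTree as ET


def prune(root, keep_xpath):
    """Keep elements matching keep_xpath, plus their ancestors and descendants."""
    matched = set(root.findall(keep_xpath))
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Collect every matched element together with its chain of ancestors.
    keep = set()
    for element in matched:
        node = element
        while node is not None:
            keep.add(node)
            node = parent_of.get(node)

    def walk(node):
        for child in list(node):
            if child in matched:
                continue  # matched subtrees are kept whole
            if child in keep:
                walk(child)  # an ancestor: keep it, prune its other children
            else:
                node.remove(child)

    walk(root)
    return root


doc = ET.fromstring(
    "<html><body>"
    "<div id='main'><p class='want'>keep me</p><p>drop me</p></div>"
    "<div id='footer'>drop me too</div>"
    "</body></html>"
)
pruned = ET.tostring(prune(doc, ".//p[@class='want']"), encoding="unicode")
print(pruned)
```

The surviving tree still has `html`, `body` and `div#main` around the matched paragraph, so the position of the data in the document is preserved along with its value.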

Simple HTML DOM Parser is a lightweight PHP library for parsing and manipulating HTML documents using a jQuery-like syntax. It is one of the most widely used HTML parsers in the PHP ecosystem, known for its simplicity and low learning curve.

Key features include:

- jQuery-like selectors: Find elements using CSS selectors similar to jQuery: find('#id'), find('.class'), find('div > p'), etc.
- Easy element access: Access element attributes, inner text, and inner HTML with simple property access: $element->plaintext, $element->href, $element->innertext.
- DOM manipulation: Modify element attributes, text content, and HTML. Add or remove elements from the document tree.
- File and string parsing: Parse HTML from strings, local files, or URLs directly.
- Nested element traversal: Navigate parent, child, and sibling elements with built-in traversal methods.
- Memory-friendly: Designed to handle large HTML documents efficiently with explicit memory cleanup.

Simple HTML DOM is best suited for quick scraping tasks and small projects where simplicity is more important than performance. For large-scale or performance-critical applications, consider using PHP's built-in DOMDocument or Symfony's DomCrawler, which offer better performance and standards compliance.

Highlights

css-selectors, dsl-selectors

Example Use

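```python
from chopper.extractor import Extractor

# The tags in this sample were lost in extraction; the markup is reconstructed
# from the surviving text and from the selectors used below.
HTML = """
<html>
  <head><title>Test</title></head>
  <body>
    <div id="main">
      <div class="iwantthis">
        HELLO WORLD
        <a>Do not want</a>
      </div>
    </div>
    <div id="footer"></div>
  </body>
</html>
"""

CSS = """
div { border: 1px solid black; }
div#main { color: blue; }
div.iwantthis { background-color: red; }
a { color: green; }
div#footer { border-top: 2px solid red; }
"""

# Keep the wanted div (together with its ancestors) and discard every link.
extractor = Extractor.keep('//div[@class="iwantthis"]').discard('//a')
html, css = extractor.extract(HTML, CSS)

# will result in:
#
# html
# """
# <html><body>
#   <div id="main">
#     <div class="iwantthis">HELLO WORLD</div>
#   </div>
# </body></html>
# """
#
# css
# """
# div{border:1px solid black;}
# div#main{color:blue;}
# div.iwantthis{background-color:red;}
# """
```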
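The CSS half of the example above drops the rules for `a` and `div#footer` because nothing in the extracted tree matches them any more. The sketch below shows one way such filtering could work. It is an illustration only, not how Chopper does it: `matches` and `prune_css` are hypothetical helpers, and only selectors of the form `tag`, `tag#id` or `tag.class` are supported.

```python
import re
import xml.etree.ElementTree as ET

RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")
SIMPLE = re.compile(
    r"^(?P<tag>[a-zA-Z][\w-]*)?(?:#(?P<id>[\w-]+))?(?:\.(?P<cls>[\w-]+))?$"
)


def matches(selector, element):
    """Match only simple selectors: tag, tag#id or tag.class."""
    selector = selector.strip()
    m = SIMPLE.match(selector) if selector else None
    if m is None:
        return False  # unsupported selectors count as non-matching in this sketch
    if m["tag"] and element.tag != m["tag"]:
        return False
    if m["id"] and element.get("id") != m["id"]:
        return False
    if m["cls"] and m["cls"] not in element.get("class", "").split():
        return False
    return True


def prune_css(css, root):
    """Keep only the rules whose selector matches an element left in the tree."""
    elements = list(root.iter())
    kept = []
    for selector, body in RULE.findall(css):
        if any(matches(selector, el) for el in elements):
            kept.append(f"{selector.strip()}{{{body.strip()}}}")
    return "\n".join(kept)


tree = ET.fromstring(
    '<html><body><div id="main">'
    '<div class="iwantthis">HELLO WORLD</div>'
    "</div></body></html>"
)
css = """
div { border: 1px solid black; }
div#main { color: blue; }
div.iwantthis { background-color: red; }
a { color: green; }
div#footer { border-top: 2px solid red; }
"""
result = prune_css(css, tree)
print(result)
```

Only the three `div` rules survive here. A real implementation would need a full selector engine to handle combinators, pseudo-classes and grouped selectors.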

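```php
<?php
// The opening lines of this example were lost in extraction; the autoload,
// imports and HtmlWeb client below are an assumed reconstruction.
require 'vendor/autoload.php';

use simplehtmldom\HtmlDocument;
use simplehtmldom\HtmlWeb;

// Parse from a URL
$client = new HtmlWeb();
$html = $client->load('https://example.com');

// Or parse from a string (the markup is illustrative; it is kept in its own
// variable so that the URL document is the one used below)
$fromString = new HtmlDocument('<h1>Hello World</h1>');

// Find elements using CSS-like selectors
$title = $html->find('title', 0)->plaintext;
echo "Title: $title\n";

// Find all products
foreach ($html->find('.product') as $product) {
    $name = $product->find('.name', 0)->plaintext;
    $price = $product->find('.price', 0)->plaintext;
    $link = $product->find('a', 0)->href;
    echo "$name: $price ($link)\n";
}

// Modify elements
$html->find('h1', 0)->innertext = 'Modified Title';

// Clean up memory when done
$html->clear();
```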