Skip to content

choppervsdomcrawler

MIT 1 3 22
820 (month) Jul 24 2014 0.6.0(11 months ago)
3,892 8 - MIT
v7.0.4(a month ago) Sep 26 2011 169.8 thousand (month)

Chopper is a tool to extract elements from HTML by preserving ancestors and CSS rules.

Compared to other HTML parsers Chopper is designed to retain original HTML tree but eliminate elements that do not match parsing rules. Meaning, we can parse HTML elements and keep thei structure for machine learning or other tasks where data structure is needed as well as the data value.

DOMCrawler library is part of the Symfony Components project and provides an easy way to traverse and manipulate HTML and XML documents using the Document Object Model (DOM) in PHP.

DOMcrawler supports both CSS selectors and XPath for HTML document parsing and is one the most popular HTML parsing tools used in web scraping with PHP.

Example Use


HTML = """
<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <div id="header"></div>
    <div id="main">
      <div class="iwantthis">
        HELLO WORLD
        <a href="/nope">Do not want</a>
      </div>
    </div>
    <div id="footer"></div>
  </body>
</html>
"""

CSS = """
div { border: 1px solid black; }
div#main { color: blue; }
div.iwantthis { background-color: red; }
a { color: green; }
div#footer { border-top: 2px solid red; }
"""

extractor = Extractor.keep('//div[@class="iwantthis"]').discard('//a')
html, css = extractor.extract(HTML, CSS)

# will result in:
html
"""
<html>
  <body>
    <div id="main">
      <div class="iwantthis">
        HELLO WORLD
      </div>
    </div>
  </body>
</html>"""

css
"""
div{border:1px solid black;}
div#main{color:blue;}
div.iwantthis{background-color:red;}
"""
use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><h1 class="title">Hello World</h1></body></html>';
$crawler = new Crawler($html);

// Find all elements using CSS selectors
$elements = $crawler->filter('.title')i;
// or XPath
$elements = $crawler->filterXPath('//h1');

// Print the text content of the elements
foreach ($elements as $element) {
    echo $element->textContent;
}

Alternatives / Similar