Skip to content

html5-phpvssoup

MIT 34 5 1,638
200.0 thousand (month) Jun 01 2013 2.9.0(1 year, 4 months ago)
2,191 1 22 MIT
Apr 29 2017 58.1 thousand (month) v1.2.5(3 years ago)

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.

HTML5 provides the following features:

  • An HTML5 serializer
  • Support for PHP namespaces
  • Composer support
  • Event-based (SAX-like) parser
  • A DOM tree builder
  • Interoperability with QueryPath
  • Runs on PHP 5.3.0 or newer

Note that html5-php is a low-level HTML parser and does not feature any query features like CSS selectors.

soup is a Go library for parsing and querying HTML documents.

It provides a simple and intuitive interface for extracting information from HTML pages. It's inspired by popular Python web scraping library BeautifulSoup and shares similar use API implementing functions like Find and FindAll.

soup can also use go's built-in http client to download HTML content.

Note that unlike beautifulsoup, soup does not support CSS selectors or XPath.

Example Use


<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use Masterminds\HTML5;

// An example HTML document:
$html = <<< 'HERE'
  <html>
  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>
HERE;

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
print $html5->saveHTML($dom);

// Or save it to a file:
$html5->save($dom, 'out.html');
package main

import (
  "fmt"
  "log"

  "github.com/anaskhan96/soup"
)

func main() {

  url := "https://www.bing.com/search?q=weather+Toronto"

  # soup has basic HTTP client though it's not recommended for scraping:
  resp, err := soup.Get(url)
  if err != nil {
    log.Fatal(err)
  }

  # create soup object from HTML
  doc := soup.HTMLParse(resp)

  # html elements can be found using Find or FindStrict methods:
  # in this case find <div> elements where "class" attribute matches some values:
  grid := doc.FindStrict("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
  # note: to find all elements FindAll() method can be used the same way

  # elements can be further searched for descendents:
  heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
  conditions := grid.Find("div", "class", "wtr_condition")
  primaryCondition := conditions.Find("div")
  secondaryCondition := primaryCondition.FindNextElementSibling()
  temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
  others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
  caption := secondaryCondition.Find("div").Text()

  fmt.Println("City Name : " + heading)
  fmt.Println("Temperature : " + temp + "˚C")
  for _, i := range others {
    fmt.Println(i.Text())
  }
  fmt.Println(caption)
}

Alternatives / Similar


Was this page helpful?