gazpacho vs goquery

MIT 16 1 764

6.6 thousand (month) Dec 28 2012 1.1(4 years ago)

14,273 3 3 BSD-3-Clause

Aug 29 2016 58.1 thousand (month) v1.10.2(6 months ago)

gazpacho is a Python library for scraping web pages. It is designed to make it easy to extract information from a web page by providing a simple and intuitive API for working with the page's structure.

gazpacho has no external dependencies: it downloads pages and parses HTML using only the Python standard library. Elements are located by tag name and attributes, in a style similar to BeautifulSoup's find(), rather than by CSS selectors.

To use gazpacho, install it with pip by running pip install gazpacho. Once it is installed, you can call gazpacho.get() to download a web page as an HTML string and wrap that string in a Soup object for parsing. For example:

from gazpacho import get, Soup

url = "https://en.wikipedia.org/wiki/Web_scraping"
html = get(url)
soup = Soup(html)
print(soup.find('title').text)

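gazpacho.get() also accepts optional params and headers arguments for the query string and the request headers. HTML that you already hold as a string can be passed straight to Soup().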

Once you have a Soup object, you can use its find() method to search for elements by tag name and, optionally, a dictionary of attributes, similar to BeautifulSoup.

Rather than separate methods for single and multiple results, find() takes a mode argument: mode="first" returns the first matching element, mode="all" returns a list of all matching elements, and the default mode returns a single element or a list depending on how many elements match.
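To make the tag-plus-attributes style of lookup concrete (the form used in calls such as soup.find("div", {"class": "item"})), here is a minimal sketch built only on the standard library's html.parser. It is not gazpacho's implementation: TagFinder, find_all and find_first are hypothetical names chosen for illustration, attribute values are compared exactly, and a match nested inside another match is not reported separately.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Collect the text of elements matching a tag name and attributes.

    Hypothetical helper for illustration; not part of gazpacho.
    Assumes the searched tag has a closing tag (not a void tag like <br>).
    """

    def __init__(self, tag, attrs=None):
        super().__init__()
        self.tag = tag
        self.attrs = attrs or {}
        self.depth = 0      # > 0 while the parser is inside a matched element
        self.matches = []   # accumulated text of each matched element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested tags of the same name so the match closes correctly.
            if tag == self.tag:
                self.depth += 1
            return
        found = dict(attrs)
        if tag == self.tag and all(found.get(k) == v for k, v in self.attrs.items()):
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def find_all(html, tag, attrs=None):
    """Return the stripped text of every matching element."""
    finder = TagFinder(tag, attrs)
    finder.feed(html)
    finder.close()
    return [text.strip() for text in finder.matches]


def find_first(html, tag, attrs=None):
    """Return the text of the first matching element, or None."""
    results = find_all(html, tag, attrs)
    return results[0] if results else None


if __name__ == "__main__":
    sample = '<div class="item">first <b>bold</b></div><div class="item">second</div>'
    print(find_first(sample, "div", {"class": "item"}))  # first bold
    print(find_all(sample, "div", {"class": "item"}))    # ['first bold', 'second']
```

The split into a "first" and an "all" helper mirrors the distinction a scraper usually cares about: a single element whose text can be read directly, versus a list to loop over.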

goquery is a popular, easy-to-use Go library that brings a syntax and a set of features similar to jQuery to the Go language. It lets you select elements from an HTML document with CSS selectors.

It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off.

Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller's responsibility to ensure that the source document provides UTF-8 encoded HTML. See the goquery wiki for various options to do this.

Syntax-wise, goquery stays as close as possible to jQuery, with the same function names when possible and the same chainable interface. Because jQuery is so widely used, the goquery author chose to follow its API rather than start anew (in the same spirit as Go's fmt package), even though some of its methods are less than intuitive (index(), for example).
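The UTF-8 requirement is a general pre-processing step rather than something specific to one library, so it can be sketched independently of goquery. The example below is written in Python, like the gazpacho samples on this page, and is only an illustration of the idea: to_utf8 is a hypothetical helper, the windows-1252 fallback is an assumption, and the meta-tag detection is deliberately naive. For Go, the goquery wiki lists the options actually intended for that library.

```python
import re
from typing import Optional

# Naive detection of <meta charset="..."> or
# <meta http-equiv="Content-Type" content="...; charset=...">
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def to_utf8(raw: bytes, declared: Optional[str] = None,
            fallback: str = "windows-1252") -> bytes:
    """Return `raw` re-encoded as UTF-8 (hypothetical helper for illustration).

    `declared` is a charset name obtained elsewhere, for example from the
    Content-Type response header.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    meta = META_CHARSET.search(raw[:1024])  # only inspect the start of the page
    if meta:
        candidates.append(meta.group(1).decode("ascii"))
    candidates.append("utf-8")

    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8")
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name, or bytes that do not fit it

    # Last resort: decode with the fallback and replace undecodable bytes.
    return raw.decode(fallback, errors="replace").encode("utf-8")


if __name__ == "__main__":
    page = '<meta charset="iso-8859-1"><p>café</p>'.encode("latin-1")
    print(to_utf8(page).decode("utf-8"))
```

Note that a wrongly declared single-byte charset such as iso-8859-1 will decode any byte sequence without error, so a declared charset is trusted here rather than verified.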

goquery can also download HTML by itself through Go's built-in HTTP client, but this is not recommended for web scraping because such requests are likely to be blocked.

Example Use

from gazpacho import get, Soup

# gazpacho can retrieve web pages
url = "https://webscraping.fyi/"
html = get(url)
# and parse them:
soup = Soup(html)
print(soup.find('title').text)

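# search for elements like beautifulsoup;
# mode="first" returns a single element (or None if nothing matches):
body = soup.find("div", {"class": "item"}, mode="first")
if body is not None:
    print(body.text)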

package main

import (
  "fmt"
  "strings"

  "github.com/PuerkitoBio/goquery"
)

func main() {
  // goquery.NewDocument(url) can load a document straight from a URL, but it
  // is deprecated; the usual approach is to fetch the page yourself and pass
  // the response body to goquery.NewDocumentFromReader.
  // Here the document is parsed from an HTML string instead:
  html := `<a href="http://example.com">Example</a>`
  doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
  if err != nil {
    fmt.Println("Error:", err)
    return
  }

  // Use the Selection.Find method to select elements from the document
  doc.Find("a").Each(func(i int, s *goquery.Selection) {
    // Use the Selection.Text method to get the text of the element
    fmt.Println(s.Text())
    // Selection.Attr returns the attribute value and whether it exists
    if href, ok := s.Attr("href"); ok {
      fmt.Println(href)
    }
  })
}