gazpachovscssselect

MIT 16 1 764

6.6 thousand (month) Dec 28 2012 1.1(4 years ago)

293 8 21 NOASSERTION

Apr 14 2012 7.5 million (month) 1.2.0(2 years ago)

gazpacho is a Python library for scraping web pages. It is designed to make it easy to extract information from a web page by providing a simple and intuitive API for working with the page's structure.

gazpacho uses the requests library to download the page and the lxml library to parse the HTML or XML code. It provides a way to search for elements in the page using CSS selectors, similar to BeautifulSoup.

To use gazpacho, you first need to install it via pip by running pip install gazpacho. Once it is installed, you can use the gazpacho.get() function to download a web page and create a gazpacho object. For example:

from gazpacho import get, Soup

url = "https://en.wikipedia.org/wiki/Web_scraping"
html = get(url)
soup = Soup(html)
print(soup.find('title').text)

You can also use gazpacho.get() with file-like objects, bytes or file paths.

Once you have a gazpacho object, you can use the find() and find_all() methods to search for elements in the page using CSS selectors, similar to BeautifulSoup.

gazpacho also supports searching using the select() method, which returns the first matching element, and the select_all() method, which returns all matching elements.

cssselect is a BSD-licensed Python library to parse CSS3 selectors and translate them to XPath 1.0 expressions.

XPath 1.0 expressions can be used in lxml or another XPath engine to find the matching elements in an XML or HTML document.

cssselect is used by other popular Python packages like parsel and scrapy but can also be used on it's own to generate valid XPath 1.0 expressions for parsing HTML and XML documents in other tools.

Note that because XPath selectors are more powerful than CSS selectors this translation is only possible one way. Converting XPath to CSS selectors is impractical and not supported by cssselect.

Example Use

from gazpacho import get, Soup

# gazpacho can retrieve web pages
url = "https://webscraping.fyi/"
html = get(url)
# and parse them:
soup = Soup(html)
print(soup.find('title').text)

# search for elements like beautifulsoup:
body = soup.find("div", {"class":"item"})
print(body.text)

from cssselect import GenericTranslator, SelectorError

translator = GenericTranslator()
try:
    expression = translator.css_to_xpath('div.content')
    print(expression)
    'descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' content ')]'
except SelectorError as e:
    print(f'Invalid selector {e}')