ferretvsdude
Ferret is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more. ferret allows users to focus on the data. It abstracts away the technical details and complexity of underlying technologies using its own declarative language. It is extremely portable, extensible, and fast.
Features
- Declarative language
- Support of both static and dynamic web pages
- Embeddable
- Extensible
Ferret is always implemented in Python through pyfer
Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.
The simplest web scraper will look like this:
from dude import select
@select(css="a")
def get_link(element):
return {"url": element.get_attribute("href")}
dude supports multiple parser backends:
- playwright
- lxml
- parsel
- beautifulsoup
- pyppeteer
- selenium
Example Use
// Example scraper for Google in Ferret:
LET google = DOCUMENT("https://www.google.com/", {
driver: "cdp",
userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})
HOVER(google, 'input[name="q"]')
WAIT(RAND(100))
INPUT(google, 'input[name="q"]', @criteria, 30)
WAIT(RAND(100))
CLICK(google, 'input[name="btnK"]')
WAITFOR EVENT "navigation" IN google
WAIT_ELEMENT(google, "#res")
LET results = ELEMENTS(google, X("//*[text() = 'Search Results']/following-sibling::*/*"))
FOR el IN results
RETURN {
title: INNER_TEXT(el, 'h3')?,
description: INNER_TEXT(el, X("//em/parent::*")),
url: ELEMENT(el, 'a')?.attributes.href
}
from dude import select
"""
This example demonstrates how to use Parsel + async HTTPX
To access an attribute, use:
selector.attrib["href"]
You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then:
selector.get()
To get the text, use ::text pseudo-element, then:
selector.get()
"""
@select(css="a.url", priority=2)
async def result_url(selector):
return {"url": selector.attrib["href"]}
# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
return {"url2": selector.get()}
@select(css=".title::text", priority=1)
async def result_title(selector):
return {"title": selector.get()}
@select(css=".description::text", priority=0)
async def result_description(selector):
return {"description": selector.get()}
if __name__ == "__main__":
import dude
dude.run(urls=["https://dude.ron.sh"], parser="parsel")