ferret vs scrapy

Apache-2.0 52 7 5,716

58.1 thousand (month) Aug 06 2019 v0.18.0 (1 year, 10 months ago)

54,211 30 619 BSD-3-Clause

Jul 26 2019 1.4 million (month) 2.12.0 (3 months ago)

Ferret is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more. Ferret lets users focus on the data: it abstracts away the technical details and complexity of the underlying technologies behind its own declarative language. It is extremely portable, extensible, and fast.

Features

Declarative language
Support of both static and dynamic web pages
Embeddable
Extensible

Ferret is written in Go and is also available in Python through pyfer.

Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.

Scrapy provides:

A built-in way to follow links and extract data from multiple pages (crawling)
Handling common web scraping tasks such as logging in, handling cookies, and handling redirects.
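
The link-following idea above can be sketched without Scrapy at all. The block below is a minimal, standard-library illustration, not Scrapy's API: the `PAGES` dictionary, the `example.test` URLs, and the `crawl` and `LinkAndTitleParser` names are all hypothetical, and an in-memory dictionary stands in for the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML body (no network access needed).
PAGES = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<h1>Page A</h1> <a href="/b">B</a>',
    "https://example.test/b": '<h1>Page B</h1> <a href="/">home</a>',
}


class LinkAndTitleParser(HTMLParser):
    """Collects href values of <a> tags and the text of the first <h1>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = None
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and self.title is None:
            self.title = data.strip()


def crawl(start_url, pages):
    """Breadth-first crawl that visits each URL once and returns one item per page."""
    seen = {start_url}
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        body = pages.get(url)
        if body is None:  # page not available: skip it, as with a broken link
            continue
        parser = LinkAndTitleParser()
        parser.feed(body)
        items.append({"url": url, "title": parser.title})
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links against the page URL
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return items
```

Calling `crawl("https://example.test/", PAGES)` walks all three pages once each, even though they link back to one another. In Scrapy the same roles are played by spider callbacks and the scheduler rather than an explicit queue.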

Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.
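
To illustrate what non-blocking request handling buys, here is a small sketch using Python's standard `asyncio` module. This is an assumption made for illustration: Scrapy uses Twisted, not this code, and `fake_fetch`, `fetch_all`, and `run_demo` are hypothetical names. A timed sleep stands in for network latency.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for a network request: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return url, 200


async def fetch_all(urls):
    # gather() runs all coroutines concurrently and keeps results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def run_demo():
    urls = [f"https://example.test/page/{i}" for i in range(5)]
    started = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    elapsed = time.perf_counter() - started
    return results, elapsed
```

Because the five waits overlap, the elapsed time reported by `run_demo()` should be close to one delay rather than five. That overlap is the property that lets an event-driven scraper keep many requests in flight at once.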

It also comes with a built-in mechanism for handling common web scraping problems, such as:

handling HTTP errors
handling broken links
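
A rough sketch of this kind of error handling is shown below. It is not Scrapy's retry implementation: the `RETRYABLE` status set, the `max_retries` default, and the `fetch_with_retry` and `make_flaky_fetch` helpers are assumptions chosen for the example, and the fetch function is faked so the block runs offline.

```python
# Status codes treated as temporary failures in this sketch (an assumption).
RETRYABLE = {408, 429, 500, 502, 503, 504}


def fetch_with_retry(fetch, url, max_retries=2):
    """Call fetch(url) -> (status, body); retry temporary errors, give up on others.

    Returns the body on success, or None when the page cannot be fetched
    (for example a 404 from a broken link, or retries exhausted).
    """
    retries = 0
    while True:
        status, body = fetch(url)
        if status < 400:
            return body
        if status in RETRYABLE and retries < max_retries:
            retries += 1
            continue
        return None


def make_flaky_fetch(statuses):
    """Build a fake fetch that returns the given statuses one call at a time."""
    remaining = list(statuses)

    def fetch(url):
        status = remaining.pop(0)
        return status, f"body of {url}" if status < 400 else ""

    return fetch
```

With `make_flaky_fetch([503, 503, 200])` the call succeeds on the third attempt, while `make_flaky_fetch([404])` returns `None` immediately because a missing page is not worth retrying.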

Scrapy also provides these features:

Support for storing scraped data in various formats, such as CSV, JSON, and XML.
Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
Built-in support for handling common web scraping problems (like deduplication and URL filtering).
Ability to easily extend its functionality using middlewares.
Ability to easily extend output processing using pipelines.
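
The pipeline and export ideas in the list above can be sketched with the standard library alone. The block below is only an illustration of the pattern, not Scrapy's pipeline or feed-export API: every function name and the sample item fields (`name`, `price`) are hypothetical.

```python
import csv
import io
import json


def strip_whitespace(item):
    """Pipeline stage: trim surrounding whitespace from string fields."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


def drop_missing_price(item):
    """Pipeline stage: discard items without a price by returning None."""
    return item if item.get("price") else None


def run_pipeline(items, stages):
    """Pass each item through the stages in order; a stage returning None drops it."""
    for item in items:
        for stage in stages:
            item = stage(item)
            if item is None:
                break
        else:  # no stage dropped the item
            yield item


def to_json(items):
    """Serialize a list of items as a JSON array."""
    return json.dumps(items)


def to_csv(items, fields):
    """Serialize a list of items as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()
```

A typical use would be `list(run_pipeline(raw_items, [strip_whitespace, drop_missing_price]))` followed by `to_csv(...)` or `to_json(...)`. Keeping each stage a small function that either returns the item or drops it is what makes this style of post-processing easy to extend.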

Highlights

popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale

Example Use

// Example scraper for Google in Ferret:
LET google = DOCUMENT("https://www.google.com/", {
    driver: "cdp",
    userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})

HOVER(google, 'input[name="q"]')
WAIT(RAND(100))
INPUT(google, 'input[name="q"]', @criteria, 30)
WAIT(RAND(100))
CLICK(google, 'input[name="btnK"]')

WAITFOR EVENT "navigation" IN google

WAIT_ELEMENT(google, "#res")

LET results = ELEMENTS(google, X("//*[text() = 'Search Results']/following-sibling::*/*"))

FOR el IN results
    RETURN {
        title: INNER_TEXT(el, 'h3')?,
        description: INNER_TEXT(el, X("//em/parent::*")),
        url: ELEMENT(el, 'a')?.attributes.href
    }

Alternatives / Similar

colly 23,747
pholcus 7,580
geziyor 2,667
dataflowkit 676
scrapy 54,211
rvest 1,498
gocrawl 2,039
scrapyd 2,980
node-crawler 6,733
panther 2,977
autoscraper 6,638
gracy 247
spidr 813
scrapydweb 3,218
gerapy 3,365
wombat 1,316
ruia 1,754
photon 11,149
ralger 156
roach 1,384
dude 428
ayakashi 213
phpscraper 554
php-spider 1,335
crwlr-crawler 356