ferretvsscrapydweb
Ferret is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more. ferret allows users to focus on the data. It abstracts away the technical details and complexity of underlying technologies using its own declarative language. It is extremely portable, extensible, and fast.
Features
- Declarative language
- Support of both static and dynamic web pages
- Embeddable
- Extensible
Ferret is always implemented in Python through pyfer
ScrapydWeb is a web-based management tool for the Scrapyd service. It is built using the Python Flask framework and allows you to easily manage and monitor your Scrapy spider projects through a web interface.
ScrapydWeb allows you to view the status of your running spiders, view the logs of completed spiders, schedule new spider runs, and manage spider settings and configurations.
ScrapydWeb provides a simple way to manage your scraping tasks and allows you to schedule and run multiple spiders simultaneously. It also provides a user-friendly web interface that makes it easy to view the status of your spiders and monitor their progress.
You can install the package via pip by running pip install scrapydweb
and then you can run the package by
running scrapydweb command in your command prompt.
It will start a web server that you can access through your web browser at http://localhost:6800/
You will need to have Scrapyd running in order to use ScrapydWeb,
Scrapyd is a service for running Scrapy spiders, it allows you to schedule spiders to run at regular intervals
and also allows you to run spiders on remote machines.
Example Use
// Example scraper for Google in Ferret:
LET google = DOCUMENT("https://www.google.com/", {
driver: "cdp",
userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})
HOVER(google, 'input[name="q"]')
WAIT(RAND(100))
INPUT(google, 'input[name="q"]', @criteria, 30)
WAIT(RAND(100))
CLICK(google, 'input[name="btnK"]')
WAITFOR EVENT "navigation" IN google
WAIT_ELEMENT(google, "#res")
LET results = ELEMENTS(google, X("//*[text() = 'Search Results']/following-sibling::*/*"))
FOR el IN results
RETURN {
title: INNER_TEXT(el, 'h3')?,
description: INNER_TEXT(el, X("//em/parent::*")),
url: ELEMENT(el, 'a')?.attributes.href
}