ferretvspholcus
Ferret is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more. ferret allows users to focus on the data. It abstracts away the technical details and complexity of underlying technologies using its own declarative language. It is extremely portable, extensible, and fast.
Features
- Declarative language
- Support of both static and dynamic web pages
- Embeddable
- Extensible
Ferret is always implemented in Python through pyfer
Pholcus is a minimalistic web crawler library written in the Go programming language. It is designed to be flexible and easy to use, and it supports concurrent, distributed, and modular crawling.
Note that Pholcus is documented and maintained in the Chinese language and has no english resources other than the code source itself.
Example Use
// Example scraper for Google in Ferret:
LET google = DOCUMENT("https://www.google.com/", {
driver: "cdp",
userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})
HOVER(google, 'input[name="q"]')
WAIT(RAND(100))
INPUT(google, 'input[name="q"]', @criteria, 30)
WAIT(RAND(100))
CLICK(google, 'input[name="btnK"]')
WAITFOR EVENT "navigation" IN google
WAIT_ELEMENT(google, "#res")
LET results = ELEMENTS(google, X("//*[text() = 'Search Results']/following-sibling::*/*"))
FOR el IN results
RETURN {
title: INNER_TEXT(el, 'h3')?,
description: INNER_TEXT(el, X("//em/parent::*")),
url: ELEMENT(el, 'a')?.attributes.href
}
package main
import (
"github.com/henrylee2cn/pholcus/exec"
_ "github.com/henrylee2cn/pholcus/spider/standard" // standard spider
)
func main() {
// create spider object
spider := exec.NewSpider(exec.NewTask("demo", "https://www.example.com"))
// add a callback for URL route by regex pattern. In this case it's any route:
spider.AddRule(".*", "Parse")
// Start spider
spider.Start()
}
// define callback here
func Parse(self *exec.Spider, doc *goquery.Document) {
// callbacks receive HTMl document reference and
}