pholcus vs gerapy
Pholcus is a minimalistic web crawler library written in the Go programming language. It is designed to be flexible and easy to use, and it supports concurrent, distributed, and modular crawling.
Note that Pholcus is documented and maintained in Chinese and has no English resources other than the source code itself.
Gerapy is a distributed crawler management framework based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js.
It is built on top of the Scrapy framework and provides an easy-to-use interface for performing web scraping tasks. Gerapy also supports scheduling and distributed crawling, and includes a built-in web dashboard for monitoring and managing scraping tasks. Additionally, Gerapy is designed to be highly extensible, allowing users to create custom plugins and integrations.
Overall, Gerapy is a useful tool for those looking to automate web scraping tasks and extract data from websites.
Example Use
package main

import (
	// goquery fork bundled with Pholcus (assumed source of goquery.Document)
	"github.com/henrylee2cn/pholcus/common/goquery"
	"github.com/henrylee2cn/pholcus/exec"
	_ "github.com/henrylee2cn/pholcus/spider/standard" // standard spider
)

func main() {
	// create the spider object
	spider := exec.NewSpider(exec.NewTask("demo", "https://www.example.com"))
	// add a callback for URL routes matched by a regex pattern; ".*" matches any route
	spider.AddRule(".*", "Parse")
	// start the spider
	spider.Start()
}

// Parse is the callback registered for the ".*" rule above.
func Parse(self *exec.Spider, doc *goquery.Document) {
	// callbacks receive an HTML document reference
}
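
The idea of routing URLs to callbacks by regex pattern can be sketched on its own with the Go standard library. This is an illustration of the general technique, not Pholcus's implementation: the `Router`, `Handler`, `AddRule` and `Dispatch` names below are hypothetical, and the sketch assumes rules are tried in the order they were added, with the first match winning.

```go
package main

import (
	"fmt"
	"regexp"
)

// Handler is a hypothetical callback type; it is not part of Pholcus.
type Handler func(url string) string

// route pairs a compiled pattern with the callback to run on a match.
type route struct {
	pattern *regexp.Regexp
	handler Handler
}

// Router tries rules in the order they were added (an assumption of this sketch).
type Router struct {
	routes []route
}

// AddRule compiles the pattern and registers the callback for it.
func (r *Router) AddRule(pattern string, h Handler) error {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return err
	}
	r.routes = append(r.routes, route{pattern: re, handler: h})
	return nil
}

// Dispatch runs the first callback whose pattern matches the URL.
// The boolean reports whether any rule matched.
func (r *Router) Dispatch(url string) (string, bool) {
	for _, rt := range r.routes {
		if rt.pattern.MatchString(url) {
			return rt.handler(url), true
		}
	}
	return "", false
}

func main() {
	r := &Router{}
	// Register the specific rule first and the catch-all last.
	if err := r.AddRule(`/product/\d+$`, func(u string) string { return "product" }); err != nil {
		panic(err)
	}
	if err := r.AddRule(`.*`, func(u string) string { return "default" }); err != nil {
		panic(err)
	}
	for _, u := range []string{
		"https://www.example.com/product/42",
		"https://www.example.com/about",
	} {
		name, _ := r.Dispatch(u)
		fmt.Println(u, "->", name)
	}
}
```

Because a `.*` rule matches every URL, it only makes sense as the last rule in a first-match scheme like this one; placed first, it would shadow every more specific pattern.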