Skip to content

pholcusvsautoscraper

Apache-2.0 6 1 7,526
Feb 15 2020 v1.3.4(4 years ago)
5,894 2 11 MIT
1.1.14(1 year, 8 months ago) Jul 26 2019 3.9 thousand (month)

Pholcus is a minimalistic web crawler library written in the Go programming language. It is designed to be flexible and easy to use, and it supports concurrent, distributed, and modular crawling.

Note that Pholcus is documented and maintained in the Chinese language and has no english resources other than the code source itself.

Autoscraper project is made for automatic web scraping to make scraping easy. It gets a url or the html content of a web page and a list of sample data which we want to scrape from that page. This data can be text, url or any html tag value of that page. It learns the scraping rules and returns the similar elements. Then you can use this learned object with new urls to get similar content or the exact same element of those new pages.

Autoscraper is minimalistic and auto-generative approach to web scraping. For example, here's a scraper that finds all titles on a stackoverflow.com page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Highlights


popularminimalisticauto-generating

Example Use


package main

import (
    "github.com/henrylee2cn/pholcus/exec"
    _ "github.com/henrylee2cn/pholcus/spider/standard" // standard spider
)

func main() {
    // create spider object
    spider := exec.NewSpider(exec.NewTask("demo", "https://www.example.com"))
    // add a callback for URL route by regex pattern. In this case it's any route:
    spider.AddRule(".*", "Parse")
    // Start spider
    spider.Start()
}

// define callback here
func Parse(self *exec.Spider, doc *goquery.Document) {
    // callbacks receive HTMl document reference and 
}

Alternatives / Similar