autoscrapervspholcus

MIT 1 2 6,638

3.0 thousand (month) Jul 26 2019 1.1.14(3 years ago)

7,580 1 7 Apache-2.0

Feb 15 2020 v1.3.4(5 years ago)

Autoscraper project is made for automatic web scraping to make scraping easy. It gets a url or the html content of a web page and a list of sample data which we want to scrape from that page. This data can be text, url or any html tag value of that page. It learns the scraping rules and returns the similar elements. Then you can use this learned object with new urls to get similar content or the exact same element of those new pages.

Autoscraper is minimalistic and auto-generative approach to web scraping. For example, here's a scraper that finds all titles on a stackoverflow.com page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Pholcus is a minimalistic web crawler library written in the Go programming language. It is designed to be flexible and easy to use, and it supports concurrent, distributed, and modular crawling.

Note that Pholcus is documented and maintained in the Chinese language and has no english resources other than the code source itself.

Highlights

popularminimalisticauto-generating

Example Use

package main

import (
    "github.com/henrylee2cn/pholcus/exec"
    _ "github.com/henrylee2cn/pholcus/spider/standard" // standard spider
)

func main() {
    // create spider object
    spider := exec.NewSpider(exec.NewTask("demo", "https://www.example.com"))
    // add a callback for URL route by regex pattern. In this case it's any route:
    spider.AddRule(".*", "Parse")
    // Start spider
    spider.Start()
}

// define callback here
func Parse(self *exec.Spider, doc *goquery.Document) {
    // callbacks receive HTMl document reference and 
}

Alternatives / Similar

colly

23,747 compare

pholcus

7,580 compare

geziyor

2,667 compare

dataflowkit

676 compare

scrapy

54,211 compare

rvest

1,498 compare

gocrawl

2,039 compare

ferret

5,716 compare

scrapyd

2,980 compare

node-crawler

6,733 compare

panther

2,977 compare

gracy

247 compare

spidr

813 compare

scrapydweb

3,218 compare

gerapy

3,365 compare

wombat

1,316 compare

ruia

1,754 compare

photon

11,149 compare

ralger

156 compare

roach

1,384 compare

dude

428 compare

ayakashi

213 compare

phpscraper

554 compare

php-spider

1,335 compare

crwlr-crawler

356 compare