pholcus vs roach
Pholcus is a minimalistic web crawler library written in the Go programming language. It is designed to be flexible and easy to use, and it supports concurrent, distributed, and modular crawling.
Note that Pholcus is documented and maintained in Chinese and has no English resources other than the source code itself.
Roach is a complete web scraping toolkit for PHP. It is heavily inspired by the popular Scrapy package for Python.
Roach allows us to define spiders that crawl and scrape web documents. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well.
Just like Scrapy, Roach supports:
- Middlewares
- Item Pipelines
- Extensibility through Plugins
It’s your all-in-one resource for web scraping in PHP.
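The item pipeline idea can be illustrated without relying on Roach's own classes. The following is a language-agnostic sketch, written in Go to match the Pholcus example on this page; the `Item` and `Processor` types and the `runPipeline`, `trimFields`, and `requireTitle` functions are hypothetical names, not Roach's API.

```go
package main

import (
	"fmt"
	"strings"
)

// Item is a scraped record (hypothetical type for this sketch).
type Item map[string]string

// Processor is one pipeline stage. It returns the (possibly modified) item
// and false when the item should be dropped.
type Processor func(Item) (Item, bool)

// runPipeline passes the item through each stage in order and stops early
// as soon as a stage drops it.
func runPipeline(item Item, stages ...Processor) (Item, bool) {
	for _, stage := range stages {
		var keep bool
		item, keep = stage(item)
		if !keep {
			return nil, false
		}
	}
	return item, true
}

// trimFields is a cleaning stage: it strips surrounding whitespace.
func trimFields(it Item) (Item, bool) {
	out := make(Item, len(it))
	for k, v := range it {
		out[k] = strings.TrimSpace(v)
	}
	return out, true
}

// requireTitle is a validation stage: it drops items without a title.
func requireTitle(it Item) (Item, bool) {
	return it, it["title"] != ""
}

func main() {
	raw := Item{"title": "  Spiders  ", "subtitle": " intro "}
	item, ok := runPipeline(raw, trimFields, requireTitle)
	fmt.Println(item["title"], ok) // prints "Spiders true"
}
```

A persistence stage, such as writing the item to a database, would be one more `Processor` appended to the same chain.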
Example Use
package main

import (
	"github.com/PuerkitoBio/goquery"
	"github.com/henrylee2cn/pholcus/exec"
	_ "github.com/henrylee2cn/pholcus/spider/standard" // standard spider
)

func main() {
	// create the spider object
	spider := exec.NewSpider(exec.NewTask("demo", "https://www.example.com"))
	// add a callback for URLs matching a regex pattern; here it matches any route
	spider.AddRule(".*", "Parse")
	// start the spider
	spider.Start()
}

// define the callback here
func Parse(self *exec.Spider, doc *goquery.Document) {
	// callbacks receive the spider and a reference to the parsed HTML document
}
<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class RoachDocsSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://roach-php.dev/docs/spiders'
    ];

    public function parse(Response $response): \Generator
    {
        $title = $response->filter('h1')->text();

        $subtitle = $response
            ->filter('main > div:nth-child(2) p:first-of-type')
            ->text();

        yield $this->item([
            'title' => $title,
            'subtitle' => $subtitle,
        ]);
    }
}