geziyor vs roach
Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.
Features:
- JS Rendering
- 5,000+ Requests/Sec
- Caching (Memory/Disk/LevelDB)
- Automatic Data Exporting (JSON, CSV, or custom)
- Metrics (Prometheus, Expvar, or custom)
- Limit Concurrency (Global/Per Domain)
- Request Delays (Constant/Randomized)
- Cookies, Middlewares, robots.txt
- Automatic response decoding to UTF-8
- Proxy management (Single, Round-Robin, Custom)
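To make the "Round-Robin" proxy option in the list above concrete, here is a minimal sketch of the idea using only the Go standard library. It is not Geziyor's own implementation: the `roundRobin` type, its `next` method, and the proxy addresses are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"sync/atomic"
)

// roundRobin hands out proxy URLs in a fixed rotation.
// Hypothetical helper for illustration; not part of Geziyor.
type roundRobin struct {
	proxies []*url.URL
	counter uint64
}

// newRoundRobin parses the given proxy addresses once, up front.
func newRoundRobin(addrs ...string) (*roundRobin, error) {
	if len(addrs) == 0 {
		return nil, fmt.Errorf("at least one proxy address is required")
	}
	rr := &roundRobin{}
	for _, a := range addrs {
		u, err := url.Parse(a)
		if err != nil {
			return nil, fmt.Errorf("parse proxy %q: %w", a, err)
		}
		rr.proxies = append(rr.proxies, u)
	}
	return rr, nil
}

// next returns the next proxy in the rotation. The atomic counter
// makes it safe to call from many goroutines at once.
func (rr *roundRobin) next() *url.URL {
	n := atomic.AddUint64(&rr.counter, 1) - 1
	return rr.proxies[n%uint64(len(rr.proxies))]
}

func main() {
	// Placeholder addresses, assumed for the example.
	rr, err := newRoundRobin("http://proxy-a.example:8080", "http://proxy-b.example:8080")
	if err != nil {
		panic(err)
	}
	for i := 0; i < 3; i++ {
		fmt.Println(rr.next().Host)
	}
}
```

A rotator like this could be adapted to the `Proxy` field of `net/http`'s `Transport`, which expects a `func(*http.Request) (*url.URL, error)`.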
Roach is a complete web scraping toolkit for PHP. It is heavily inspired by the popular Scrapy package for Python.
Roach allows us to define spiders that crawl and scrape web documents. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well.
Just like Scrapy, Roach supports:
- Middlewares
- Item Pipelines
- Extensibility through Plugins
It’s your all-in-one resource for web scraping in PHP.
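The item pipeline idea described above (clean, filter, and pass extracted data along) can be sketched in a few lines. Roach itself is a PHP library, so the Go code below only illustrates the concept and does not reflect Roach's API. The `Item` and `Processor` types and the two example steps are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Item is a scraped record. The field names used below are hypothetical.
type Item map[string]string

// Processor transforms an item and reports whether it should be kept.
type Processor func(Item) (Item, bool)

// trimFields strips surrounding whitespace from every value.
func trimFields(it Item) (Item, bool) {
	out := make(Item, len(it))
	for k, v := range it {
		out[k] = strings.TrimSpace(v)
	}
	return out, true
}

// requireTitle drops items whose "title" field is empty.
func requireTitle(it Item) (Item, bool) {
	return it, it["title"] != ""
}

// runPipeline passes the item through each step in order and
// stops as soon as one step decides to drop it.
func runPipeline(it Item, steps ...Processor) (Item, bool) {
	for _, step := range steps {
		var keep bool
		if it, keep = step(it); !keep {
			return nil, false
		}
	}
	return it, true
}

func main() {
	items := []Item{
		{"title": "  Spiders  ", "subtitle": "Defining crawl logic"},
		{"title": "   "},
	}
	for _, it := range items {
		if cleaned, ok := runPipeline(it, trimFields, requireTitle); ok {
			fmt.Printf("kept: %q\n", cleaned["title"])
		} else {
			fmt.Println("dropped")
		}
	}
}
```

Ordering matters in this design: trimming runs before validation, so a title made only of whitespace is treated as empty and dropped.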
Example Use
// This example extracts all quotes from quotes.toscrape.com and exports them to a JSON file.
package main

import (
	"github.com/PuerkitoBio/goquery"
	"github.com/geziyor/geziyor"
	"github.com/geziyor/geziyor/client"
	"github.com/geziyor/geziyor/export"
)

func main() {
	geziyor.NewGeziyor(&geziyor.Options{
		StartURLs: []string{"http://quotes.toscrape.com/"},
		ParseFunc: quotesParse,
		Exporters: []export.Exporter{&export.JSON{}},
	}).Start()
}

// quotesParse exports every quote on the page, then follows the "next" link if there is one.
func quotesParse(g *geziyor.Geziyor, r *client.Response) {
	r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
		g.Exports <- map[string]interface{}{
			"text":   s.Find("span.text").Text(),
			"author": s.Find("small.author").Text(),
		}
	})
	if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
		g.Get(r.JoinURL(href), quotesParse)
	}
}
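The pagination step in the Geziyor example relies on turning a relative `href` into an absolute URL before requesting it. As a rough sketch of that resolution step, assuming it behaves like standard URL reference resolution, the same result can be reproduced with `net/url` alone. The `resolveLink` helper is a hypothetical name and is not part of Geziyor.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink resolves href against the URL of the page it was found on.
// Hypothetical helper for illustration; not part of Geziyor.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", fmt.Errorf("parse page URL: %w", err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parse href: %w", err)
	}
	// ResolveReference applies the RFC 3986 rules for relative references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	next, err := resolveLink("http://quotes.toscrape.com/", "/page/2/")
	if err != nil {
		panic(err)
	}
	fmt.Println(next) // http://quotes.toscrape.com/page/2/
}
```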
<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class RoachDocsSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://roach-php.dev/docs/spiders'
    ];

    public function parse(Response $response): \Generator
    {
        $title = $response->filter('h1')->text();

        $subtitle = $response
            ->filter('main > div:nth-child(2) p:first-of-type')
            ->text();

        yield $this->item([
            'title' => $title,
            'subtitle' => $subtitle,
        ]);
    }
}