roach vs ruia
Roach is a complete web scraping toolkit for PHP. It is heavily inspired by the popular Scrapy package for Python.
Roach allows us to define spiders that crawl and scrape web documents. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well.
Just like Scrapy, Roach supports:
- Middlewares
- Item Pipelines
- Extendibility through Plugins
It’s your all-in-one resource for web scraping in PHP.
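To make the idea of a pipeline that cleans, persists, and otherwise processes extracted data concrete, here is a minimal sketch in plain Python. It is not Roach's API: the names `strip_whitespace`, `drop_untitled`, `InMemoryStore`, and `run_pipeline` are hypothetical and exist only for illustration. It assumes each step receives an item and either returns it (possibly modified) or returns `None` to drop it.

```python
from dataclasses import dataclass, field
from typing import Optional


def strip_whitespace(item: dict) -> Optional[dict]:
    """Cleaning step: trim every string value in the item."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


def drop_untitled(item: dict) -> Optional[dict]:
    """Filtering step: discard items whose title is empty."""
    return item if item.get("title") else None


@dataclass
class InMemoryStore:
    """Persistence step: stands in for a database or file writer."""
    rows: list = field(default_factory=list)

    def __call__(self, item: dict) -> Optional[dict]:
        self.rows.append(item)
        return item


def run_pipeline(item: dict, processors: list) -> Optional[dict]:
    """Pass the item through each step in order; stop as soon as one drops it."""
    for process in processors:
        item = process(item)
        if item is None:
            return None
    return item


store = InMemoryStore()
pipeline = [strip_whitespace, drop_untitled, store]

# The first item is cleaned and stored; the second has a blank title and is dropped.
run_pipeline({"title": "  Spiders  ", "subtitle": "Intro"}, pipeline)
run_pipeline({"title": "   ", "subtitle": "No title"}, pipeline)
```

Ordering matters in this sketch: cleaning runs before filtering so that a whitespace-only title counts as empty, and persistence runs last so that only surviving items are stored.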
Ruia is an async web scraping micro-framework written with asyncio and aiohttp. It aims to make crawling URLs as convenient as possible.
Ruia is inspired by Scrapy, but instead of Twisted it is based entirely on asyncio and aiohttp.
It also supports features such as cookies, headers, and proxies, which makes it useful for complex web scraping tasks.
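The asyncio-based approach can be illustrated with a small, standard-library-only sketch of bounded concurrent fetching. This is not Ruia's implementation: `fake_fetch` and `crawl` are hypothetical names, the `example.com` URLs are placeholders, and the network call is replaced by a short sleep so the example runs anywhere. It assumes a semaphore is an acceptable way to cap the number of requests in flight.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request; sleeps instead of touching the network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, concurrency=2):
    """Fetch all URLs concurrently, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0  # highest number of simultaneous fetches observed

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_fetch(url)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(worker(u) for u in urls))
    return pages, peak


urls = [f"https://example.com/page/{n}" for n in range(5)]
pages, peak = asyncio.run(crawl(urls, concurrency=2))
```

While one request waits on I/O, the event loop runs the others, which is why a single thread can keep several fetches in progress at once.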
Example Use
<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class RoachDocsSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://roach-php.dev/docs/spiders'
    ];

    public function parse(Response $response): \Generator
    {
        // Extract the page heading and the first paragraph via CSS selectors.
        $title = $response->filter('h1')->text();

        $subtitle = $response
            ->filter('main > div:nth-child(2) p:first-of-type')
            ->text();

        // Emit the scraped values as an item.
        yield $this->item([
            'title' => $title,
            'subtitle' => $subtitle,
        ]);
    }
}

#!/usr/bin/env python
"""
 Target: https://news.ycombinator.com/
 pip install aiofiles
"""
import aiofiles

from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    # Each element matching target_item yields one item.
    target_item = TextField(css_select="tr.athing")
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        # Hook applied to the extracted title value.
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        # Append each item's title to a local text file.
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)
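
The `clean_title` method in the example above follows a `clean_<field>` naming convention. One way such a hook could be looked up and applied is sketched below using only the standard library. This is an assumption about the general pattern, not Ruia's actual internals: the `Record` class and its `cleaned` method are hypothetical, and the input values are made up for illustration.

```python
import asyncio


class Record:
    """Hypothetical item type whose raw values pass through clean_<field> hooks."""

    def __init__(self, **raw):
        self.raw = raw

    async def clean_title(self, value):
        # Same cleaning rule as in the spider example: trim surrounding whitespace.
        return value.strip()

    async def cleaned(self):
        """Return a dict of field values, applying a clean_<field> hook if one exists."""
        result = {}
        for name, value in self.raw.items():
            hook = getattr(self, f"clean_{name}", None)
            result[name] = await hook(value) if hook else value
        return result


record = Record(title="  Show HN: Ruia  ", url="https://example.com/")
cleaned = asyncio.run(record.cleaned())
```

Fields without a matching hook, such as `url` here, pass through unchanged.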