Skip to content

ruiavsphpscraper

MIT 8 3 1,730
1.1 thousand (month) Oct 17 2018 0.8.5(1 year, 5 months ago)
472 2 21 GPL-3.0-or-later
2.0.0(8 months ago) May 04 2020 127 (month)

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, aims to make crawling url as convenient as possible.

Ruia is inspired by scrapy however instead of Twisted it's based entirely on asyncio and aiohttp.

It also supports various features like cookies, headers, and proxy, which makes it very useful in dealing with complex web scraping tasks.

PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. Instead, you can just go to a website and get the relevant information for your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

  • Direct access to page basic features like: Meta data, Links, Images, Headings, Content, Keywords etc.
  • File downloading.
  • RSS, Sitemap and other feed processing.
  • CSV, XML and JSON file processing.

Example Use


#!/usr/bin/env python
"""
 Target: https://news.ycombinator.com/
 pip install aiofiles
"""
import aiofiles

from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    target_item = TextField(css_select="tr.athing")
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)
// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;
// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');

// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"

// or pre-defined variables covering basic page data:
$web->links;  // for all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline;  // basic page outline
$web->cleanOutlineWithParagraphs;  // basic page outline

Alternatives / Similar