
ruia vs dude

MIT 8 3 1,731
469 (month) Oct 17 2018 0.8.5(1 year, 6 months ago)
411 2 24 MIT
0.1.3(8 months ago) Feb 20 2022 27 (month)

Ruia is an async web scraping micro-framework written with asyncio and aiohttp. It aims to make crawling URLs as convenient as possible.

Ruia is inspired by Scrapy; however, instead of Twisted, it is based entirely on asyncio and aiohttp.

It also supports features such as cookies, custom headers, and proxies, which help with more complex web scraping tasks.
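To give a feel for what "based entirely on asyncio" means in practice, here is a minimal, standard-library-only sketch of crawling with a bounded number of concurrent requests. It is not Ruia's implementation: `fake_fetch`, `crawl`, and the `concurrency` parameter are hypothetical names chosen for illustration, and the network call is simulated with `asyncio.sleep` so the example runs offline.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request (hypothetical; performs no network access)."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"<html>{url}</html>"


async def crawl(urls, concurrency: int = 2):
    """Fetch all URLs, allowing at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_fetch(url):
        # The semaphore blocks here once `concurrency` fetches are running.
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


pages = asyncio.run(
    crawl(["https://example.com/1", "https://example.com/2", "https://example.com/3"])
)
```

In a real spider the simulated fetch would be replaced by an actual HTTP client call. The semaphore plays roughly the role that a concurrency setting plays in a crawler.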

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to let you build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

The simplest web scraper will look like this:

from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}
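The Flask-inspired decorator style can be pictured as a registry: decorating a function records it under a selector, and a runner later calls every recorded handler on the matching elements. The following is a toy sketch of that idea only, not dude's actual code. The names `select_tag`, `_handlers`, and `run_handlers` are hypothetical, and elements are modelled as plain `(tag, attributes)` pairs instead of real DOM nodes.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping a tag name to its handler functions.
_handlers: Dict[str, List[Callable[[dict], dict]]] = {}


def select_tag(tag: str):
    """Register the decorated function as a handler for `tag` (illustrative only)."""

    def decorator(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        _handlers.setdefault(tag, []).append(func)
        return func  # the function itself is returned unchanged

    return decorator


@select_tag("a")
def get_link(attrs: dict) -> dict:
    return {"url": attrs["href"]}


def run_handlers(elements: List[Tuple[str, dict]]) -> List[dict]:
    """Apply every registered handler to the elements whose tag matches."""
    results = []
    for tag, attrs in elements:
        for handler in _handlers.get(tag, []):
            results.append(handler(attrs))
    return results


# Only the "a" element has a registered handler; the "p" element is ignored.
results = run_handlers([("a", {"href": "https://example.com"}), ("p", {})])
```

The appeal of this pattern is that each handler stays a small, independent function, while the framework owns fetching, element matching, and collecting results.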

dude supports multiple parser backends:
- playwright
- lxml
- parsel
- beautifulsoup
- pyppeteer
- selenium

Example Use


#!/usr/bin/env python
"""
 Target: https://news.ycombinator.com/
 pip install aiofiles
"""
import aiofiles

from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    target_item = TextField(css_select="tr.athing")
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)


from dude import select

"""
This example demonstrates how to use Parsel + async HTTPX
To access an attribute, use:
    selector.attrib["href"]
You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then:
    selector.get()
To get the text, use ::text pseudo-element, then:
    selector.get()
"""


@select(css="a.url", priority=2)
async def result_url(selector):
    return {"url": selector.attrib["href"]}


# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
    return {"url2": selector.get()}


@select(css=".title::text", priority=1)
async def result_title(selector):
    return {"title": selector.get()}


@select(css=".description::text", priority=0)
async def result_description(selector):
    return {"description": selector.get()}


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"], parser="parsel")

Alternatives / Similar