Skip to content

ruiavsgracy

MIT 8 3 1,731
469 (month) Oct 17 2018 0.8.5(1 year, 6 months ago)
235 2 1 MIT
1.33.0(a month ago) Feb 05 2023 917 (month)

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, aims to make crawling url as convenient as possible.

Ruia is inspired by scrapy however instead of Twisted it's based entirely on asyncio and aiohttp.

It also supports various features like cookies, headers, and proxy, which makes it very useful in dealing with complex web scraping tasks.

Gracy is an API client library based on httpx that provides an extra stability layer with:

  • Retry logic
  • Logging
  • Connection throttling
  • Tracking/Middleware

In web scraping, Gracy can be a convenient tool for creating scraper based API clients.

Example Use


#!/usr/bin/env python
"""
 Target: https://news.ycombinator.com/
 pip install aiofiles
"""
import aiofiles

from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    target_item = TextField(css_select="tr.athing")
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)
# 0. Import
import asyncio
from typing import Awaitable
from gracy import BaseEndpoint, Gracy, GracyConfig, LogEvent, LogLevel

# 1. Define your endpoints
class PokeApiEndpoint(BaseEndpoint):
    GET_POKEMON = "/pokemon/{NAME}" # 👈 Put placeholders as needed

# 2. Define your Graceful API
class GracefulPokeAPI(Gracy[str]):
    class Config:  # type: ignore
        BASE_URL = "https://pokeapi.co/api/v2/" # 👈 Optional BASE_URL
        # 👇 Define settings to apply for every request
        SETTINGS = GracyConfig(
          log_request=LogEvent(LogLevel.DEBUG),
          log_response=LogEvent(LogLevel.INFO, "{URL} took {ELAPSED}"),
          parser={
            "default": lambda r: r.json()
          }
        )

    async def get_pokemon(self, name: str) -> Awaitable[dict]:
        return await self.get(PokeApiEndpoint.GET_POKEMON, {"NAME": name})

    # Note: since Gracy is based on httpx we can customized the used client with custom headers etc"
    def _create_client(self) -> httpx.AsyncClient:
        client = super()._create_client()
        client.headers = {"User-Agent": f"My Scraper"} 
        return client

pokeapi = GracefulPokeAPI()

async def main():
    try:
      pokemon = await pokeapi.get_pokemon("pikachu")
      print(pokemon)

    finally:
        pokeapi.report_status("rich")


asyncio.run(main())

Alternatives / Similar