
ruia vs gracy

Apache-2.0 9 3 1,743
414 (month) Oct 17 2018 0.8.5(2022-09-06 08:54:56 ago)
248 2 - MIT
Feb 05 2023 6.8 thousand (month) 1.34.0(2024-11-27 14:57:34 ago)

Ruia is an async web scraping micro-framework written with asyncio and aiohttp. It aims to make crawling URLs as convenient as possible.

Ruia is inspired by Scrapy; however, instead of Twisted, it is based entirely on asyncio and aiohttp.

It also supports features such as cookies, custom headers, and proxies, which makes it useful for complex web scraping tasks.
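To illustrate the asyncio model that Ruia builds on, here is a minimal standard-library sketch of bounded concurrent fetching. It does not use Ruia or aiohttp: `fake_fetch` and `crawl` are hypothetical names, and the network request is simulated with `asyncio.sleep`.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request (hypothetical; no network I/O happens)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls: list[str], concurrency: int = 2) -> list[str]:
    """Fetch all URLs, allowing at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_fetch(url: str) -> str:
        # The semaphore blocks here once `concurrency` fetches are running.
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
    print(len(pages))
```

The semaphore plays a role loosely comparable to a spider-level concurrency limit: many requests are scheduled at once, but only a fixed number are in flight at any moment.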

Gracy is an API client library based on httpx that provides an extra stability layer with:

  • Retry logic
  • Logging
  • Connection throttling
  • Tracking/Middleware

In web scraping, Gracy can be a convenient tool for creating scraper-based API clients.
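As a rough idea of what a retry-and-logging layer does, the sketch below implements retry with exponential backoff using only the standard library. It is not Gracy's API or implementation: `with_retry` and `FlakyEndpoint` are hypothetical names, and the failing endpoint is simulated.

```python
import asyncio
import logging

logger = logging.getLogger("retry_sketch")


async def with_retry(make_call, attempts: int = 3, base_delay: float = 0.01):
    """Await make_call(); on failure, log and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except ConnectionError as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d failed (%s); retrying in %.2fs", attempt, exc, delay
            )
            await asyncio.sleep(delay)


class FlakyEndpoint:
    """Hypothetical endpoint that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def __call__(self) -> dict:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return {"name": "pikachu"}


if __name__ == "__main__":
    endpoint = FlakyEndpoint(failures=2)
    print(asyncio.run(with_retry(endpoint)), "after", endpoint.calls, "calls")
```

A library such as Gracy packages this kind of behavior behind configuration, so the scraper code itself stays free of retry loops.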

Example Use


```python
#!/usr/bin/env python
"""
 Target: https://news.ycombinator.com/
 pip install aiofiles
"""
import aiofiles

from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    target_item = TextField(css_select="tr.athing")
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)
```
```python
# 0. Import
import asyncio

import httpx  # needed for the httpx.AsyncClient annotation below
from gracy import BaseEndpoint, Gracy, GracyConfig, LogEvent, LogLevel


# 1. Define your endpoints
class PokeApiEndpoint(BaseEndpoint):
    GET_POKEMON = "/pokemon/{NAME}"  # 👈 Put placeholders as needed


# 2. Define your Graceful API
class GracefulPokeAPI(Gracy[str]):
    class Config:  # type: ignore
        BASE_URL = "https://pokeapi.co/api/v2/"  # 👈 Optional BASE_URL
        # 👇 Define settings to apply for every request
        SETTINGS = GracyConfig(
            log_request=LogEvent(LogLevel.DEBUG),
            log_response=LogEvent(LogLevel.INFO, "{URL} took {ELAPSED}"),
            parser={
                "default": lambda r: r.json()
            }
        )

    async def get_pokemon(self, name: str) -> dict:
        return await self.get(PokeApiEndpoint.GET_POKEMON, {"NAME": name})

    # Note: since Gracy is based on httpx, we can customize the client it uses
    # (custom headers, etc.)
    def _create_client(self) -> httpx.AsyncClient:
        client = super()._create_client()
        client.headers = {"User-Agent": "My Scraper"}
        return client


pokeapi = GracefulPokeAPI()


async def main():
    try:
        pokemon = await pokeapi.get_pokemon("pikachu")
        print(pokemon)
    finally:
        pokeapi.report_status("rich")


asyncio.run(main())
```


