gracy vs scrapy
Gracy is an API client library based on httpx that provides an extra stability layer with:
- Retry logic
- Logging
- Connection throttling
- Tracking/Middleware
In web scraping, Gracy can be a convenient tool for creating scraper-based API clients.
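To make the idea of a retry layer concrete, here is a minimal sketch written with the standard library only. It is not Gracy's implementation or API: the helper name `retry_async`, its parameters, and the exponential backoff policy are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple = (ConnectionError, TimeoutError),
) -> T:
    """Hypothetical helper: await `call()`, retrying with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A caller would wrap a single request in a zero-argument coroutine function and pass it in, for example `await retry_async(lambda: client.get(url))` with an `httpx.AsyncClient`, after adding the relevant httpx exception types to `retry_on`.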
Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.
Scrapy provides:
- A built-in way to follow links and extract data from multiple pages (crawling)
- Built-in handling of common web scraping tasks such as logging in, managing cookies, and following redirects.
Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.
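The benefit of non-blocking request handling can be sketched in a few lines. The example below uses the standard library's asyncio instead of Twisted (an assumption made to keep it self-contained), and the hypothetical `fake_fetch` simulates network latency with a sleep instead of sending real requests.

```python
import asyncio
import time


async def fake_fetch(url: str, latency: float) -> str:
    """Hypothetical stand-in for an HTTP request: waits `latency` seconds."""
    await asyncio.sleep(latency)
    return url


async def crawl(pages: dict) -> tuple:
    """Fetch every page concurrently; return (results, elapsed seconds)."""
    started = time.perf_counter()
    # All waits overlap, so the total time is close to the slowest request,
    # not the sum of all of them
    results = await asyncio.gather(
        *(fake_fetch(url, latency) for url, latency in pages.items())
    )
    return list(results), time.perf_counter() - started
```

Running `asyncio.run(crawl({"/a": 0.1, "/b": 0.1, "/c": 0.1}))` should take roughly 0.1 seconds rather than 0.3, because no request blocks the others while it waits.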
It also comes with a built-in mechanism for handling common web scraping problems, such as:
- handling HTTP errors
- handling broken links
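In a Scrapy project, much of this error handling is driven by settings. The fragment below is a sketch of a project's settings.py: the setting names are Scrapy's, but the values are illustrative assumptions, not recommendations.

```python
# settings.py (illustrative values, chosen for this example)

# Retry failed requests through Scrapy's retry middleware
RETRY_ENABLED = True
RETRY_TIMES = 3  # retries per request, in addition to the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Unsuccessful responses are normally filtered out before the spider callback;
# listing a status here lets the spider receive it, e.g. to record broken links
HTTPERROR_ALLOWED_CODES = [404]
```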
Scrapy also provides these features:
- Support for storing scraped data in various formats, such as CSV, JSON, and XML.
- Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
- Built-in support for handling common web scraping problems (like deduplication and URL filtering).
- Ability to easily extend its functionality using middlewares.
- Ability to easily extend output processing using pipelines.
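Deduplication and URL filtering, mentioned in the list above, can be sketched with the standard library. This is a simplified illustration and not Scrapy's own mechanism: the class `SeenUrlFilter` and its method names are hypothetical, and it compares plain URLs with the fragment stripped.

```python
from urllib.parse import urldefrag, urlsplit


class SeenUrlFilter:
    """Hypothetical filter: allow each URL once, and only on allowed domains."""

    def __init__(self, allowed_domains: set) -> None:
        self.allowed_domains = allowed_domains
        self.seen = set()

    def should_visit(self, url: str) -> bool:
        # "#section" fragments point at the same page, so drop them first
        url, _fragment = urldefrag(url)
        host = urlsplit(url).hostname or ""
        if host not in self.allowed_domains:
            return False  # off-site link: filtered out
        if url in self.seen:
            return False  # already scheduled: duplicate
        self.seen.add(url)
        return True
```

A crawler loop would call `should_visit(link)` for every extracted link and only schedule the ones that return `True`.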
Highlights
popular, css-selectors, xpath-selectors, community-tools, output-pipelines, middlewares, async, production, large-scale
Example Use
# 0. Import
import asyncio

import httpx
from gracy import BaseEndpoint, Gracy, GracyConfig, LogEvent, LogLevel


# 1. Define your endpoints
class PokeApiEndpoint(BaseEndpoint):
    GET_POKEMON = "/pokemon/{NAME}"  # 👈 Put placeholders as needed


# 2. Define your Graceful API
class GracefulPokeAPI(Gracy[str]):
    class Config:  # type: ignore
        BASE_URL = "https://pokeapi.co/api/v2/"  # 👈 Optional BASE_URL
        # 👇 Define settings to apply for every request
        SETTINGS = GracyConfig(
            log_request=LogEvent(LogLevel.DEBUG),
            log_response=LogEvent(LogLevel.INFO, "{URL} took {ELAPSED}"),
            parser={
                "default": lambda r: r.json()
            },
        )

    async def get_pokemon(self, name: str) -> dict:
        return await self.get(PokeApiEndpoint.GET_POKEMON, {"NAME": name})

    # Note: since Gracy is based on httpx, we can customize the client it uses (custom headers, etc.)
    def _create_client(self) -> httpx.AsyncClient:
        client = super()._create_client()
        client.headers = {"User-Agent": "My Scraper"}
        return client


pokeapi = GracefulPokeAPI()


async def main():
    try:
        pokemon = await pokeapi.get_pokemon("pikachu")
        print(pokemon)
    finally:
        # Print a summary of the requests made during this run
        pokeapi.report_status("rich")


asyncio.run(main())