dude vs gracy

dude: AGPL-3.0 · 32 · 2 · 425 · 54 (month) · Feb 20 2022 · 0.1.3 (2023-08-01 20:28:33)
gracy: 248 · 2 · - · MIT · Feb 05 2023 · 6.8 thousand (month) · 1.34.0 (2024-11-27 14:57:34)

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. Its design is inspired by Flask and aims to let you build a web scraper in just a few lines of code, with an easy-to-learn syntax.

The simplest web scraper will look like this:

```python
from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}
```

Dude supports multiple parser backends:

- playwright
- lxml
- parsel
- beautifulsoup
- pyppeteer
- selenium

Gracy is an API client library based on httpx that provides an extra stability layer with:

  • Retry logic
  • Logging
  • Connection throttling
  • Tracking/Middleware

In web scraping, Gracy can be a convenient tool for creating scraper-based API clients.
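As a rough picture of what a retry layer does, the following standard-library sketch retries an async call with a doubling delay. It is not Gracy's code or API: `with_retry`, `flaky_fetch`, and the delay values are hypothetical names and numbers chosen for illustration, and Gracy exposes retries through its own configuration instead.

```python
import asyncio


async def with_retry(make_call, attempts=3, base_delay=0.01):
    """Await make_call() up to `attempts` times, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)
            delay *= 2


# Hypothetical flaky endpoint: fails twice, then succeeds.
calls = {"count": 0}


async def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"name": "pikachu"}


result = asyncio.run(with_retry(flaky_fetch))
print(result, "after", calls["count"], "calls")  # {'name': 'pikachu'} after 3 calls
```

Wrapping the call in one place keeps retry policy out of the scraping logic, which is the kind of separation a stability layer is meant to provide.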

Example Use


```python
from dude import select

"""
This example demonstrates how to use Parsel + async HTTPX

To access an attribute, use:
    selector.attrib["href"]
You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then:
    selector.get()
To get the text, use ::text pseudo-element, then:
    selector.get()
"""


@select(css="a.url", priority=2)
async def result_url(selector):
    return {"url": selector.attrib["href"]}


# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
    return {"url2": selector.get()}


@select(css=".title::text", priority=1)
async def result_title(selector):
    return {"title": selector.get()}


@select(css=".description::text", priority=0)
async def result_description(selector):
    return {"description": selector.get()}


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"], parser="parsel")
```
```python
# 0. Import
import asyncio

import httpx
from gracy import BaseEndpoint, Gracy, GracyConfig, LogEvent, LogLevel


# 1. Define your endpoints
class PokeApiEndpoint(BaseEndpoint):
    GET_POKEMON = "/pokemon/{NAME}"  # 👈 Put placeholders as needed


# 2. Define your Graceful API
class GracefulPokeAPI(Gracy[str]):
    class Config:  # type: ignore
        BASE_URL = "https://pokeapi.co/api/v2/"  # 👈 Optional BASE_URL
        # 👇 Define settings to apply for every request
        SETTINGS = GracyConfig(
            log_request=LogEvent(LogLevel.DEBUG),
            log_response=LogEvent(LogLevel.INFO, "{URL} took {ELAPSED}"),
            parser={"default": lambda r: r.json()},
        )

    async def get_pokemon(self, name: str) -> dict:
        return await self.get(PokeApiEndpoint.GET_POKEMON, {"NAME": name})

    # Note: since Gracy is based on httpx, we can customize the client it uses (custom headers, etc.)
    def _create_client(self) -> httpx.AsyncClient:
        client = super()._create_client()
        client.headers = {"User-Agent": "My Scraper"}
        return client


pokeapi = GracefulPokeAPI()


async def main():
    try:
        pokemon = await pokeapi.get_pokemon("pikachu")
        print(pokemon)
    finally:
        pokeapi.report_status("rich")


asyncio.run(main())
```
