Skip to content

dudevsgracy

MIT 24 2 411
27 (month) Feb 20 2022 0.1.3(8 months ago)
235 2 1 MIT
1.33.0(a month ago) Feb 05 2023 917 (month)

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

The simplest web scraper will look like this:

from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

dude supports multiple parser backends: - playwright
- lxml
- parsel - beautifulsoup - pyppeteer - selenium

Gracy is an API client library based on httpx that provides an extra stability layer with:

  • Retry logic
  • Logging
  • Connection throttling
  • Tracking/Middleware

In web scraping, Gracy can be a convenient tool for creating scraper based API clients.

Example Use


from dude import select

"""
This example demonstrates how to use Parsel + async HTTPX
To access an attribute, use:
    selector.attrib["href"]
You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then:
    selector.get()
To get the text, use ::text pseudo-element, then:
    selector.get()
"""


@select(css="a.url", priority=2)
async def result_url(selector):
    return {"url": selector.attrib["href"]}


# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
    return {"url2": selector.get()}


@select(css=".title::text", priority=1)
async def result_title(selector):
    return {"title": selector.get()}


@select(css=".description::text", priority=0)
async def result_description(selector):
    return {"description": selector.get()}


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"], parser="parsel")
# 0. Import
import asyncio
from typing import Awaitable
from gracy import BaseEndpoint, Gracy, GracyConfig, LogEvent, LogLevel

# 1. Define your endpoints
class PokeApiEndpoint(BaseEndpoint):
    GET_POKEMON = "/pokemon/{NAME}" # 👈 Put placeholders as needed

# 2. Define your Graceful API
class GracefulPokeAPI(Gracy[str]):
    class Config:  # type: ignore
        BASE_URL = "https://pokeapi.co/api/v2/" # 👈 Optional BASE_URL
        # 👇 Define settings to apply for every request
        SETTINGS = GracyConfig(
          log_request=LogEvent(LogLevel.DEBUG),
          log_response=LogEvent(LogLevel.INFO, "{URL} took {ELAPSED}"),
          parser={
            "default": lambda r: r.json()
          }
        )

    async def get_pokemon(self, name: str) -> Awaitable[dict]:
        return await self.get(PokeApiEndpoint.GET_POKEMON, {"NAME": name})

    # Note: since Gracy is based on httpx we can customized the used client with custom headers etc"
    def _create_client(self) -> httpx.AsyncClient:
        client = super()._create_client()
        client.headers = {"User-Agent": f"My Scraper"} 
        return client

pokeapi = GracefulPokeAPI()

async def main():
    try:
      pokemon = await pokeapi.get_pokemon("pikachu")
      print(pokemon)

    finally:
        pokeapi.report_status("rich")


asyncio.run(main())

Alternatives / Similar