Skip to content

dudevsscrapy

AGPL-3.0 32 2 425
54 (month) Feb 20 2022 0.1.3(2023-08-01 20:28:33 ago)
61,276 30 640 BSD-3-Clause
Jul 26 2019 3.1 million (month) 2.15.0(2026-04-09 12:02:09 ago)

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

The simplest web scraper will look like this: ```python from dude import select

@select(css="a") def get_link(element): return {"url": element.get_attribute("href")} ```

dude supports multiple parser backends: - playwright
- lxml
- parsel - beautifulsoup - pyppeteer - selenium

Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.

Scrapy provides:

  • A built-in way to follow links and extract data from multiple pages (crawling)
  • Handling common web scraping tasks such as logging in, handling cookies, and handling redirects.

Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.

It also comes with a built-in mechanism for handling common web scraping problems, such as:

  • handling HTTP errors
  • handling broken links

Scrapy also provide these features:

  • Support for storing scraped data in various formats, such as CSV, JSON, and XML.
  • Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
  • Built-in support for handling common web scraping problems (like deduplication and url filtering).
  • Ability to easily extend its functionality using middlewares.
  • Ability to easily extend output processing using pipelines.

Highlights


popularcss-selectorsxpath-selectorscommunity-toolsoutput-pipelinesmiddlewaresasyncproductionlarge-scale

Example Use


```python from dude import select """ This example demonstrates how to use Parsel + async HTTPX To access an attribute, use: selector.attrib["href"] You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then: selector.get() To get the text, use ::text pseudo-element, then: selector.get() """ @select(css="a.url", priority=2) async def result_url(selector): return {"url": selector.attrib["href"]} # Option to get url using ::attr(name) pseudo-element @select(css="a.url::attr(href)", priority=2) async def result_url2(selector): return {"url2": selector.get()} @select(css=".title::text", priority=1) async def result_title(selector): return {"title": selector.get()} @select(css=".description::text", priority=0) async def result_description(selector): return {"description": selector.get()} if __name__ == "__main__": import dude dude.run(urls=["https://dude.ron.sh"], parser="parsel") ```

Alternatives / Similar


Was this page helpful?