Dude vs. Scrapy
Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, makes it possible to build a web scraper in just a few lines of code, and its syntax is easy to learn.
The simplest web scraper will look like this:
```python
from dude import select

@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}
```
Dude supports multiple parser backends:
- playwright
- lxml
- parsel
- beautifulsoup
- pyppeteer
- selenium
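The backend is chosen when the scraper is run. Below is a minimal sketch reusing the scraper above; it assumes that the backend names listed here are also the strings accepted by the `parser` argument of `dude.run()`, the same way `parser="parsel"` is passed in the example at the end of this page.

```python
from dude import select

@select(css="a")
def get_link(element):
    # The element object passed in depends on the chosen backend; the default
    # Playwright backend exposes get_attribute() as used here.
    return {"url": element.get_attribute("href")}

if __name__ == "__main__":
    import dude

    # Assumption: "playwright" is accepted here the same way "parsel" is in
    # the Parsel example further below; swap the string to switch backends.
    dude.run(urls=["https://dude.ron.sh"], parser="playwright")
```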
Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.
Scrapy provides:
- A built-in way to follow links and extract data from multiple pages (crawling)
- Handling of common web scraping tasks such as logging in, managing cookies, and following redirects.
Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.
It also comes with a built-in mechanism for handling common web scraping problems, such as:
- handling HTTP errors
- handling broken links
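Much of this behaviour is driven by Scrapy settings. A minimal sketch of a project `settings.py` touching the points above (the values are illustrative, not recommendations):

```python
# settings.py -- illustrative values for the behaviour described above.
CONCURRENT_REQUESTS = 16         # parallel requests handled by Twisted's event loop
RETRY_ENABLED = True             # retry transient failures via RetryMiddleware
RETRY_TIMES = 2                  # retries per failed request
HTTPERROR_ALLOWED_CODES = [404]  # pass selected HTTP error responses to the spider
```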
Scrapy also provides these features:
- Support for storing scraped data in various formats, such as CSV, JSON, and XML.
- Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
- Built-in support for handling common web scraping problems (like deduplication and URL filtering).
- Ability to easily extend its functionality using middlewares.
- Ability to easily extend output processing using pipelines.
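For comparison with the Dude example below, here is what a minimal Scrapy spider looks like. The class name, start URL, and the `a.next` pagination selector are illustrative assumptions; the selector API and crawling pattern are standard Scrapy.

```python
import scrapy

class LinksSpider(scrapy.Spider):
    """Minimal spider: extract link text/URLs and follow a 'next page' link."""

    name = "links"
    start_urls = ["https://example.com"]  # hypothetical start page

    def parse(self, response):
        # CSS selectors are evaluated by parsel under the hood.
        for anchor in response.css("a"):
            yield {
                "text": anchor.css("::text").get(),
                "url": anchor.attrib.get("href"),
            }

        # Follow pagination if a next-page link exists (crawling).
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as `links_spider.py`, this can be run with `scrapy runspider links_spider.py -o links.json`, which uses the feed exports mentioned above to write JSON.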
Example Use
```python
from dude import select

"""
This example demonstrates how to use Parsel + async HTTPX.

To access an attribute, use:
    selector.attrib["href"]

You can also access an attribute using the ::attr(name) pseudo-element,
for example "a::attr(href)", then:
    selector.get()

To get the text, use the ::text pseudo-element, then:
    selector.get()
"""


@select(css="a.url", priority=2)
async def result_url(selector):
    return {"url": selector.attrib["href"]}


# Option to get url using the ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
    return {"url2": selector.get()}


@select(css=".title::text", priority=1)
async def result_title(selector):
    return {"title": selector.get()}


@select(css=".description::text", priority=0)
async def result_description(selector):
    return {"description": selector.get()}


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"], parser="parsel")
```