

MIT 24 2 411
27 (month) Feb 20 2022 0.1.3(8 months ago)
298 2 2 MIT
v1.7.2(17 days ago) Apr 18 2022 12 (month)

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to let you build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

The simplest web scraper will look like this:

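from dude import select

# Register get_link for every <a> element (the css="a" selector is an assumption).
@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}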
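The decorator style can be pictured as a small registry: each decorated function is stored together with its selector, and a runner later calls every stored function on the matching elements. The sketch below is a hypothetical, standard-library-only illustration of that idea, not Dude's actual implementation. The names `register`, `HANDLERS`, and `run_handlers` are made up, and plain dictionaries stand in for parsed HTML elements.

```python
from typing import Callable, List, Tuple

# Hypothetical registry of (tag name, handler) pairs. Not part of Dude.
HANDLERS: List[Tuple[str, Callable[[dict], dict]]] = []


def register(tag: str):
    """Store the decorated function as a handler for elements with this tag."""
    def decorator(func):
        HANDLERS.append((tag, func))
        return func  # keep the function usable on its own
    return decorator


@register("a")
def get_link(element: dict) -> dict:
    return {"url": element.get("href")}


def run_handlers(elements: List[dict]) -> List[dict]:
    """Call every registered handler on each element whose tag matches."""
    results = []
    for element in elements:
        for tag, handler in HANDLERS:
            if element["tag"] == tag:
                results.append(handler(element))
    return results


# Stand-in for a parsed page (assumed data, for illustration only).
page = [
    {"tag": "a", "href": "/docs"},
    {"tag": "p", "text": "hello"},
    {"tag": "a", "href": "/about"},
]
scraped = run_handlers(page)
```

Here `scraped` holds one dictionary per `a` element, which mirrors how a decorated handler returns a small dictionary for each match.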

Dude supports multiple parser backends:
- playwright
- lxml
- parsel
- beautifulsoup
- pyppeteer
- selenium

This PHP library provides a framework and many ready-to-use building blocks, so-called steps, that you can combine to build your own crawlers and scrapers.

Some features:
- Crawler politeness (respecting robots.txt, throttling, ...)
- Load URLs using
  - a (PSR-18) HTTP client (default is of course Guzzle)
  - or a headless browser (Chrome) to get the source after JavaScript execution
- Get absolute links from HTML documents
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website
- Use cookies (or don't)
- Use any HTTP methods (GET, POST, ...) and send any headers or body
- Iterate over paginated list pages
- Extract data from:
  - HTML and also XML (using CSS selectors or XPath queries)
  - JSON (using dot notation)
  - CSV (map columns)
- Extract structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
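Two of these features, respecting robots.txt and resolving absolute links, can be illustrated with a short sketch. It is written in Python with only the standard library and is not the PHP library's own API. The robots.txt rules, the HTML snippet, the `example.com` base URL, and the bot name `MyBot` are all assumed values, inlined so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin turns relative references into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


# Assumed robots.txt content, parsed from memory instead of fetched.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Assumed page content with three same-site links.
html = (
    '<a href="/blog">Blog</a> '
    '<a href="private/secret">Secret</a> '
    '<a href="about.html">About</a>'
)
collector = LinkCollector("https://example.com/")
collector.feed(html)

# Keep only the links that robots.txt allows for the hypothetical bot name.
allowed = [url for url in collector.links if robots.can_fetch("MyBot", url)]
```

All links in the snippet stay on one host on purpose: `RobotFileParser.can_fetch` matches on the URL path, so each host's links should be checked against that host's own robots.txt.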

Example Use

from dude import select

# This example demonstrates how to use Parsel + async HTTPX.
# To access an attribute, use: selector.attrib["href"]
# You can also access an attribute using the ::attr(name) pseudo-element,
# for example "a::attr(href)", then: selector.get()
# To get the text, use the ::text pseudo-element, then: selector.get()

@select(css="a.url", priority=2)
async def result_url(selector):
    return {"url": selector.attrib["href"]}

# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
    return {"url2": selector.get()}

@select(css=".title::text", priority=1)
async def result_title(selector):
    return {"title": selector.get()}

@select(css=".description::text", priority=0)
async def result_description(selector):
    return {"description": selector.get()}

if __name__ == "__main__":
    import dude

    # The target URL is elided in the source; supply your own.
    dude.run(urls=[""], parser="parsel")

<?php
require_once 'vendor/autoload.php';

use Crwlr\Crawler;

$crawler = new Crawler();
$crawler->get('', ['User-Agent' => '']);

// more links can be followed:

// and current page can be parsed:
$response = $crawler->response();
$title = $crawler->filter('title')->text();
echo $response->getContent();

Alternatives / Similar