Skip to content

dudevsralger

AGPL-3.0 32 2 425
54 (month) Feb 20 2022 0.1.3(2023-08-01 20:28:33 ago)
165 1 3 MIT
Dec 22 2019 327 (month) 2.3.0(2021-03-18 00:10:00 ago)

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

The simplest web scraper will look like this: ```python from dude import select

@select(css="a") def get_link(element): return {"url": element.get_attribute("href")} ```

dude supports multiple parser backends: - playwright
- lxml
- parsel - beautifulsoup - pyppeteer - selenium

ralger is a small web scraping framework for R based on rvest and xml2.

It's goal to simplify basic web scraping and it provides a convenient and easy to use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing and auto link, title, image and paragraph extraction.

Example Use


```python from dude import select """ This example demonstrates how to use Parsel + async HTTPX To access an attribute, use: selector.attrib["href"] You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then: selector.get() To get the text, use ::text pseudo-element, then: selector.get() """ @select(css="a.url", priority=2) async def result_url(selector): return {"url": selector.attrib["href"]} # Option to get url using ::attr(name) pseudo-element @select(css="a.url::attr(href)", priority=2) async def result_url2(selector): return {"url2": selector.get()} @select(css=".title::text", priority=1) async def result_title(selector): return {"title": selector.get()} @select(css=".description::text", priority=0) async def result_description(selector): return {"description": selector.get()} if __name__ == "__main__": import dude dude.run(urls=["https://dude.ron.sh"], parser="parsel") ```
```r library("ralger") url <- "http://www.shanghairanking.com/rankings/arwu/2021" # retrieve HTML and select elements using CSS selectors: best_uni <- scrap(link = url, node = "a span", clean = TRUE) head(best_uni, 5) #> [1] "Harvard University" #> [2] "Stanford University" #> [3] "University of Cambridge" #> [4] "Massachusetts Institute of Technology (MIT)" #> [5] "University of California, Berkeley" # ralger can also parse HTML attributes attributes <- attribute_scrap( link = "https://ropensci.org/", node = "a", # the a tag attr = "class" # getting the class attribute ) head(attributes, 10) # NA values are a tags without a class attribute #> [1] "navbar-brand logo" "nav-link" NA #> [4] NA NA "nav-link" #> [7] NA "nav-link" NA #> [10] NA # # ralger can automatically scrape tables: data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW") head(data) #> # A tibble: 6 × 4 #> Rank Title `Lifetime Gross` Year #> #> 1 1 Avatar $2,847,397,339 2009 #> 2 2 Avengers: Endgame $2,797,501,328 2019 #> 3 3 Titanic $2,201,647,264 1997 #> 4 4 Star Wars: Episode VII - The Force Awakens $2,069,521,700 2015 #> 5 5 Avengers: Infinity War $2,048,359,754 2018 #> 6 6 Spider-Man: No Way Home $1,901,216,740 2021 ```

Alternatives / Similar


Was this page helpful?