autoscraper vs dude

autoscraper: MIT, 1, 2, 6,638, 3.0 thousand (month), Jul 26 2019, 1.1.14 (2 years ago)

dude: AGPL-3.0, 428, 2, 29, 157 (month), Feb 20 2022, 0.1.3 (1 year, 6 months ago)

AutoScraper is a project for automatic web scraping that aims to make scraping easy. It takes a URL or the HTML content of a web page, plus a list of sample data that you want to scrape from that page. The samples can be text, URLs, or any HTML tag value on the page. AutoScraper learns the scraping rules from them and returns the similar elements. You can then use the learned object with new URLs to get similar content, or the exact same elements, from those pages.

AutoScraper takes a minimalistic, auto-generative approach to web scraping. For example, here's a scraper that finds all titles on a stackoverflow.com page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
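
To make the "learn rules from samples, then reuse them on new pages" idea concrete, here is a toy sketch using only the Python standard library. It is not AutoScraper's actual algorithm: the `ToyScraper` class, its rule format (a tag name plus `class` attribute), and the sample HTML are all assumptions made for illustration, and the method names are chosen only to echo the workflow described above.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Record (tag, class attribute, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags as (tag, class) pairs
        self.records = []  # (tag, class, text) triples in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.records.append((tag, cls, text))


def collect(html):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.records


class ToyScraper:
    """Hypothetical stand-in: a 'rule' is just a (tag, class) pair."""

    def __init__(self):
        self.rules = set()

    def build(self, html, wanted_list):
        # Learn: remember where each sample text was found.
        for tag, cls, text in collect(html):
            if text in wanted_list:
                self.rules.add((tag, cls))
        return self.get_result_similar(html)

    def get_result_similar(self, html):
        # Apply: return every text found at a learned location.
        return [text for tag, cls, text in collect(html) if (tag, cls) in self.rules]


training_page = """
<ul>
  <li><a class="title" href="/q/1">What are metaclasses in Python?</a></li>
  <li><a class="title" href="/q/2">How do I merge two dictionaries?</a></li>
  <li><a class="user" href="/u/7">alice</a></li>
</ul>
"""
new_page = (
    '<div><a class="title" href="/q/3">What does yield do?</a>'
    '<a class="user" href="/u/9">bob</a></div>'
)

scraper = ToyScraper()
learned = scraper.build(training_page, ["What are metaclasses in Python?"])
fresh = scraper.get_result_similar(new_page)
print(learned)
print(fresh)
```

One sample title is enough for the toy scraper to pick up the other title on the training page and the title on the new page, while the user names are ignored. A real library has to cope with far messier markup than this sketch does.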

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to let you build a web scraper in just a few lines of code, with an easy-to-learn syntax.

The simplest web scraper will look like this:

from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

Dude supports multiple parser backends:
- playwright
- lxml
- parsel
- beautifulsoup
- pyppeteer
- selenium

Highlights

popular, minimalistic, auto-generating

Example Use

from dude import select

"""
This example demonstrates how to use Parsel + async HTTPX
To access an attribute, use:
    selector.attrib["href"]
You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then:
    selector.get()
To get the text, use ::text pseudo-element, then:
    selector.get()
"""


@select(css="a.url", priority=2)
async def result_url(selector):
    return {"url": selector.attrib["href"]}


# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
    return {"url2": selector.get()}


@select(css=".title::text", priority=1)
async def result_title(selector):
    return {"title": selector.get()}


@select(css=".description::text", priority=0)
async def result_description(selector):
    return {"description": selector.get()}


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"], parser="parsel")

Alternatives / Similar

- colly: 23,747
- pholcus: 7,580
- geziyor: 2,667
- dataflowkit: 676
- scrapy: 54,211
- rvest: 1,498
- ferret: 5,716
- gocrawl: 2,039
- scrapyd: 2,980
- node-crawler: 6,733
- panther: 2,977
- gracy: 247
- spidr: 813
- scrapydweb: 3,218
- gerapy: 3,365
- wombat: 1,316
- ruia: 1,754
- photon: 11,149
- ralger: 156
- roach: 1,384
- dude: 428
- ayakashi: 213
- phpscraper: 554
- php-spider: 1,335
- crwlr-crawler: 356