geziyorvsdude
Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.
Features:
- JS Rendering
- 5.000+ Requests/Sec
- Caching (Memory/Disk/LevelDB)
- Automatic Data Exporting (JSON, CSV, or custom)
- Metrics (Prometheus, Expvar, or custom)
- Limit Concurrency (Global/Per Domain)
- Request Delays (Constant/Randomized)
- Cookies, Middlewares, robots.txt
- Automatic response decoding to UTF-8
- Proxy management (Single, Round-Robin, Custom)
Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.
The simplest web scraper will look like this:
from dude import select
@select(css="a")
def get_link(element):
return {"url": element.get_attribute("href")}
dude supports multiple parser backends:
- playwright
- lxml
- parsel
- beautifulsoup
- pyppeteer
- selenium
Example Use
// This example extracts all quotes from quotes.toscrape.com and exports to JSON file.
func main() {
geziyor.NewGeziyor(&geziyor.Options{
StartURLs: []string{"http://quotes.toscrape.com/"},
ParseFunc: quotesParse,
Exporters: []export.Exporter{&export.JSON{}},
}).Start()
}
func quotesParse(g *geziyor.Geziyor, r *client.Response) {
r.HTMLDoc.Find("div.quote").Each(func(i int, s *goquery.Selection) {
g.Exports <- map[string]interface{}{
"text": s.Find("span.text").Text(),
"author": s.Find("small.author").Text(),
}
})
if href, ok := r.HTMLDoc.Find("li.next > a").Attr("href"); ok {
g.Get(r.JoinURL(href), quotesParse)
}
}
from dude import select
"""
This example demonstrates how to use Parsel + async HTTPX
To access an attribute, use:
selector.attrib["href"]
You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then:
selector.get()
To get the text, use ::text pseudo-element, then:
selector.get()
"""
@select(css="a.url", priority=2)
async def result_url(selector):
return {"url": selector.attrib["href"]}
# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
return {"url2": selector.get()}
@select(css=".title::text", priority=1)
async def result_title(selector):
return {"title": selector.get()}
@select(css=".description::text", priority=0)
async def result_description(selector):
return {"description": selector.get()}
if __name__ == "__main__":
import dude
dude.run(urls=["https://dude.ron.sh"], parser="parsel")