Skip to content

dudevscolly

AGPL-3.0 29 2 428
157 (month) Feb 20 2022 0.1.3(2 years ago)
23,747 5 200 Apache-2.0
May 14 2018 v2.1.0(5 years ago)

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

The simplest web scraper will look like this:

from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}

dude supports multiple parser backends: - playwright
- lxml
- parsel - beautifulsoup - pyppeteer - selenium

Colly is a popular web scraping library for the Go programming language. It's designed to be fast and easy to use, and it provides a simple and flexible API for traversing and extracting information from websites.

Colly supports:

  • Concurrent scraping with a simple API
  • Automatic handling of cookies and sessions
  • Automatic handling of redirects
  • Support for parsing HTML and XML
  • Support for parsing JSON and binary data
  • Support for custom storage (e.g. scraping results to a database)
  • Simple JavaScript rendering with Colly's built-in rendering engine.

Colly also provides several optional features, such as support for user-agents, delay between requests, rate-limiting and proxy usage.

Colly's API is quite simple, and it is easy to get started with basic web scraping tasks. It's a good choice for scraping moderate to heavy sites, and it can be useful for a wide range of use cases, such as data mining, content extraction, and more.

Additionally, you can use it together with Goquery, a library that allow you to make jquery like queries on HTML documents and it is often used together with Colly to ease the way of parsing the HTML.

Highlights


popularcss-selectorsxpath-selectorscommunity-toolsoutput-pipelinesmiddlewaresasyncproductionlarge-scale

Example Use


from dude import select

"""
This example demonstrates how to use Parsel + async HTTPX
To access an attribute, use:
    selector.attrib["href"]
You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then:
    selector.get()
To get the text, use ::text pseudo-element, then:
    selector.get()
"""


@select(css="a.url", priority=2)
async def result_url(selector):
    return {"url": selector.attrib["href"]}


# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
    return {"url2": selector.get()}


@select(css=".title::text", priority=1)
async def result_title(selector):
    return {"title": selector.get()}


@select(css=".description::text", priority=0)
async def result_description(selector):
    return {"description": selector.get()}


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"], parser="parsel")
package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  // Instantiate default collector
  c := colly.NewCollector(
    // Visit only domains: hackerspaces.org, wiki.hackerspaces.org
    colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
  )

  // On every a element which has href attribute call callback
  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    // Print link
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    // Visit link found on page
    // Only those links are visited which are in AllowedDomains
    c.Visit(e.Request.AbsoluteURL(link))
  })

  // Before making a request print "Visiting ..."
  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  // Start scraping on https://hackerspaces.org
  c.Visit("https://hackerspaces.org/")
}

Alternatives / Similar


Was this page helpful?