cascadia vs requests-html
cascadia is a Go library that implements a CSS selector engine, letting you select elements from a parsed HTML document with CSS selectors.
It operates on the node tree produced by the golang.org/x/net/html parser (a Go-team package that lives outside the standard library), so it fits into any code that already parses HTML with that package.
requests-html is a Python package for making HTTP requests and parsing the HTML content of the pages they return. It is built on top of the popular requests package and parses HTML with lxml, which keeps it fast. The package is designed to provide a simple, convenient API for web scraping, and it supports features such as JavaScript rendering, CSS selectors, and form submissions.
It also inherits cookie, session, and proxy support from requests, which makes it easy to use for web scraping and web automation tasks.
In short, requests-html offers:
- Full JavaScript support!
- CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async support.
Example Use
package main

import (
	"fmt"
	"strings"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

func main() {
	// Create an HTML string. It is named htmlStr so that it does not
	// shadow the imported html package.
	htmlStr := `<html>
<body>
<div id="content">
<p>Hello, World!</p>
<a href="http://example.com">Example</a>
</div>
</body>
</html>`

	// Parse the HTML string into a node tree
	doc, err := html.Parse(strings.NewReader(htmlStr))
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	// Compile the CSS selector
	sel, err := cascadia.Compile("p")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	// Use Selector.MatchAll to collect every matching element;
	// Selector.Match only reports whether a single node matches.
	matches := sel.MatchAll(doc)
	if len(matches) > 0 {
		fmt.Println(matches[0].FirstChild.Data)
		// > Hello, World!
	}
}
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.example.com')
# print the HTML content of the page
print(r.html.html)
# use CSS selectors to find specific elements on the page
title = r.html.find('title', first=True)
print(title.text)