Skip to content

xml2vsrequests-html

MIT 62 3 213
675.4 thousand (month) Apr 20 2015 1.3.6(7 months ago)
13,506 2 210 MIT
0.10.0(5 years ago) Feb 25 2018 681.6 thousand (month)

The xml2 package is a binding to libxml2, making it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.

xml2 can be used to parse HTML documents using XPath selectors and is a successor to R's XML package with a few improvements:

  • xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.
  • xml2 has a very simple class hierarchy so don't need to think about exactly what type of object you have, xml2 will just do the right thing.
  • More convenient handling of namespaces in Xpath expressions - see xml_ns() and xml_ns_strip() to get started.

requests-html is a Python package that allows you to easily make HTTP requests and parse the HTML content of web pages. It is built on top of the popular requests package and uses the html parser from the lxml library, which makes it fast and efficient. This package is designed to provide a simple and convenient API for web scraping, and it supports features such as JavaScript rendering, CSS selectors, and form submissions.

It also offers a lot of functionalities such as cookie, session, and proxy support, which makes it an easy-to-use package for web scraping and web automation tasks.

In short requests-html offers:

  • Full JavaScript support!
  • CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection–pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.
  • Async Support

Example Use


library("xml2")
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x

xml_name(x)
xml_children(x)
xml_text(x)
xml_find_all(x, ".//baz")

h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.example.com')

# print the HTML content of the page
print(r.html.html)

# use CSS selectors to find specific elements on the page
title = r.html.find('title', first=True)
print(title.text)

Alternatives / Similar