Skip to content

html5-parservsralger

Apache-2.0 1 1 700
17.6 thousand (month) Jun 03 2007 0.4.12(2023-11-19 15:09:54 ago)
165 1 3 MIT
Dec 22 2019 327 (month) 2.3.0(2021-03-18 00:10:00 ago)

html5-parser is a Python library for parsing HTML and XML documents.

A fast implementation of the HTML 5 parsing spec for Python. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be a thirtieth of the html5lib parse times. That is a speedup of 30x. This differs, for instance, from the gumbo python bindings, where the initial parsing is done in C but the transformation into the final tree is done in python.

It is built on top of the popular lxml library and provides a simple and intuitive API for working with the document's structure.

html5-parser uses the HTML5 parsing algorithm, which is more lenient and forgiving than the traditional XML-based parsing algorithm. This means that it can parse HTML documents with malformed or missing tags and still produce a usable parse tree.

To use html5-parser, you first need to install it via pip by running pip install html5-parser. Once it is installed, you can use the html5_parser.parse() function to parse an HTML document and create a parse tree. For example:

``` from html5_parser import parse

html_string = "Hello, World!" root = parse(html_string) print(root.tag) # html ` You can also use `html5_parser.parse() with file-like objects, bytes or file paths.

Once you have a parse tree, you can use the find() and findall() methods to search for elements in the document similar to BeautifulSoup.

html5-parser also supports searching using xpath, similar to lxml.

ralger is a small web scraping framework for R based on rvest and xml2.

It's goal to simplify basic web scraping and it provides a convenient and easy to use API.

It offers functions for retrieving pages, parsing HTML using CSS selectors, automatic table parsing and auto link, title, image and paragraph extraction.

Example Use


```python from html5_parser import parse html_string = "Hello, World!" root = parse(html_string) print(root.tag) # html body = root.find("body") # or find all print(body.text) # "Hello, World!" for el in root.findall("p"): print(el.text) # "Hello ```
```r library("ralger") url <- "http://www.shanghairanking.com/rankings/arwu/2021" # retrieve HTML and select elements using CSS selectors: best_uni <- scrap(link = url, node = "a span", clean = TRUE) head(best_uni, 5) #> [1] "Harvard University" #> [2] "Stanford University" #> [3] "University of Cambridge" #> [4] "Massachusetts Institute of Technology (MIT)" #> [5] "University of California, Berkeley" # ralger can also parse HTML attributes attributes <- attribute_scrap( link = "https://ropensci.org/", node = "a", # the a tag attr = "class" # getting the class attribute ) head(attributes, 10) # NA values are a tags without a class attribute #> [1] "navbar-brand logo" "nav-link" NA #> [4] NA NA "nav-link" #> [7] NA "nav-link" NA #> [10] NA # # ralger can automatically scrape tables: data <- table_scrap(link ="https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW") head(data) #> # A tibble: 6 × 4 #> Rank Title `Lifetime Gross` Year #> #> 1 1 Avatar $2,847,397,339 2009 #> 2 2 Avengers: Endgame $2,797,501,328 2019 #> 3 3 Titanic $2,201,647,264 1997 #> 4 4 Star Wars: Episode VII - The Force Awakens $2,069,521,700 2015 #> 5 5 Avengers: Infinity War $2,048,359,754 2018 #> 6 6 Spider-Man: No Way Home $1,901,216,740 2021 ```

Alternatives / Similar


Was this page helpful?