htmlparser2 vs xmltodict

htmlparser2: MIT 18 4 4,529 | 143.2 million (month) | Aug 28 2011 | 10.0.0 (8 months ago)

xmltodict: 5,577 2 88 MIT | Jul 30 2007 | 62.5 million (month) | 0.14.2 (10 months ago)

htmlparser2 is a Node.js library for parsing HTML and XML documents. At its core it is an event-driven parser that calls your handlers as it reads tags and text. It can also build a tree of elements, similar to the Document Object Model (DOM) in web browsers, which lets you traverse and manipulate the structure of the document.

htmlparser2 is a low-level HTML parser, but it can still be useful in web scraping as a tool for HTML restructuring and serialization.

xmltodict is a Python library that allows you to work with XML data as if it were JSON. It allows you to parse XML documents and convert them to dictionaries, which can then be easily manipulated using standard dictionary operations.
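As a minimal sketch of what "working with XML as a dictionary" looks like in practice, the snippet below parses a small, made-up document and reads it with ordinary dictionary operations. The sample XML and variable names are illustrative assumptions. It relies on xmltodict's default conventions: attributes are stored under keys prefixed with `@`, and repeated sibling elements become a list.

```python
import xmltodict

# Hypothetical sample document, used only for illustration.
xml = """<library>
  <book id="b1"><title>Dune</title></book>
  <book id="b2"><title>Emma</title></book>
</library>"""

doc = xmltodict.parse(xml)

# Repeated <book> elements are collected into a list of dicts.
books = doc["library"]["book"]
ids = [book["@id"] for book in books]        # attributes use the "@" prefix
titles = [book["title"] for book in books]   # child elements are plain keys
print(ids, titles)

# With only one <book>, the value is a single dict rather than a list,
# unless force_list names the tags that should always be lists.
one_book = "<library><book><title>Dune</title></book></library>"
single = xmltodict.parse(one_book)
forced = xmltodict.parse(one_book, force_list=("book",))
print(type(single["library"]["book"]).__name__)
print(type(forced["library"]["book"]).__name__)
```

Using `force_list` for tags that may repeat keeps downstream code from having to check whether a value is a dict or a list.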

You can also use the library to convert a dictionary back into an XML document. xmltodict is built on top of Python's standard-library expat parser (xml.parsers.expat) and provides a simple, intuitive API for working with XML data.

Note that conversion can be quite slow for large XML documents, so in web scraping it is best used to parse specific snippets rather than whole HTML documents.
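For larger documents, one option worth knowing about is xmltodict's streaming mode, which hands elements to a callback one at a time instead of building the whole dictionary. The sketch below assumes a made-up catalog document; `handle_book` and `titles` are hypothetical names. Streaming mainly limits memory use and is not guaranteed to make conversion faster.

```python
import xmltodict

# Hypothetical sample document, used only for illustration.
xml_feed = """<catalog>
  <book><title>Dune</title><year>1965</year></book>
  <book><title>Emma</title><year>1815</year></book>
</catalog>"""

titles = []

def handle_book(path, book):
    # path is the list of (tag, attributes) pairs leading to this element;
    # book is the converted content of a single <book> element.
    titles.append(book["title"])
    return True  # return True to keep parsing

# item_depth=2: invoke the callback for every element two levels deep.
xmltodict.parse(xml_feed, item_depth=2, item_callback=handle_book)
print(titles)
```

Each `<book>` is processed and then discarded, so only one item needs to be held in memory at a time.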

xmltodict pairs well with JSON parsing tools like jmespath or jsonpath. Alternatively, it can be used in reverse: convert a JSON document to XML so that it can be queried with HTML parsing tools like CSS selectors and XPath.
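The reverse direction can be sketched as follows: load JSON, convert it to XML with `xmltodict.unparse`, then query the result with path expressions. To keep the example dependency-light it uses the limited XPath support in Python's standard `xml.etree.ElementTree` rather than parsel or lxml, which is a substitution made here for illustration. The JSON payload is made up. Note that `unparse` expects a dictionary with exactly one root key.

```python
import json
import xml.etree.ElementTree as ET

import xmltodict

# Hypothetical JSON payload with a single root key ("store").
json_text = """
{"store": {"product": [
    {"name": "apple", "price": 1.2},
    {"name": "pear", "price": 0.8}
]}}
"""

data = json.loads(json_text)

# Lists become repeated elements: <product>...</product><product>...</product>
xml_text = xmltodict.unparse(data)

# Parse as bytes so the XML declaration's encoding is honoured.
root = ET.fromstring(xml_text.encode("utf-8"))

# Path query relative to the <store> root element.
names = [el.text for el in root.findall("./product/name")]
print(names)
```

The same `xml_text` could instead be passed to a selector library such as parsel if full CSS or XPath support is needed.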

It can be installed via pip by running the pip install xmltodict command.

Example Use

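// htmlparser2 example (Node.js): log each parsing event as it is emitted
const htmlparser = require("htmlparser2");
const parser = new htmlparser.Parser({
    // called for every opening tag; attribs holds the tag's attributes
    onopentag: (name, attribs) => {
        console.log(`Opening tag: ${name}`);
    },
    // called for each run of text between tags
    ontext: (text) => {
        console.log(`Text: ${text}`);
    },
    // called for every closing tag
    onclosetag: (name) => {
        console.log(`Closing tag: ${name}`);
    }
}, {decodeEntities: true});

const html = "<p>Hello, <b>world</b>!</p>";
// feed the document to the parser, then signal the end of input
parser.write(html);
parser.end();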

# xmltodict example (Python): XML -> dict -> XML
import xmltodict

xml_string = """
<book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <publisher>Charles Scribner's Sons</publisher>
    <publication_date>1925</publication_date>
</book>
"""

book_dict = xmltodict.parse(xml_string)
print(book_dict)
# {'book': {'title': 'The Great Gatsby',
#           'author': 'F. Scott Fitzgerald',
#           'publisher': "Charles Scribner's Sons",
#           'publication_date': '1925'}}

# and to reverse:
book_xml = xmltodict.unparse(book_dict)
print(book_xml)

# the xml can be loaded and parsed using parsel or beautifulsoup:
from parsel import Selector
sel = Selector(book_xml)
print(sel.css('publication_date::text').get())
# 1925

Alternatives / Similar

sax-js: 1,101
parse5: 3,698
beautifulsoup: -
lxml: 2,737
xmltodict: 5,577
cheerio: 28,873
html5lib: 1,153
cssselect: 293
nokogiri: 6,173
feedparser: 2,048
pyquery: 2,312
parsel: 1,187
requests-html: 13,780
xml2: 221
rvest: 1,498
selectolax: 1,186
html5-php: 1,638
untangle: 619
domcrawler: 3,985
goquery: 14,273
xpath: 699
cascadia: 717
soup: 2,191
htmlquery: 760
html5-parser: 683
chompjs: 202
gazpacho: 764
embed: 2,103
chopper: 22
ralger: 156
