
dataflowkit vs rvest

BSD-3-Clause 4 3 651
Feb 09 2017 2024-06-25 (20 days ago)
1,485 1 23 MIT
Nov 22 2014 483.1 thousand (month) 1.0.4 (1 year, 10 months ago)

Dataflow kit ("DFK") is a web scraping framework for Gophers (Go developers). It extracts data from web pages according to the CSS selectors you specify, and can be used for data mining, data processing or archiving.

A web-scraping pipeline consists of three general components:

  • Downloading an HTML web page (Fetch Service);
  • Parsing the HTML page and retrieving the data we're interested in (Parse Service);
  • Encoding the parsed data to CSV, MS Excel, JSON, JSON Lines or XML format.
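
The three stages above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python rather than DFK's own Go API: the `PAGES` dictionary, `fetch`, `parse` and `encode` are hypothetical names, and the fetch stage is stubbed with an in-memory page so the sketch runs offline.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical stand-in for the Fetch Service: a real fetcher would issue an
# HTTP request; here pages come from memory so the sketch runs offline.
PAGES = {
    "https://example.com/products": (
        '<div class="products">'
        '<a href="/product/1">Cat Food</a>'
        '<a href="/product/2">Dog Food</a>'
        "</div>"
    )
}


def fetch(url):
    """Stage 1: download the page (stubbed with an in-memory lookup)."""
    return PAGES[url]


class LinkParser(HTMLParser):
    """Stage 2: collect the text and href of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.rows.append({"text": text, "href": self._href})
            self._href = None


def parse(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def encode(rows, fmt):
    """Stage 3: serialize parsed rows as JSON, JSON Lines or CSV."""
    if fmt == "json":
        return json.dumps(rows)
    if fmt == "jsonl":
        return "\n".join(json.dumps(row) for row in rows)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["text", "href"], lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


rows = parse(fetch("https://example.com/products"))
print(encode(rows, "csv"))
```

Keeping the stages as separate functions mirrors the split into services: the fetcher can be swapped without touching the parsing or encoding code.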

For fetching, dataflowkit offers two types of page fetchers:

  • Base fetcher uses Go's standard HTTP client to fetch pages as is. It is faster than the Chrome fetcher, but it cannot render dynamic, JavaScript-driven web pages.
  • Chrome fetcher is intended for rendering dynamic, JavaScript-based content. It sends requests to Chrome running in headless mode.

For parsing, dataflowkit extracts data from the downloaded web page according to the rules listed in a JSON configuration file. Extracted data is returned in CSV, MS Excel, JSON, JSON Lines or XML format.

Some dataflowkit features:

  • Scraping of JavaScript-generated pages;
  • Data extraction from paginated websites;
  • Processing of infinite-scroll pages;
  • Scraping of websites behind a login form;
  • Cookie and session handling;
  • Following links and processing detail pages;
  • Managing delays between requests per domain;
  • Following robots.txt directives;
  • Saving intermediate data in Diskv or MongoDB. The storage interface is flexible enough to add more storage types easily;
  • Encoding results to CSV, MS Excel, JSON (Lines) and XML formats;
  • Dataflow kit is fast. It takes about 4-6 seconds to fetch and then parse 50 pages;
  • Dataflow kit is suitable for processing quite large volumes of data. Our tests show that parsing approximately 4 million pages takes about 7 hours.
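
Two of the features listed above, robots.txt handling and per-domain delays, can be sketched as follows. This is an illustration in Python using the standard `urllib.robotparser` module, not DFK's implementation: `PoliteScheduler`, `ROBOTS_TXT` and the 2-second delay are hypothetical, and for brevity a single set of robots.txt rules is applied to every URL, whereas a real crawler would load one per site.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would download this per site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class PoliteScheduler:
    """Check robots.txt rules and space out requests to the same domain."""

    def __init__(self, robots_lines, delay_seconds, clock=time.monotonic):
        self._robots = RobotFileParser()
        self._robots.parse(robots_lines)
        self._delay = delay_seconds
        self._clock = clock
        self._next_allowed = {}  # domain -> earliest time of the next request

    def allowed(self, url, user_agent="my-bot"):
        return self._robots.can_fetch(user_agent, url)

    def wait_time(self, url):
        """Return the seconds to wait before requesting url, reserving the slot."""
        domain = urlsplit(url).netloc
        now = self._clock()
        ready_at = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = ready_at + self._delay
        return ready_at - now


scheduler = PoliteScheduler(ROBOTS_TXT.splitlines(), delay_seconds=2.0)
for url in ["https://example.com/products", "https://example.com/private/data"]:
    if scheduler.allowed(url):
        # A crawler would call time.sleep() with this value before fetching.
        print(url, "wait", scheduler.wait_time(url))
    else:
        print(url, "blocked by robots.txt")
```

The clock is injectable so the spacing logic can be checked without actually sleeping.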

rvest is a popular R library for web scraping and parsing HTML and XML documents. It is built on top of the xml2 and httr libraries and provides a simple and consistent API for interacting with web pages.

One of the main advantages of rvest is its simplicity. It provides functions that make it easy to extract information from web pages, even for those who are not familiar with web scraping. The html_elements and html_element functions (named html_nodes and html_node before rvest 1.0.0) select elements from an HTML document using CSS selectors, similar to how you would select elements in JavaScript.

rvest also provides functions for interacting with forms: html_form reads a form from a page, html_form_set fills in its values, and html_form_submit or session_submit sends it (rvest 1.0.0 deprecated the older set_values and submit_form names). These functions make it easy to fill in forms and submit data to the server, which is useful when scraping sites that require authentication or that return content only in response to submitted form data.
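
What a form helper does under the hood can be sketched in a few lines: read the form's fields, override some values, and encode the request. The Python below illustrates that idea and is not rvest's implementation; `FormParser`, `build_submission` and the sample login page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical login page used only for this sketch.
LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf" value="abc123">
  <input type="text" name="user">
  <input type="password" name="password">
</form>
"""


class FormParser(HTMLParser):
    """Collect the first form's action, method and named input fields."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_submission(page_url, html, **values):
    """Return (method, absolute URL, encoded body) for the filled-in form."""
    parser = FormParser()
    parser.feed(html)
    unknown = set(values) - set(parser.fields)
    if unknown:
        raise KeyError(f"form has no fields named {sorted(unknown)}")
    # Hidden defaults (such as the CSRF token) are kept; given values override.
    data = {**parser.fields, **values}
    return parser.method, urljoin(page_url, parser.action), urlencode(data)


method, url, body = build_submission(
    "https://example.com/account", LOGIN_PAGE, user="alice", password="s3cret"
)
print(method, url, body)
```

Carrying the hidden fields along unchanged is the part that is easy to forget when building such a request by hand.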

rvest also works with XML documents. The same selection functions accept CSS selectors or XPath expressions for XML (older releases also exposed them as xml_nodes and xml_node), and html_attrs and html_attr extract attributes from the selected elements; the underlying xml2 package offers the equivalent xml_attrs and xml_attr.

Another advantage of rvest is cookie handling: a session (created with session(), called html_session() before rvest 1.0.0) carries cookies from one request to the next, so you can stay logged in while scraping a website. Redirects are followed by the underlying httr client.
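
Keeping a session alive comes down to remembering the cookies a server sets and sending them back on later requests. A minimal Python sketch of that idea follows; it is not rvest's mechanism, `CookieSession` is a hypothetical name, and the sketch deliberately ignores cookie domain, path and expiry rules.

```python
from http.cookies import SimpleCookie


class CookieSession:
    """Remember cookies from Set-Cookie headers and replay them later."""

    def __init__(self):
        self._jar = {}  # cookie name -> value

    def remember(self, set_cookie_headers):
        for header in set_cookie_headers:
            cookie = SimpleCookie()
            cookie.load(header)
            for name, morsel in cookie.items():
                self._jar[name] = morsel.value

    def cookie_header(self):
        """Value for the Cookie request header of the next request."""
        return "; ".join(f"{name}={value}" for name, value in self._jar.items())


session = CookieSession()
# Headers as they might arrive in a login response (made-up values).
session.remember(["sessionid=abc123; Path=/; HttpOnly", "theme=dark"])
print(session.cookie_header())
```

A later response that sets `sessionid` again simply overwrites the stored value, which is how a refreshed session token would be picked up.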

Example Use


Dataflowkit is driven by a JSON configuration such as:
{
  "name": "collection",
  "request": {
      "url": "https://example.com"
  },
  "fields": [
      {
          "name": "Title",
          "selector": ".product-container a",
          "extractor": {
              "types": [
                  "text",
                  "href"
              ],
              "filters": [
                  "trim",
                  "lowerCase"
              ],
              "params": {
                  "includeIfEmpty": false
              }
          }
      },
      {
          "name": "Image",
          "selector": "#product-container img",
          "extractor": {
              "types": [
                  "alt",
                  "src",
                  "width",
                  "height"
              ],
              "filters": [
                  "trim",
                  "upperCase"
              ]
          }
      },
      {
          "name": "Buyinfo",
          "selector": ".buy-info",
          "extractor": {
              "types": [
                  "text"
              ],
              "params": {
                  "includeIfEmpty": false
              }
          }
      }
  ],
  "paginator": {
      "selector": ".next",
      "attr": "href",
      "maxPages": 3
  },
  "format": "json",
  "fetcherType": "chrome",
  "paginateResults": false
}
This configuration is then ingested through a CLI command.
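
To make the roles of "types", "filters", "params" and "paginator" in such a configuration more concrete, here is a small Python sketch of how they could be interpreted. It is a guess at the semantics for illustration only, not DFK's code: the filter behavior, the default for includeIfEmpty, and whether maxPages counts the first page are all assumptions, and `apply_extractor` and `crawl_order` are hypothetical helpers.

```python
import json

# A trimmed copy of the configuration, reduced to one field.
CONFIG = json.loads("""
{
  "fields": [
    {"name": "Title",
     "selector": ".product-container a",
     "extractor": {"types": ["text", "href"],
                   "filters": ["trim", "lowerCase"],
                   "params": {"includeIfEmpty": false}}}
  ],
  "paginator": {"selector": ".next", "attr": "href", "maxPages": 3}
}
""")

# Assumed meaning of the filter names seen in the configuration.
FILTERS = {"trim": str.strip, "lowerCase": str.lower, "upperCase": str.upper}


def apply_extractor(extractor, raw):
    """raw maps an extractor type (e.g. "text") to the value found on an element."""
    # Assumption: empty values are kept unless includeIfEmpty is false.
    include_if_empty = extractor.get("params", {}).get("includeIfEmpty", True)
    out = {}
    for kind in extractor["types"]:
        value = raw.get(kind, "")
        for name in extractor.get("filters", []):
            value = FILTERS[name](value)
        if value or include_if_empty:
            out[kind] = value
    return out


def crawl_order(start_url, next_link_of, max_pages):
    """Follow "next" links from start_url, visiting at most max_pages pages."""
    urls, url = [], start_url
    while url is not None and len(urls) < max_pages:
        urls.append(url)
        url = next_link_of.get(url)
    return urls


title = CONFIG["fields"][0]["extractor"]
print(apply_extractor(title, {"text": "  Cat Food ", "href": "/Product/1"}))
```

Here `next_link_of` stands in for the step that finds the paginator selector on each fetched page and reads its "attr" value.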

rvest, by contrast, is called directly from R code:

library("rvest")

# Rvest can use basic HTTP client to download remote HTML:
tree <- read_html("http://webscraping.fyi/lib/r/rvest")
# or read from string:
tree <- read_html('
<div class="products">
  <a href="/product/1">Cat Food</a>
  <a href="/product/2">Dog Food</a>
</div>
')

# to parse HTML trees with rvest we use R pipes (the %>% operator) and the html_element function,
# which returns only the first match; html_elements returns every match.
# we can use CSS selectors:
print(tree %>% html_element(".products>a") %>% html_text())
# [1] "Cat Food"
print(tree %>% html_elements(".products>a") %>% html_text())
# [1] "Cat Food" "Dog Food"

# or XPath:
print(tree %>% html_element(xpath="//div[@class='products']/a") %>% html_text())
# [1] "Cat Food"

# Additionally rvest offers many quality of life functions:
# html_text2 - trims leading and trailing whitespace and collapses line breaks, as a browser would
print(tree %>% html_element("div") %>% html_text2())
# [1] "Cat Food Dog Food"

# html_attr - selects element's attribute:
print(tree %>% html_element("div") %>% html_attr('class'))
# [1] "products"
