Ayakashi vs Scrapy
Ayakashi is a web scraping library for Node.js that allows developers to easily extract structured data from websites. It is built on top of the popular "puppeteer" library and provides a simple and intuitive API for defining and querying the structure of a website.
Features:
- Powerful querying and data models
Ayakashi's way of finding things in the page and using them is done with props and domQL. Directly inspired by the relational database world (and SQL), domQL makes DOM access easy and readable no matter how obscure the page's structure is. Props are the way to package domQL expressions as reusable structures, which can then be passed around to actions or used as models for data extraction.
- High level builtin actions
Ready-made actions so you can focus on what matters. Easily handle infinite scrolling, single page navigation, events and more. Plus, you can always build your own actions, either from scratch or by composing other actions.
- Preload code on pages
Need to include a bunch of code, a library you made, or a 3rd-party module and make it available on a page? Preloaders have you covered.
Scrapy is an open-source Python framework for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.
Scrapy provides:
- A built-in way to follow links and extract data from multiple pages (crawling)
- Handling of common web scraping tasks such as logging in, managing cookies, and following redirects.
Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.
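For illustration, here is a minimal Scrapy spider sketch that crawls a paginated listing and extracts data from each page; the spider name, start URL, and CSS selectors are placeholders and would need to match the target site:

import scrapy

class ProductSpider(scrapy.Spider):
    # the name, start URL and selectors below are placeholders for illustration
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # extract one item per product listing on the current page
        for product in response.css("div.product-item"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
            }
        # follow the "next page" link (if any) to crawl the rest of the listing
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running it with something like scrapy runspider products.py -o products.json also exercises the feed exports described further below.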
It also comes with a built-in mechanism for handling common web scraping problems (see the sketch after this list), such as:
- Handling HTTP errors
- Handling broken links
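As a rough sketch of how request failures can be handled explicitly, a spider can attach an errback to its requests; the spider name and URL here are again placeholders:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class ErrorAwareSpider(scrapy.Spider):
    name = "error_aware"

    def start_requests(self):
        # errback is invoked for failed requests (HTTP errors, DNS failures, timeouts)
        yield scrapy.Request(
            "https://example.com/might-404",
            callback=self.parse,
            errback=self.on_error,
        )

    def parse(self, response):
        yield {"url": response.url, "status": response.status}

    def on_error(self, failure):
        if failure.check(HttpError):
            # non-2xx responses land here instead of silently stopping the crawl
            self.logger.warning("HTTP error on %s", failure.value.response.url)
        else:
            self.logger.warning("request failed: %s", repr(failure))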
Scrapy also provides these features:
- Support for storing scraped data in various formats, such as CSV, JSON, and XML.
- Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
- Built-in support for handling common web scraping problems (like request deduplication and URL filtering).
- Ability to easily extend its functionality using middlewares.
- Ability to easily extend output processing using pipelines (see the sketch after this list).
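As an example of the pipeline extension point, here is a small hypothetical item pipeline that drops incomplete and duplicate items; the class name and field names are assumptions based on the dict items yielded in the spider sketch above:

from scrapy.exceptions import DropItem

class DedupeProductsPipeline:
    # hypothetical pipeline: assumes items are dicts with "title" and "price" fields

    def open_spider(self, spider):
        self.seen_titles = set()

    def process_item(self, item, spider):
        if not item.get("price"):
            # raising DropItem removes the item from further processing and output
            raise DropItem("missing price")
        if item["title"] in self.seen_titles:
            raise DropItem("duplicate title")
        self.seen_titles.add(item["title"])
        return item

A pipeline like this would be enabled through the ITEM_PIPELINES setting, e.g. ITEM_PIPELINES = {"myproject.pipelines.DedupeProductsPipeline": 300} (the module path is hypothetical).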
Example Use
const ayakashi = require("ayakashi");
const myAyakashi = ayakashi.init();
// navigate the browser
await myAyakashi.goTo("https://example.com/product");
// parsing HTML
// first by defnining a selector
myAyakashi
.select("productList")
.where({class: {eq: "product-item"}});
// then executing selector on current HTML:
const productList = await myAyakashi.extract("productList");
console.log(productList);