ayakashi vs autoscraper

ayakashi: AGPL-3.0-only 8 1 213
119 (month) Apr 18 2019 1.0.0-beta8.4 (2 years ago)

autoscraper: 6,638 2 1 MIT
Jul 26 2019 3.0 thousand (month) 1.1.14 (3 years ago)

Ayakashi is a web scraping library for Node.js that allows developers to easily extract structured data from websites. It is built on top of the popular "puppeteer" library and provides a simple and intuitive API for defining and querying the structure of a website.

Features:

Powerful querying and data models
Ayakashi finds things in the page and uses them through props and domQL. Directly inspired by the relational database world (and SQL), domQL makes DOM access easy and readable no matter how obscure the page's structure is. Props package domQL expressions as reusable structures, which can then be passed to actions or used as models for data extraction.
High-level built-in actions
Ready-made actions so you can focus on what matters. Easily handle infinite scrolling, single page navigation, events and more. Plus, you can always build your own actions, either from scratch or by composing other actions.
Preload code on pages
Need to include a bunch of code, a library you made or a 3rd party module and make it available on a page? Preloaders have you covered.

The AutoScraper project automates web scraping. It takes a URL or the HTML content of a web page, plus a list of sample data that you want to scrape from that page. The samples can be text, URLs, or any HTML tag value on the page. AutoScraper learns the scraping rules from those samples and returns similar elements. You can then reuse the learned object on new URLs to get similar content or the exact same element from those pages.

AutoScraper takes a minimalistic, auto-generative approach to web scraping. For example, here's a scraper that finds all titles on a stackoverflow.com page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
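
To make the idea of "learning scraping rules from samples" more concrete, here is a minimal standard-library sketch. It is not AutoScraper's actual implementation: the helper names learn_rules and apply_rules, the sample HTML, and the choice of a (tag, class) pair as the "rule" are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _TextCollector(HTMLParser):
    """Record (tag, class attribute, text) for every text node in a page."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open elements as (tag, class) pairs
        self.nodes = []   # (tag, class, text) triples in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.nodes.append((tag, cls, text))


def _parse(html):
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def learn_rules(html, wanted_list):
    """Return the (tag, class) pairs of elements whose text matches a sample."""
    wanted = set(wanted_list)
    return {(tag, cls) for tag, cls, text in _parse(html) if text in wanted}


def apply_rules(html, rules):
    """Return the text of every element that matches a learned rule."""
    return [text for tag, cls, text in _parse(html) if (tag, cls) in rules]


# Hypothetical pages: learn from one sample title, then reuse the rule.
sample_page = """
<ul>
  <li><a class="title" href="/q/1">What are metaclasses in Python?</a></li>
  <li><a class="title" href="/q/2">How do I merge two dictionaries?</a></li>
  <li><span class="votes">42</span></li>
</ul>
"""
new_page = '<div><a class="title" href="/q/3">What does yield do?</a></div>'

rules = learn_rules(sample_page, ["What are metaclasses in Python?"])
print(apply_rules(sample_page, rules))  # both titles from the sample page
print(apply_rules(new_page, rules))     # the title from the unseen page
```

A single sample is enough here because both titles share the same tag and class. A real tool has to cope with far messier markup, which is why this sketch should be read only as an illustration of the learn-then-reuse workflow described above.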

Highlights

popular, minimalistic, auto-generating

Example Use

const ayakashi = require("ayakashi");

// await is only valid inside an async function in a CommonJS script
(async () => {
    const myAyakashi = ayakashi.init();

    // navigate the browser
    await myAyakashi.goTo("https://example.com/product");

    // parse the HTML:
    // first, define a selector
    myAyakashi
        .select("productList")
        .where({class: {eq: "product-item"}});

    // then run the selector against the current page
    const productList = await myAyakashi.extract("productList");
    console.log(productList);
})();
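
The where clause in the example reads like a SQL filter over element attributes. The following Python sketch shows one way such a clause could be evaluated. It is not Ayakashi's implementation: select_where, the OPERATORS table, and the "like" operator are hypothetical names chosen for illustration, and class is compared as a whole string rather than as a list of class tokens.

```python
from html.parser import HTMLParser

# Hypothetical operator table; only "eq" appears in the example above.
OPERATORS = {
    "eq": lambda actual, expected: actual == expected,
    "like": lambda actual, expected: actual is not None and expected in actual,
}


class _ElementCollector(HTMLParser):
    """Collect every start tag as a dict of its tag name and attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append({"tagName": tag, **dict(attrs)})


def select_where(html, where):
    """Return elements whose attributes satisfy every condition in `where`."""
    collector = _ElementCollector()
    collector.feed(html)
    collector.close()

    def matches(element):
        return all(
            OPERATORS[op](element.get(attr), expected)
            for attr, condition in where.items()
            for op, expected in condition.items()
        )

    return [element for element in collector.elements if matches(element)]


# Hypothetical markup mirroring the {class: {eq: "product-item"}} query.
page = (
    '<div class="product-item" id="a"></div>'
    '<div class="banner"></div>'
    '<li class="product-item"></li>'
)
for element in select_where(page, {"class": {"eq": "product-item"}}):
    print(element["tagName"], element.get("id"))
```

Keeping the query as plain data (a nested dict) is what lets it be stored under a name and reused, which is the role the text assigns to props.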

Alternatives / Similar

colly: 23,747
pholcus: 7,580
geziyor: 2,667
puppeteer: 89,751
dataflowkit: 676
scrapy: 54,211
puppeteer-stealth: 89,751
rvest: 1,498
gocrawl: 2,039
ferret: 5,716
scrapyd: 2,980
node-crawler: 6,733
panther: 2,977
autoscraper: 6,638
gracy: 247
spidr: 813
scrapydweb: 3,218
gerapy: 3,365
wombat: 1,316
ruia: 1,754
photon: 11,149
ralger: 156
roach: 1,384
dude: 428
phpscraper: 554
php-spider: 1,335
crwlr-crawler: 356