
Ayakashi vs PHPScraper

Ayakashi: AGPL-3.0-only license, latest release 1.0.0-beta8.4
PHPScraper: GPL-3.0-or-later license, latest release 3.0.0

Ayakashi is a web scraping library for Node.js that allows developers to easily extract structured data from websites. It is built on top of the popular "puppeteer" library and provides a simple and intuitive API for defining and querying the structure of a website.

Features:

  • Powerful querying and data models
    Ayakashi finds elements on the page and works with them through props and domQL. Directly inspired by the relational database world (and SQL), domQL makes DOM access easy and readable no matter how obscure the page's structure is. Props package domQL expressions as re-usable structures which can then be passed to actions or used as models for data extraction (see the sketch after this list).
  • High-level builtin actions
    Ready-made actions so you can focus on what matters. Easily handle infinite scrolling, single-page navigation, events and more. Plus, you can always build your own actions, either from scratch or by composing other actions.
  • Preload code on pages
    Need to include a bunch of code, a library you made or a 3rd party module and make it available on a page? Preloaders have you covered.
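
A minimal sketch of how props and domQL might fit together in a scraper file (assuming the scraper-file layout described in the Ayakashi docs). The "and" grouping, the "like" operator and the click() action are assumptions about the builtin API; only select(), where(), extract(), goTo() and the "eq" operator appear in the project's own example further below:

module.exports = async function(ayakashi) {
    await ayakashi.goTo("https://example.com/products");

    // a prop: a named, re-usable domQL expression
    ayakashi
        .select("nextPageButton")
        .where({
            and: [                              // assumed condition grouping
                {class: {like: "pagination"}},  // assumed "like" operator
                {id: {eq: "next-page"}}
            ]
        });

    // props can be handed to actions (click() assumed as a builtin action)...
    await ayakashi.click("nextPageButton");

    // ...or used as models for data extraction
    ayakashi.select("productTitles").where({class: {eq: "product-title"}});
    return ayakashi.extract("productTitles");
};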

PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted by selectors, preparing and converting data structures, and so on. You simply go to a website and collect the information relevant to your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

  • Direct access to basic page data such as meta data, links, images, headings, content, keywords, etc.
  • File downloading.
  • RSS, Sitemap and other feed processing.
  • CSV, XML and JSON file processing (see the sketch after this list).
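
A brief sketch of how the download and parsing helpers might be used. fetchAsset(), parseJson() and parseCsvWithHeader() are method names assumed from the PHPScraper documentation, so verify them against the release you install:

<?php

require 'vendor/autoload.php';

$web = new \Spekulatius\PHPScraper\PHPScraper;

// download a file (assumed helper behind the "file downloading" feature)
$csv = $web->fetchAsset('https://example.com/exports/products.csv');

// parse CSV/JSON data (assumed parsing helpers)
$rows = $web->parseCsvWithHeader($csv);
$data = $web->parseJson('{"price": 42}');

// the page helpers shown below work after navigating to a URL
$web->go('https://example.com');
echo count($web->links) . " links found\n";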

Example Use


Ayakashi (JavaScript):

const ayakashi = require("ayakashi");

(async () => {
    const myAyakashi = ayakashi.init();

    // navigate the browser
    await myAyakashi.goTo("https://example.com/product");

    // parse the HTML:
    // first, define a selector
    myAyakashi
        .select("productList")
        .where({class: {eq: "product-item"}});

    // then execute the selector against the current page
    const productList = await myAyakashi.extract("productList");
    console.log(productList);
})();

PHPScraper (PHP):

// create the scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;

// go to a URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');

// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"

// or via pre-defined properties covering basic page data:
$web->links;  // all links on the page
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline;  // basic page outline
$web->cleanOutlineWithParagraphs;  // cleaned-up outline including paragraphs
