Pholcus vs PHPScraper
Pholcus is a minimalistic web crawler library written in the Go programming language. It is designed to be flexible and easy to use, and it supports concurrent, distributed, and modular crawling.
Note that Pholcus is documented and maintained in Chinese; apart from the source code itself, it has no English-language resources.
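To make "concurrent crawling" concrete, here is a minimal worker-pool sketch in plain Go. It is not Pholcus code and does not use its API: `crawlAll`, `fetchFunc`, and the stub fetcher are hypothetical names, and a real crawler would plug an HTTP download in where the stub sits.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fetchFunc is a hypothetical stand-in for the page download step.
type fetchFunc func(url string) (string, error)

// crawlAll fetches every URL using a fixed number of worker goroutines
// and returns the bodies of successfully fetched pages keyed by URL.
func crawlAll(urls []string, workers int, fetch fetchFunc) map[string]string {
	if workers < 1 {
		workers = 1 // avoid blocking forever when no worker is started
	}
	jobs := make(chan string)
	results := make(map[string]string)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body, err := fetch(u)
				if err != nil {
					continue // failed pages are simply skipped in this sketch
				}
				mu.Lock()
				results[u] = body
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Stub fetcher so the sketch runs without network access.
	fake := func(url string) (string, error) {
		return "<html>" + url + "</html>", nil
	}
	urls := []string{
		"https://www.example.com/a",
		"https://www.example.com/b",
		"https://www.example.com/c",
	}
	pages := crawlAll(urls, 2, fake)

	// Map iteration order is random, so sort the keys before printing.
	keys := make([]string, 0, len(pages))
	for k := range pages {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, len(pages[k]))
	}
}
```

The number of workers caps how many requests are in flight at once, which is the usual knob for balancing crawl speed against load on the target site.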
PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. Instead, you can just go to a website and get the relevant information for your project.
PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.
Features:
- Direct access to basic page data such as metadata, links, images, headings, content, and keywords.
- File downloading.
- RSS, Sitemap and other feed processing.
- CSV, XML and JSON file processing.
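As a rough illustration of what sitemap processing involves, the sketch below reads the `<loc>` entries of a sitemaps.org-style document with Go's standard `encoding/xml` package. It is written in Go to match the other example on this page and is not how PHPScraper implements the feature; `urlSet` and `sitemapLocs` are hypothetical names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// urlSet mirrors the <urlset> root element of a sitemap:
// a list of <url> entries, each with a <loc> child.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// sitemapLocs returns the page URLs listed in a sitemap document.
func sitemapLocs(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		locs = append(locs, strings.TrimSpace(u.Loc))
	}
	return locs, nil
}

func main() {
	sample := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>`)

	locs, err := sitemapLocs(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, l := range locs {
		fmt.Println(l)
	}
}
```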
Example Use
// Pholcus (Go)
package main

import (
	"github.com/PuerkitoBio/goquery" // assumed import path for the goquery package referenced below
	"github.com/henrylee2cn/pholcus/exec"
	_ "github.com/henrylee2cn/pholcus/spider/standard" // standard spider
)

func main() {
	// create spider object
	spider := exec.NewSpider(exec.NewTask("demo", "https://www.example.com"))
	// add a callback for URL routes matching a regex pattern; ".*" matches any route
	spider.AddRule(".*", "Parse")
	// start the spider
	spider.Start()
}

// define the callback here
func Parse(self *exec.Spider, doc *goquery.Document) {
	// callbacks receive the spider and a reference to the parsed HTML document
}
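The idea of routing URLs to callbacks by regex pattern, as in the example above, can be sketched without any crawler library. The `router`, `addRule`, and `dispatch` names here are hypothetical, and Pholcus' own rule handling may work differently; the sketch only shows first-match dispatch.

```go
package main

import (
	"fmt"
	"regexp"
)

// handler is a callback invoked for a matching URL.
type handler func(url string) string

// rule pairs a compiled pattern with its callback.
type rule struct {
	pattern *regexp.Regexp
	handle  handler
}

// router keeps rules in the order they were added.
type router struct {
	rules []rule
}

// addRule compiles the pattern and registers the callback.
func (r *router) addRule(pattern string, h handler) error {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return err
	}
	r.rules = append(r.rules, rule{pattern: re, handle: h})
	return nil
}

// dispatch runs the first callback whose pattern matches the URL.
// The boolean reports whether any rule matched.
func (r *router) dispatch(url string) (string, bool) {
	for _, ru := range r.rules {
		if ru.pattern.MatchString(url) {
			return ru.handle(url), true
		}
	}
	return "", false
}

func main() {
	r := &router{}
	// Specific rules go first; ".*" acts as a catch-all.
	if err := r.addRule(`/product/\d+$`, func(u string) string { return "product page: " + u }); err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	if err := r.addRule(`.*`, func(u string) string { return "generic page: " + u }); err != nil {
		fmt.Println("bad pattern:", err)
		return
	}

	for _, u := range []string{
		"https://www.example.com/product/42",
		"https://www.example.com/about",
	} {
		out, _ := r.dispatch(u)
		fmt.Println(out)
	}
}
```

Because the first matching rule wins, the order of registration matters: a catch-all pattern such as `.*` placed first would shadow every later rule.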
// PHPScraper (PHP)
// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;
// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');
// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text(); // "Content by ID"
// or pre-defined variables covering basic page data:
$web->links; // for all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline; // basic page outline
$web->cleanOutlineWithParagraphs; // cleaned-up page outline that also includes paragraphs