pholcus vs crwlr-crawler

Apache-2.0 6 1 7,516
Feb 15 2020 v1.3.4(4 years ago)
294 2 1 MIT
v1.6.1(12 days ago) Apr 18 2022 13 (month)

Pholcus is a minimalistic web crawler library written in the Go programming language. It is designed to be flexible and easy to use, and it supports concurrent, distributed, and modular crawling.

Note that Pholcus is documented and maintained in Chinese and has no English resources other than the source code itself.
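As a rough illustration of what concurrent crawling involves, the following Go sketch fetches a list of URLs with a bounded pool of goroutines. It does not use Pholcus: `crawlConcurrently` and `fetchFunc` are hypothetical names, and the fetch function is injected so the example runs without network access.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fetchFunc is a hypothetical stand-in for an HTTP fetch; it returns the page body.
type fetchFunc func(url string) (string, error)

// crawlConcurrently fetches every URL with at most `workers` goroutines
// running at once and returns the bodies keyed by URL.
func crawlConcurrently(urls []string, workers int, fetch fetchFunc) map[string]string {
	if workers < 1 {
		workers = 1 // always keep at least one worker so the job channel drains
	}
	jobs := make(chan string)
	results := make(map[string]string)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				body, err := fetch(u)
				if err != nil {
					continue // this sketch simply skips failed pages
				}
				mu.Lock()
				results[u] = body
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// A fake fetcher, used here instead of a real HTTP request.
	fake := func(url string) (string, error) { return "body of " + url, nil }

	pages := crawlConcurrently([]string{"https://example.com/a", "https://example.com/b"}, 2, fake)

	keys := make([]string, 0, len(pages))
	for k := range pages {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "->", pages[k])
	}
}
```

In a real crawler the injected function would wrap an HTTP client, and the worker count would double as a simple concurrency limit per target site.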

crwlr-crawler is a PHP library that provides a framework and many ready-to-use "steps", which serve as building blocks for your own crawlers and scrapers.

Some features:

- Crawler politeness (respecting robots.txt, throttling, ...)
- Load URLs using
  - a (PSR-18) HTTP client (default is of course Guzzle)
  - or a headless browser (Chrome) to get the source after JavaScript execution
- Get absolute links from HTML documents
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website
- Use cookies (or don't)
- Use any HTTP methods (GET, POST, ...) and send any headers or body
- Iterate over paginated list pages
- Extract data from:
  - HTML and also XML (using CSS selectors or XPath queries)
  - JSON (using dot notation)
  - CSV (map columns)
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
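One of the listed features, getting absolute links from a document, comes down to resolving each `href` against the page URL. A minimal Go sketch using only the standard library's `net/url` package follows. `absoluteLinks` is a hypothetical helper name, and the `href` values are assumed to have been extracted from the HTML already.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteLinks resolves each href against the URL of the page it was found on.
// Hrefs that cannot be parsed are skipped.
func absoluteLinks(pageURL string, hrefs []string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, h := range hrefs {
		ref, err := url.Parse(h)
		if err != nil {
			continue // ignore malformed hrefs in this sketch
		}
		out = append(out, base.ResolveReference(ref).String())
	}
	return out, nil
}

func main() {
	links, err := absoluteLinks("https://example.com/blog/post", []string{
		"/about",               // root-relative
		"next",                 // relative to the current directory
		"https://other.test/x", // already absolute
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

Resolving against the page URL rather than the site root matters for relative links such as `next`, which depend on the directory of the current page.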

Example Use


```go
package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/henrylee2cn/pholcus/exec"
    _ "github.com/henrylee2cn/pholcus/spider/standard" // standard spider
)

func main() {
    // create the spider object
    spider := exec.NewSpider(exec.NewTask("demo", "https://www.example.com"))
    // add a callback for URL routes matching a regex pattern; ".*" matches any route
    spider.AddRule(".*", "Parse")
    // start the spider
    spider.Start()
}

// Parse is the callback registered above.
func Parse(self *exec.Spider, doc *goquery.Document) {
    // callbacks receive the spider and a reference to the parsed HTML document
}
```

```php
<?php
require_once 'vendor/autoload.php';

use Crwlr\Crawler;

$crawler = new Crawler();
$crawler->get('https://example.com', ['User-Agent' => 'webscraping.fyi']);


// more links can be followed:
$crawler->followLinks();

// and current page can be parsed:
$response = $crawler->response();
$title = $crawler->filter('title')->text();
echo $response->getContent();
```

Alternatives / Similar