Skip to content

photonvsroach

MIT 52 3 10,458
305 (month) Aug 24 2018 1.1.9(5 years ago)
1,315 2 14 MIT
v3.2.0(2 days ago) Dec 27 2021 199 (month)

Photon is a Python library for web scraping. It is designed to be lightweight and fast, and can be used to extract data from websites and web pages. Photon can extract the following data while crawling:

  • URLs (in-scope & out-of-scope)
  • URLs with parameters (example.com/gallery.php?id=2)
  • Intel (emails, social media accounts, amazon buckets etc.)
  • Files (pdf, png, xml etc.)
  • Secret keys (auth/API keys & hashes)
  • JavaScript files & Endpoints present in them
  • Strings matching custom regex pattern
  • Subdomains & DNS related data

The extracted information is saved in an organized manner or can be exported as json.

Roach is a complete web scraping toolkit for PHP. It is heavily inspired by the popular Scrapy package for Python.

Roach allows us to define spiders that crawl and scrape web documents. Roach isn’t just a simple crawler, but includes an entire pipeline to clean, persist and otherwise process extracted data as well.

Just like scrapy, Roach supports: - Middlewares - Item Pipelines - Extendibility through Plugins

It’s your all-in-one resource for web scraping in PHP.

Example Use


from photon import Photon

#Create a new Photon instance
ph = Photon()

#Extract data from a specific element of the website
url = "https://www.example.com"
selector = "div.main"
data = ph.get_data(url, selector)

#Print the extracted data
print(data)


#Extract data from multiple websites asynchronously
urls = ["https://www.example1.com", "https://www.example2.com"]
data = ph.get_data_async(urls)
<?php

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class RoachDocsSpider extends BasicSpider
{
    /**
     * @var string[]
     */
    public array $startUrls = [
        'https://roach-php.dev/docs/spiders'
    ];

    public function parse(Response $response): \Generator
    {
        $title = $response->filter('h1')->text();

        $subtitle = $response
            ->filter('main > div:nth-child(2) p:first-of-type')
            ->text();

        yield $this->item([
            'title' => $title,
            'subtitle' => $subtitle,
        ]);
    }
}

Alternatives / Similar