Skip to content

scrapyvscrwlr-crawler

BSD 652 30 50,703
1.6 million (month) Jul 26 2019 2.11.1(a month ago)
298 2 2 MIT
v1.7.2(17 days ago) Apr 18 2022 12 (month)

Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.

Scrapy provides:

  • A built-in way to follow links and extract data from multiple pages (crawling)
  • Handling common web scraping tasks such as logging in, handling cookies, and handling redirects.

Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.

It also comes with a built-in mechanism for handling common web scraping problems, such as:

  • handling HTTP errors
  • handling broken links

Scrapy also provide these features:

  • Support for storing scraped data in various formats, such as CSV, JSON, and XML.
  • Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
  • Built-in support for handling common web scraping problems (like deduplication and url filtering).
  • Ability to easily extend its functionality using middlewares.
  • Ability to easily extend output processing using pipelines.

This library provides kind of a framework and a lot of ready to use, so-called steps, that you can use as building blocks, to build your own crawlers and scrapers with.

Some features: - Crawler Politeness innocent (respecting robots.txt, throttling,...) - Load URLs using - a (PSR-18) HTTP client (default is of course Guzzle) - or a headless browser (chrome) to get source after Javascript execution - Get absolute links from HTML documents link - Get sitemaps from robots.txt and get all URLs from those sitemaps - Crawl (load) all pages of a website spider - Use cookies (or don't) cookie - Use any HTTP methods (GET, POST,...) and send any headers or body - Iterate over paginated list pages repeat - Extract data from: - HTML and also XML (using CSS selectors or XPath queries) - JSON (using dot notation) - CSV (map columns) - Extract schema.org structured data in JSON-LD format from HTML documents - Keep memory usage low by using PHP Generators muscle - Cache HTTP responses during development, so you don't have to load pages again and again after every code change - Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)

Highlights


popularcss-selectorsxpath-selectorscommunity-toolsoutput-pipelinesmiddlewaresasyncproductionlarge-scale

Example Use


<?php
require_once 'vendor/autoload.php';

use Crwlr\Crawler;

$crawler = new Crawler();
$crawler->get('https://example.com', ['User-Agent' => 'webscraping.fyi']);


// more links can be followed:
$crawler->followLinks();

// and current page can be parsed:
$response = $crawler->response();
$title = $crawler->filter('title')->text();
echo $response->getContent();
```

Alternatives / Similar