Skip to content

scrapyvsgerapy

BSD 652 30 50,703
1.6 million (month) Jul 26 2019 2.11.1(a month ago)
3,199 4 64 MIT
0.9.13(8 months ago) Jul 04 2017 545 (month)

Scrapy is an open-source Python library for web scraping. It allows developers to extract structured data from websites using a simple and consistent interface.

Scrapy provides:

  • A built-in way to follow links and extract data from multiple pages (crawling)
  • Handling common web scraping tasks such as logging in, handling cookies, and handling redirects.

Scrapy is built on top of the Twisted networking engine, which provides a non-blocking way to handle multiple requests at the same time, allowing Scrapy to efficiently scrape large websites.

It also comes with a built-in mechanism for handling common web scraping problems, such as:

  • handling HTTP errors
  • handling broken links

Scrapy also provide these features:

  • Support for storing scraped data in various formats, such as CSV, JSON, and XML.
  • Built-in support for selecting and extracting data using XPath or CSS selectors (through parsel).
  • Built-in support for handling common web scraping problems (like deduplication and url filtering).
  • Ability to easily extend its functionality using middlewares.
  • Ability to easily extend output processing using pipelines.

Gerapy is a Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js.

It is built on top of the Scrapy framework and provides a simple and easy-to-use interface for performing web scraping tasks. Gerapy also includes features such as support for scheduling and distributed crawling, as well as a built-in web-based dashboard for monitoring and managing scraping tasks. Additionally, Gerapy is designed to be highly extensible, allowing users to easily create custom plugins and integrations.

Overall, Gerapy is a useful tool for those looking to automate web scraping tasks and extract data from websites.

Highlights


popularcss-selectorsxpath-selectorscommunity-toolsoutput-pipelinesmiddlewaresasyncproductionlarge-scale

Example Use


Alternatives / Similar