dude vs phpscraper

AGPL-3.0 32 2 425
54 (month) Feb 20 2022 0.1.3(2023-08-01 20:28:33 ago)
583 2 28 GPL-3.0-or-later
May 04 2020 104 (month) 3.0.0(2024-04-09 15:34:59 ago)

Dude (dude uncomplicated data extraction) is a very simple framework for writing web scrapers using Python decorators. Its design, inspired by Flask, aims to let you build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

The simplest web scraper will look like this:

```python
from dude import select


@select(css="a")
def get_link(element):
    return {"url": element.get_attribute("href")}
```

dude supports multiple parser backends:
- playwright
- lxml
- parsel
- beautifulsoup
- pyppeteer
- selenium

PHPScraper is a universal web-util for PHP. The main goal is to get stuff done instead of getting distracted with selectors, preparing & converting data structures, etc. You can just go to a website and get the relevant information for your project.

PHPScraper is a minimalistic scraper framework that is built on top of other popular scraping tools.

Features:

  • Direct access to basic page features such as metadata, links, images, headings, content, keywords, etc.
  • File downloading.
  • RSS, Sitemap and other feed processing.
  • CSV, XML and JSON file processing.

Example Use


```python
from dude import select

"""
This example demonstrates how to use Parsel + async HTTPX

To access an attribute, use:
    selector.attrib["href"]
You can also access an attribute using the ::attr(name) pseudo-element, for example "a::attr(href)", then:
    selector.get()
To get the text, use ::text pseudo-element, then:
    selector.get()
"""


@select(css="a.url", priority=2)
async def result_url(selector):
    return {"url": selector.attrib["href"]}


# Option to get url using ::attr(name) pseudo-element
@select(css="a.url::attr(href)", priority=2)
async def result_url2(selector):
    return {"url2": selector.get()}


@select(css=".title::text", priority=1)
async def result_title(selector):
    return {"title": selector.get()}


@select(css=".description::text", priority=0)
async def result_description(selector):
    return {"description": selector.get()}


if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh"], parser="parsel")
```
```php
// create scraper object
$web = new \Spekulatius\PHPScraper\PHPScraper;

// go to URL
$web->go('https://test-pages.phpscraper.de/content/selectors.html');

// elements can be found using XPath:
echo $web->filter("//*[@id='by-id']")->text();   // "Content by ID"

// or pre-defined variables covering basic page data:
$web->links;           // for all links
$web->headings;
$web->images;
$web->contentKeywords;
$web->orderedLists;
$web->unorderedLists;
$web->paragraphs;
$web->outline;                     // basic page outline
$web->cleanOutlineWithParagraphs;  // basic page outline
```
