Gracy is an API client library based on httpx that provides an extra stability layer with:
- Retry logic
- Logging
- Connection throttling
- Tracking/Middleware
In web scraping, Gracy is a convenient foundation for building scraper-oriented API clients.
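To illustrate the kind of stability layer Gracy provides, here is a minimal, framework-free sketch of retry-with-backoff and request throttling in plain Python. The names (`with_retry`, `Throttle`, `flaky_request`) are illustrative only and are not Gracy's API:

```python
import time

def with_retry(func, max_attempts=3, base_delay=0.01, backoff=2.0):
    """Call func, retrying on exception with exponential backoff."""
    def wrapper(*args, **kwargs):
        delay = base_delay
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # retries exhausted, propagate the error
                time.sleep(delay)
                delay *= backoff
    return wrapper

class Throttle:
    """Allow at most `rate` calls per second."""
    def __init__(self, rate):
        self.min_interval = 1.0 / rate
        self.last_call = 0.0

    def wait(self):
        sleep_for = self.min_interval - (time.monotonic() - self.last_call)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last_call = time.monotonic()

# Usage: a stand-in "request" that fails twice, then succeeds
attempts = {"n": 0}
def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return {"status": 200}

throttle = Throttle(rate=100)
throttle.wait()
result = with_retry(flaky_request)()
print(result)  # {'status': 200}, after two retried failures
```

Gracy configures equivalent behavior declaratively per endpoint, so scraper code stays free of this boilerplate.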
Kimurai is a modern web scraping framework for Ruby, inspired by Python's Scrapy. It provides
a structured approach to building web scrapers with built-in support for multiple browser
engines, session management, and data pipelines.
Key features include:
- Multiple engine support
Can use different backends depending on the scraping needs: Mechanize for simple HTTP
requests, Selenium with headless Chrome/Firefox for JavaScript-rendered pages, and
Poltergeist (PhantomJS) for lightweight rendering.
- Scrapy-like architecture
Follows the spider pattern: define a spider class with start URLs and parsing methods,
and the framework handles crawling, scheduling, and data collection.
- Built-in data pipelines
Save scraped data to JSON, CSV, or custom formats with configurable output pipelines.
- Session management
Maintains browser sessions with automatic cookie handling and configurable delays
between requests.
- Request scheduling
Built-in request queue with configurable concurrency, delays, and retry logic.
- CLI tools
Command-line tools for generating new spiders, running individual spiders, and
managing scraping projects.
Kimurai is the closest Ruby equivalent to Scrapy. It's well-suited for structured
scraping projects that need organization, multiple spiders, and data pipeline processing.
Note: Kimurai has not seen active development in recent years, but it remains a useful
framework for Ruby scraping projects and is included here as the most complete Ruby
scraping framework available.
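The spider pattern that Kimurai borrows from Scrapy can be reduced to a request queue plus parse callbacks. The following toy sketch (plain Python, not Kimurai's implementation; the fake `PAGES` site and all names are illustrative) shows the core loop:

```python
from collections import deque

# Fake two-page site standing in for real HTTP responses
PAGES = {
    "/products?page=1": {"products": ["a", "b"], "next": "/products?page=2"},
    "/products?page=2": {"products": ["c"], "next": None},
}

class Spider:
    """Toy spider pattern: a queue of (callback, url) pairs."""
    start_urls = []

    def __init__(self):
        self.queue = deque((self.parse, url) for url in self.start_urls)
        self.items = []

    def crawl(self):
        # The framework's job: drain the queue, fetching and parsing each URL
        while self.queue:
            callback, url = self.queue.popleft()
            callback(self.fetch(url), url)
        return self.items

    def fetch(self, url):
        # Stand-in for a real engine (Mechanize/Selenium in Kimurai)
        return PAGES[url]

    def request_to(self, callback, url):
        self.queue.append((callback, url))

    def save(self, item):
        self.items.append(item)

class ProductSpider(Spider):
    start_urls = ["/products?page=1"]

    def parse(self, page, url):
        for name in page["products"]:
            self.save({"name": name, "source": url})
        if page["next"]:  # follow pagination
            self.request_to(self.parse, page["next"])

items = ProductSpider().crawl()
print([i["name"] for i in items])  # ['a', 'b', 'c']
```

The user only defines `start_urls` and parse callbacks; scheduling, deduplication, throttling, and output pipelines are the framework's concern.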
```python
# 0. Imports
import asyncio

import httpx
from gracy import BaseEndpoint, Gracy, GracyConfig, LogEvent, LogLevel

# 1. Define your endpoints
class PokeApiEndpoint(BaseEndpoint):
    GET_POKEMON = "/pokemon/{NAME}"  # 👈 Put placeholders as needed

# 2. Define your Graceful API
class GracefulPokeAPI(Gracy[str]):
    class Config:  # type: ignore
        BASE_URL = "https://pokeapi.co/api/v2/"  # 👈 Optional BASE_URL

        # 👇 Define settings to apply to every request
        SETTINGS = GracyConfig(
            log_request=LogEvent(LogLevel.DEBUG),
            log_response=LogEvent(LogLevel.INFO, "{URL} took {ELAPSED}"),
            parser={"default": lambda r: r.json()},
        )

    async def get_pokemon(self, name: str) -> dict:
        return await self.get(PokeApiEndpoint.GET_POKEMON, {"NAME": name})

    # Note: since Gracy is based on httpx, we can customize the underlying
    # client, e.g. to set custom headers
    def _create_client(self) -> httpx.AsyncClient:
        client = super()._create_client()
        client.headers["User-Agent"] = "My Scraper"
        return client

pokeapi = GracefulPokeAPI()

async def main():
    try:
        pokemon = await pokeapi.get_pokemon("pikachu")
        print(pokemon)
    finally:
        pokeapi.report_status("rich")

asyncio.run(main())
```
```ruby
require 'kimurai'

class ProductSpider < Kimurai::Base
  @name = 'product_spider'
  @engine = :selenium_chrome  # or :mechanize for simple pages
  @start_urls = ['https://example.com/products']

  def parse(response, url:, data: {})
    # Extract product data from the current page
    response.css('.product').each do |product|
      item = {
        name: product.css('.name').text.strip,
        price: product.css('.price').text.strip,
        url: absolute_url(product.at_css('a')['href'], base: url)
      }
      # Send the item to the pipeline
      save_to "products.json", item, format: :json
    end

    # Follow pagination links
    if next_page = response.at_css('a.next-page')
      request_to :parse, url: absolute_url(next_page['href'], base: url)
    end
  end
end

# Run the spider
ProductSpider.crawl!
```