Ruia is an async web scraping micro-framework built on asyncio and aiohttp that
aims to make crawling URLs as convenient as possible.
Ruia is inspired by Scrapy, but instead of Twisted it is built entirely on asyncio and aiohttp.
It also supports cookies, custom headers, and proxies, which makes it well suited to complex web scraping tasks.
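A minimal sketch of how that configuration looks in Ruia, assuming the `Spider` class honors `headers` and `aiohttp_kwargs` class attributes (the latter appears in Ruia's own examples and is forwarded to aiohttp); the URL, header, and proxy values below are placeholders:

```python
from ruia import Spider


class ConfiguredSpider(Spider):
    # Placeholder target URL.
    start_urls = ["https://example.com"]
    # Sent with every request this spider makes (assumed attribute).
    headers = {"User-Agent": "Mozilla/5.0"}
    # Extra keyword arguments passed through to aiohttp's request call,
    # e.g. a proxy; cookies could be supplied the same way.
    aiohttp_kwargs = {"proxy": "http://127.0.0.1:8080"}

    async def parse(self, response):
        print(response.url)
```

Because `aiohttp_kwargs` is handed straight to aiohttp, any option that aiohttp's client accepts should work here.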
Mechanize is a Ruby library for automating interaction with websites. It automatically
stores and sends cookies, follows redirects, and can submit forms — making it behave
like a web browser without needing an actual browser engine.
Key features include:

- Automatic cookie management: stores cookies received from servers and sends them
  back on subsequent requests, maintaining session state across multiple pages.
- Form handling: finds, fills in, and submits HTML forms programmatically, with
  support for text inputs, selects, checkboxes, radio buttons, and file uploads.
- Link following: navigates through pages by clicking links selected by their text
  content, CSS selectors, or href patterns.
- History and back/forward: maintains a browsing history, allowing you to move back
  and forward through visited pages.
- HTTP authentication: supports basic and digest HTTP authentication.
- Proxy support: can route requests through HTTP proxies.
- Redirect handling: automatically follows HTTP redirects (configurable).
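The authentication, proxy, and redirect options in the list above are configured directly on the agent. A minimal sketch using Mechanize's `set_proxy`, `add_auth`, and redirect accessors; the host, port, URL, and credentials are placeholders:

```ruby
require 'mechanize'

agent = Mechanize.new

# Route all requests through an HTTP proxy (placeholder host/port).
agent.set_proxy('proxy.example.com', 8080)

# Register credentials for HTTP basic/digest authentication
# against a given base URL (placeholder values).
agent.add_auth('https://example.com/', 'user', 'secret')

# Redirect following is on by default; it can be tuned or disabled.
agent.redirect_ok = true
agent.redirection_limit = 5

page = agent.get('https://example.com/protected')
puts page.title
```

These settings apply to every request the agent makes afterwards, so they are typically done once, right after `Mechanize.new`.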
Mechanize is one of the oldest and most established web interaction libraries in Ruby.
It is best suited for scraping traditional server-rendered websites with forms and
multi-page workflows. For JavaScript-heavy sites, a browser automation tool like
Selenium or Playwright is recommended instead.
```python
#!/usr/bin/env python
"""
Target: https://news.ycombinator.com/
pip install aiofiles
"""
import aiofiles

from ruia import AttrField, Item, Spider, TextField


class HackerNewsItem(Item):
    target_item = TextField(css_select="tr.athing")
    title = TextField(css_select="a.storylink")
    url = AttrField(css_select="a.storylink", attr="href")

    async def clean_title(self, value):
        return value.strip()


class HackerNewsSpider(Spider):
    start_urls = [
        "https://news.ycombinator.com/news?p=1",
        "https://news.ycombinator.com/news?p=2",
    ]
    concurrency = 10
    # aiohttp_kwargs = {"proxy": "http://0.0.0.0:1087"}

    async def parse(self, response):
        async for item in HackerNewsItem.get_items(html=await response.text()):
            yield item

    async def process_item(self, item: HackerNewsItem):
        async with aiofiles.open("./hacker_news.txt", "a") as f:
            self.logger.info(item)
            await f.write(str(item.title) + "\n")


if __name__ == "__main__":
    HackerNewsSpider.start(middleware=None)
```
```ruby
require 'mechanize'
agent = Mechanize.new
# Navigate to a page
page = agent.get('https://example.com')
puts page.title
# Find and click a link
page = page.link_with(text: 'Products').click
# Extract data from the page
page.search('.product').each do |product|
  name = product.at('.name').text
  price = product.at('.price').text
  puts "#{name}: #{price}"
end
# Fill in and submit a login form
login_page = agent.get('https://example.com/login')
form = login_page.form_with(action: '/login')
form['username'] = 'user@example.com'
form['password'] = 'password123'
dashboard = agent.submit(form)
# Cookies are maintained automatically
puts dashboard.title # "Dashboard"
# Download a file
agent.get('https://example.com/report.csv').save('report.csv')
```