Web scraping has long been an exercise in patience: inspecting HTML, writing fragile CSS selectors, handling dynamic JavaScript pages, and starting all over again when the site changes its structure. AI completely changes the game. Instead of parsing HTML, we can now understand the page and extract information intelligently.
In this guide, we explore AI-assisted scraping: how it works, which tools to use, concrete examples, and the ethical and legal considerations to be aware of.
🔄 Classic Scraping vs AI Scraping
Classic Scraping: Fragile and Laborious
Traditional scraping relies on analyzing the HTML structure:
```python
# Classic scraping with BeautifulSoup
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = []
for card in soup.select("div.product-card"):
    name = card.select_one("h3.product-title").text.strip()
    price = card.select_one("span.price").text.strip()
    products.append({"name": name, "price": price})

# ❌ Problems:
# - If the site changes "div.product-card" → everything breaks
# - JavaScript pages → requests is not enough
# - Data in various formats → fragile regex
# - Anti-bot → headers, cookies, captchas
```
AI Scraping: Understanding Instead of Parsing
AI scraping uses an LLM to understand the content of the page, not its HTML structure:
```python
# AI scraping: the LLM understands the page
import json

from openai import OpenAI

client = OpenAI()

# Retrieve the page content as text (no need to parse HTML);
# fetch_page_as_text is a placeholder for any HTML-to-text fetcher
page_content = fetch_page_as_text("https://example.com/products")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": """Extract all products
        from this page. For each product, provide:
        - name, price, description, availability
        Respond in JSON."""},
        {"role": "user", "content": page_content}
    ],
    response_format={"type": "json_object"}
)

products = json.loads(response.choices[0].message.content)

# ✅ Works even if the HTML structure changes
# ✅ Understands context ("out of stock", "promo -30%")
# ✅ Extracts implicit information
```
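One caveat worth handling explicitly: even with JSON mode, nothing guarantees the model's output matches the schema you asked for (a price may come back as a string like `"19,99€"`, or a product may be missing its name). A minimal validation pass before the data reaches downstream code catches this drift. The helper below is a sketch under assumed field names (`products`, `name`, `price`); adapt it to your own schema.

```python
# Minimal validation of LLM-extracted product data (hypothetical schema:
# each product needs a "name" and a price coercible to float).
def validate_products(data: dict) -> tuple[list[dict], list[str]]:
    """Split LLM output into valid products and error messages."""
    valid, errors = [], []
    for i, item in enumerate(data.get("products", [])):
        if not isinstance(item, dict) or not item.get("name"):
            errors.append(f"item {i}: missing name")
            continue
        try:
            # Normalize European-style prices ("19,99€" → 19.99)
            item["price"] = float(
                str(item.get("price", "")).replace("€", "").replace(",", ".")
            )
        except ValueError:
            errors.append(f"item {i}: unparseable price {item.get('price')!r}")
            continue
        valid.append(item)
    return valid, errors
```

Feed it the parsed JSON and log the errors; rerunning the extraction on failing items is usually cheaper than debugging silent bad data later.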
Detailed Comparison
| Aspect | Classic Scraping | AI Scraping |
|---|---|---|
| Robustness | ⭐⭐ Breaks if HTML changes | ⭐⭐⭐⭐⭐ Understands content |
| Speed | ⭐⭐⭐⭐⭐ Very fast | ⭐⭐⭐ Slower (LLM call) |
| Cost | ⭐⭐⭐⭐⭐ Almost free | ⭐⭐⭐ LLM API cost |
| Precision | ⭐⭐⭐⭐ If well-coded | ⭐⭐⭐⭐ Very good on textual data |
| Structured Data | ⭐⭐⭐ Requires work | ⭐⭐⭐⭐⭐ Native JSON |
| Setup | ⭐⭐ Slow (selectors, debug) | ⭐⭐⭐⭐⭐ Fast (prompt + go) |
| Maintenance | ⭐⭐ Breaks regularly | ⭐⭐⭐⭐ Rarely needs update |
| Volume | ⭐⭐⭐⭐⭐ Millions of pages | ⭐⭐⭐ Limited by costs/rate limits |
💡 The best approach: combine both. Use classic scraping to retrieve raw content quickly, then AI to extract and structure it intelligently.
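The hybrid pipeline can be sketched with the standard library alone: classic tooling fetches the page and strips it down to visible text (fast and free), and only that reduced text is passed to the LLM extraction step. The tag stripper below is deliberately minimal; in practice a markdown converter or a reader service does this better, but the shape of the pipeline is the same.

```python
# Hybrid approach sketch: cheap local HTML-to-text, then LLM extraction.
# Only the reduction step is implemented; the LLM call is the one shown earlier.
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce raw HTML to its visible text, often 10x fewer tokens."""
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

Sending the reduced text instead of raw HTML cuts the LLM bill proportionally, which is exactly where the hybrid approach earns its keep at volume.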
🛠️ AI Scraping Tools
1. web_fetch + LLM: The Simple Method
The most direct method: retrieve a page's content as text/markdown, then have it analyzed by an LLM.
```python
import httpx
from openai import OpenAI

client = OpenAI()

# Step 1: Retrieve the page as readable text
# (using a service like Jina Reader or a markdown extractor)
response = httpx.get(
    "https://r.jina.ai/https://example.com/article",
    headers={"Accept": "text/markdown"}
)
page_markdown = response.text

# Step 2: Extract data with the LLM
extraction = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": """You are a data extractor.
        Extract the requested information in JSON format.
        If a piece of information is not found, use null."""},
        {"role": "user", "content": f"""Web page:
{page_markdown}

Extract: title, author, date, summary (2 sentences), tags"""}
    ]
)

print(extraction.choices[0].message.content)
```
If you use OpenClaw, the web_fetch tool is natively integrated:
```text
# OpenClaw can directly fetch + analyze
# The agent uses web_fetch, then the LLM to structure the result
User: "Retrieve the 5 latest articles from TechCrunch about AI"
# → OpenClaw fetches the page, extracts, structures in JSON
```
2. Browser Automation + AI: For Dynamic Sites
For sites that require JavaScript, scrolling, clicks, or authentication:
```python
from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate like a human
    page.goto("https://example.com/search")
    page.fill("#search-input", "artificial intelligence")
    page.click("button[type=submit]")
    page.wait_for_selector(".results")

    # Scroll to load more results
    for _ in range(3):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)

    # Retrieve the visible content
    content = page.inner_text("body")

    # AI extracts the structured data
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Extract all search results.
            JSON format: [{title, url, snippet, date}]"""},
            {"role": "user", "content": content}
        ]
    )

    browser.close()
```
OpenClaw integrates complete browser automation that can navigate, click, fill forms, and extract data automatically:
```text
# OpenClaw browser automation
# The agent can control a browser + understand pages
User: "Go to LinkedIn, search for 'data engineer Paris' profiles"
# → OpenClaw opens the browser, navigates, extracts the profiles
```
3. API + LLM: The Clean Method
Before scraping, always check if an API exists! Many sites offer one:
```python
# Clean method: API + LLM to structure
import json

import httpx
from openai import OpenAI

client = OpenAI()

# Public API (example: Hacker News)
stories = httpx.get(
    "https://hacker-news.firebaseio.com/v0/topstories.json"
).json()[:10]

articles = []
for story_id in stories:
    story = httpx.get(
        f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
    ).json()
    articles.append(story)

# LLM analyzes and categorizes
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": """Analyze these tech articles.
        For each, add: category (AI, web, hardware, other),
        relevance (1-10), one-sentence summary."""},
        {"role": "user", "content": json.dumps(articles, indent=2)}
    ]
)
```
4. Vision Models: Scraping Images and Screenshots
Vision models can "read" screenshots — useful for highly visual sites, or for pages whose anti-scraping measures obfuscate the HTML:
```python
import base64

from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

# Screenshot of the page
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/dashboard")
    screenshot = page.screenshot()  # PNG bytes
    browser.close()

# Send to a vision model
b64_image = base64.b64encode(screenshot).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all data "
                "from this dashboard: metrics, graphs, tables. "
                "Structured JSON format."},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{b64_image}"
            }}
        ]
    }]
)
```
📋 Concrete Examples
Example 1: Competitive Monitoring
Automatically monitor competitors' prices and offers:
```python
import json
from datetime import datetime

import httpx
from openai import OpenAI

client = OpenAI()

COMPETITORS = [
    {"name": "Competitor A", "url": "https://competitor-a.com/pricing"},
    {"name": "Competitor B", "url": "https://competitor-b.com/tariffs"},
    {"name": "Competitor C", "url": "https://competitor-c.com/plans"},
]

def analyze_pricing(name: str, url: str) -> dict:
    """Scrape and analyze a competitor's pricing page."""
    # Retrieve the content
    resp = httpx.get(f"https://r.jina.ai/{url}")

    # AI extraction
    analysis = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Analyze this pricing page.
            Extract in JSON format:
            {
              "plans": [{"name": str, "monthly_price": float,
                         "annual_price": float, "features": [str]}],
              "special_offer": str or null,
              "free_trial": bool,
              "notable_changes": str or null
            }"""},
            {"role": "user", "content": resp.text[:8000]}
        ],
        response_format={"type": "json_object"}  # guarantees parseable JSON
    )
    result = json.loads(analysis.choices[0].message.content)
    result["competitor"] = name
    result["analysis_date"] = datetime.now().isoformat()
    return result

# Run the analysis
report = [analyze_pricing(c["name"], c["url"]) for c in COMPETITORS]

# Save
with open(f"monitoring_{datetime.now():%Y-%m-%d}.json", "w") as f:
    json.dump(report, f, indent=2, ensure_ascii=False)
```
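Monitoring only pays off if changes are detected between runs. A minimal diff between two saved reports could look like the sketch below; it assumes the JSON shape requested from the LLM above (`plans`, `name`, `monthly_price`), and the helper name `diff_pricing` is ours.

```python
# Compare two pricing reports (hypothetical helper, assumed schema:
# {"plans": [{"name": ..., "monthly_price": ...}, ...]}).
def diff_pricing(old: dict, new: dict) -> list[str]:
    """List plans whose monthly price changed, plus newly added plans."""
    old_prices = {p["name"]: p.get("monthly_price") for p in old.get("plans", [])}
    changes = []
    for plan in new.get("plans", []):
        before = old_prices.get(plan["name"])
        after = plan.get("monthly_price")
        if before is None:
            changes.append(f"new plan: {plan['name']} at {after}")
        elif before != after:
            changes.append(f"{plan['name']}: {before} -> {after}")
    return changes
```

Run it against yesterday's saved JSON file and send the resulting list as an alert; an empty list means nothing moved.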
Example 2: News Aggregation
Create your own personalized AI news feed:
```python
import json

import feedparser
from openai import OpenAI

client = OpenAI()

SOURCES = [
    "https://techcrunch.com/feed/",
    "https://www.theverge.com/rss/index.xml",
    "https://feeds.arstechnica.com/arstechnica/index",
]

def aggregate_news(topic: str = "artificial intelligence") -> list[dict]:
    articles = []
    for feed_url in SOURCES:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries[:10]:
            articles.append({
                "title": entry.title,
                "link": entry.link,
                "source": feed.feed.title,
                "date": entry.get("published", ""),
                "summary": entry.get("summary", "")[:500]
            })

    # AI filtering and analysis
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"""You are a news curator.
            From the following articles, select those related to
            "{topic}". For each retained article, provide:
            - title, summary (2 sentences), score (relevance, 1-10), tags (3 max)
            Sort by decreasing relevance.
            Respond in JSON: {{"articles": [...]}}"""},
            {"role": "user", "content": json.dumps(articles, ensure_ascii=False)}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)["articles"]

# Usage
news = aggregate_news("artificial intelligence")
for article in news[:5]:
    print(f"📰 {article['title']}")
    print(f"   {article['summary']}")
    print(f"   Relevance: {article['score']}/10")
    print()
```
Example 3: Price Monitoring
Monitor product prices and receive alerts:
```python
import json
import sqlite3
from datetime import datetime

import httpx
from openai import OpenAI

client = OpenAI()

def monitor_price(url: str, product: str) -> dict:
    """Retrieve and analyze a product's price."""
    resp = httpx.get(f"https://r.jina.ai/{url}")

    extraction = client.chat.completions.create(
        model="gpt-4o-mini",
        # ... (rest of the code remains the same)
```