
Intelligent Scraping with AI

Automation 🟡 Intermediate ⏱️ 15 min read 📅 2026-02-24

Web scraping has long been an exercise in patience: inspecting HTML, writing fragile CSS selectors, handling dynamic JavaScript pages, and starting all over again when the site changes its structure. AI completely changes the game. Instead of parsing HTML, we can now understand the page and extract information intelligently.

In this guide, we explore AI-assisted scraping: how it works, which tools to use, concrete examples, and the ethical and legal considerations to be aware of.

🔄 Classic Scraping vs AI Scraping

Classic Scraping: Fragile and Laborious

Traditional scraping relies on analyzing the HTML structure:

# Classic scraping with BeautifulSoup
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

products = []
for card in soup.select("div.product-card"):
    name = card.select_one("h3.product-title").text.strip()
    price = card.select_one("span.price").text.strip()
    products.append({"name": name, "price": price})

# ❌ Problems:
# - If the site changes "div.product-card" → everything breaks
# - JavaScript pages → requests is not enough
# - Data in various formats → fragile regex
# - Anti-bot → headers, cookies, captchas

AI Scraping: Understanding Instead of Parsing

AI scraping uses an LLM to understand the content of the page, not its HTML structure:

# AI scraping: the LLM understands the page
import json

from openai import OpenAI

client = OpenAI()

# Retrieve the page content as readable text (no need to parse the HTML).
# fetch_page_as_text is a placeholder: any HTML-to-text extractor works,
# e.g. the Jina Reader service used later in this guide.
page_content = fetch_page_as_text("https://example.com/products")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": """Extract all products 
        from this page. For each product, provide:
        - name, price, description, availability
        Respond in JSON."""},
        {"role": "user", "content": page_content}
    ],
    response_format={"type": "json_object"}
)

products = json.loads(response.choices[0].message.content)
# ✅ Works even if the HTML structure changes
# ✅ Understands context ("out of stock", "promo -30%")
# ✅ Extracts implicit information

Detailed Comparison

| Aspect | Classic Scraping | AI Scraping |
|---|---|---|
| Robustness | ⭐⭐ Breaks if HTML changes | ⭐⭐⭐⭐⭐ Understands content |
| Speed | ⭐⭐⭐⭐⭐ Very fast | ⭐⭐⭐ Slower (LLM call) |
| Cost | ⭐⭐⭐⭐⭐ Almost free | ⭐⭐⭐ LLM API cost |
| Precision | ⭐⭐⭐⭐ If well-coded | ⭐⭐⭐⭐ Very good on textual data |
| Structured data | ⭐⭐⭐ Requires work | ⭐⭐⭐⭐⭐ Native JSON |
| Setup | ⭐⭐ Slow (selectors, debugging) | ⭐⭐⭐⭐⭐ Fast (prompt + go) |
| Maintenance | ⭐⭐ Breaks regularly | ⭐⭐⭐⭐ Rarely needs updates |
| Volume | ⭐⭐⭐⭐⭐ Millions of pages | ⭐⭐⭐ Limited by cost/rate limits |
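To make the cost and volume rows concrete, here is a back-of-the-envelope estimator. The per-token prices and the 4-characters-per-token ratio are rough assumptions for illustration, not current rates — check your provider's pricing page:

```python
# Rough cost estimate for LLM-based extraction.
# Assumptions: ~4 characters per token (common heuristic) and
# illustrative per-token prices -- check current provider pricing.
PRICE_PER_1M_INPUT = 0.15   # USD per 1M input tokens (hypothetical)
PRICE_PER_1M_OUTPUT = 0.60  # USD per 1M output tokens (hypothetical)

def estimate_cost(pages: int, chars_per_page: int, output_tokens: int = 500) -> float:
    """Approximate USD cost of extracting `pages` pages with an LLM."""
    input_tokens = pages * chars_per_page / 4  # ~4 chars per token
    return (input_tokens * PRICE_PER_1M_INPUT
            + pages * output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

# 10,000 pages of ~20,000 characters each
print(f"${estimate_cost(10_000, 20_000):.2f}")  # → $10.50
```

A few dollars for thousands of pages is often acceptable for daily monitoring, but it is exactly why classic scraping still wins at the millions-of-pages scale.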

💡 The best approach: combine both. Use classic scraping to retrieve raw content quickly, then AI to extract and structure it intelligently.
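A minimal sketch of that hybrid pipeline: strip markup locally (here with the standard library's `html.parser`) so only readable text — far fewer tokens — reaches the LLM. `TextExtractor` and `html_to_text` are illustrative helpers, not a library API:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags and skip <script>/<style> so only readable text
    reaches the LLM -- fewer tokens, same information."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str, max_chars: int = 8000) -> str:
    """Classic step: cheap, local HTML-to-text cleanup + truncation."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)[:max_chars]

# AI step (not run here): pass html_to_text(response.text) to the LLM
# exactly as in the AI scraping example above.
print(html_to_text("<div><script>x=1</script><h3>Widget</h3><span>19.99 €</span></div>"))
```

The classic half is nearly free and keeps the prompt small; the AI half only does what it is uniquely good at, understanding the remaining text.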

🛠️ AI Scraping Tools

1. web_fetch + LLM: The Simple Method

The most direct method: retrieve a page's content as text/markdown, then have it analyzed by an LLM.

import httpx
from openai import OpenAI

client = OpenAI()

# Step 1: Retrieve the page as readable text
# (using a service like Jina Reader or a markdown extractor)
response = httpx.get(
    "https://r.jina.ai/https://example.com/article",
    headers={"Accept": "text/markdown"}
)
page_markdown = response.text

# Step 2: Extract data with LLM
extraction = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": """You are a data extractor.
        Extract requested information in JSON format.
        If information is not found, put null."""},
        {"role": "user", "content": f"""Web page:
{page_markdown}

Extract: title, author, date, summary (2 sentences), tags"""}
    ]
)
print(extraction.choices[0].message.content)

If you use OpenClaw, the web_fetch tool is natively integrated:

# OpenClaw can directly fetch + analyze
# The agent uses web_fetch then LLM to structure
User: "Retrieve the 5 latest articles from TechCrunch about AI"
# → OpenClaw fetches the page, extracts, structures in JSON

2. Browser Automation + AI: For Dynamic Sites

For sites that require JavaScript, scrolling, clicks, or authentication:

from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate like a human
    page.goto("https://example.com/search")
    page.fill("#search-input", "artificial intelligence")
    page.click("button[type=submit]")
    page.wait_for_selector(".results")

    # Scroll to load more results
    for _ in range(3):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)

    # Retrieve visible content
    content = page.inner_text("body")

    # AI to extract structured data
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Extract all search results.
            JSON format: [{title, url, snippet, date}]"""},
            {"role": "user", "content": content}
        ]
    )

    browser.close()

OpenClaw integrates complete browser automation that can navigate, click, fill forms, and extract data automatically:

# OpenClaw browser automation
# The agent can control a browser + understand pages
User: "Go to LinkedIn, search for 'data engineer Paris' profiles"
# → OpenClaw opens the browser, navigates, extracts profiles

3. API + LLM: The Clean Method

Before scraping, always check if an API exists! Many sites offer one:

# Clean method: API + LLM to structure
import json

import httpx
from openai import OpenAI

client = OpenAI()

# Public API (example: Hacker News)
stories = httpx.get(
    "https://hacker-news.firebaseio.com/v0/topstories.json"
).json()[:10]

articles = []
for story_id in stories:
    story = httpx.get(
        f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
    ).json()
    articles.append(story)

# LLM to analyze and categorize
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": """Analyze these tech articles.
        For each, add: category (AI, web, hardware, other),
        relevance (1-10), summary in 1 sentence."""},
        {"role": "user", "content": json.dumps(articles, indent=2)}
    ]
)
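Even against a public API, loops like the one above should space out their requests — both for politeness and to stay under rate limits. A minimal throttle sketch; `Throttle` is a hypothetical helper, and the right interval depends on the site's robots.txt and API terms:

```python
import time

class Throttle:
    """Minimal rate limiter: guarantees at most one call per
    `interval` seconds -- a polite default for scraping loops."""

    def __init__(self, interval: float):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to honor the interval since the last call
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(interval=1.0)  # 1 request/second; adjust to the site's rules
# for story_id in stories:
#     throttle.wait()
#     story = httpx.get(...).json()
```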

4. Vision Models: Scraping Images and Screenshots

Vision models can "read" screenshots — useful for very visual sites or anti-scraping:

import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

# Screenshot of the page
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/dashboard")
    screenshot = page.screenshot()  # PNG bytes
    browser.close()

# Send to vision model
b64_image = base64.b64encode(screenshot).decode()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all data "
             "from this dashboard: metrics, graphs, tables. "
             "Structured JSON format."},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{b64_image}"
            }}
        ]
    }]
)

📋 Concrete Examples

Example 1: Competitive Monitoring

Automatically monitor competitors' prices and offers:

import json
import httpx
from openai import OpenAI
from datetime import datetime

client = OpenAI()

COMPETITORS = [
    {"name": "Competitor A", "url": "https://competitor-a.com/pricing"},
    {"name": "Competitor B", "url": "https://competitor-b.com/tariffs"},
    {"name": "Competitor C", "url": "https://competitor-c.com/plans"},
]

def analyze_pricing(name: str, url: str) -> dict:
    """Scrape and analyze a competitor's pricing page."""
    # Retrieve content
    resp = httpx.get(f"https://r.jina.ai/{url}")

    # AI extraction
    analysis = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Analyze this pricing page.
            Extract in JSON format:
            {
              "plans": [{"name": str, "monthly_price": float, 
                         "annual_price": float, "features": [str]}],
              "special_offer": str or null,
              "free_trial": bool,
              "notable_changes": str or null
            }"""},
            {"role": "user", "content": resp.text[:8000]}
        ],
        response_format={"type": "json_object"}
    )

    result = json.loads(analysis.choices[0].message.content)
    result["competitor"] = name
    result["analysis_date"] = datetime.now().isoformat()
    return result

# Run analysis
report = [analyze_pricing(c["name"], c["url"]) for c in COMPETITORS]

# Save
with open(f"monitoring_{datetime.now():%Y-%m-%d}.json", "w") as f:
    json.dump(report, f, indent=2, ensure_ascii=False)
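A daily snapshot only becomes monitoring once it is compared with the previous run. A sketch of that diff step over the JSON produced above; `diff_prices` is a hypothetical helper, and the report shape follows the schema in the prompt:

```python
def diff_prices(previous: list[dict], current: list[dict]) -> list[str]:
    """Flag plans whose monthly price changed between two reports."""
    # Index yesterday's prices by (competitor, plan name)
    old = {(r["competitor"], p["name"]): p["monthly_price"]
           for r in previous for p in r.get("plans", [])}
    alerts = []
    for r in current:
        for p in r.get("plans", []):
            key = (r["competitor"], p["name"])
            if key in old and old[key] != p["monthly_price"]:
                alerts.append(
                    f"{r['competitor']} / {p['name']}: "
                    f"{old[key]} → {p['monthly_price']}"
                )
    return alerts

yesterday = [{"competitor": "Competitor A",
              "plans": [{"name": "Pro", "monthly_price": 29.0}]}]
today = [{"competitor": "Competitor A",
          "plans": [{"name": "Pro", "monthly_price": 35.0}]}]
print(diff_prices(yesterday, today))  # → ['Competitor A / Pro: 29.0 → 35.0']
```

The resulting alert strings can then be sent wherever you like (email, Slack, a log).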

Example 2: News Aggregation

Create your own personalized AI news feed:

import json

import feedparser
from openai import OpenAI

client = OpenAI()

SOURCES = [
    "https://techcrunch.com/feed/",
    "https://www.theverge.com/rss/index.xml",
    "https://feeds.arstechnica.com/arstechnica/index",
]

def aggregate_news(topic: str = "artificial intelligence"):
    articles = []

    for feed_url in SOURCES:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries[:10]:
            articles.append({
                "title": entry.title,
                "link": entry.link,
                "source": feed.feed.title,
                "date": entry.get("published", ""),
                "summary": entry.get("summary", "")[:500]
            })

    # AI filtering and analysis
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"""You are a news curator.
            From the following articles, select those related to
            "{topic}". For each retained article, return:
            - title, link
            - summary: 2 sentences
            - score: relevance (1-10)
            - tags: 3 max

            Sort by decreasing relevance. Respond in JSON:
            {{"articles": [...]}}"""},
            {"role": "user", "content": json.dumps(articles, ensure_ascii=False)}
        ],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)["articles"]

# Usage
news = aggregate_news("artificial intelligence")
for article in news[:5]:
    print(f"📰 {article['title']}")
    print(f"   {article['summary']}")
    print(f"   Relevance: {article['score']}/10")
    print()
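Since feeds overlap heavily between runs, it is worth deduplicating entries against a persisted set of already-processed links before paying for the LLM call. A sketch, with `seen_links.json` as an assumed cache location and `filter_new` as a hypothetical helper:

```python
import json
from pathlib import Path

SEEN_FILE = Path("seen_links.json")  # assumed cache location

def filter_new(articles: list[dict], seen_file: Path = SEEN_FILE) -> list[dict]:
    """Keep only articles whose link has not been processed before,
    then persist the updated set of seen links."""
    seen = set(json.loads(seen_file.read_text())) if seen_file.exists() else set()
    fresh = [a for a in articles if a["link"] not in seen]
    seen.update(a["link"] for a in fresh)
    seen_file.write_text(json.dumps(sorted(seen)))
    return fresh

# In aggregate_news, call filter_new(articles) before the LLM step:
# only genuinely new entries are summarized and scored.
```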

Example 3: Price Monitoring

Monitor product prices and receive alerts:

import json
import sqlite3
from datetime import datetime
import httpx
from openai import OpenAI

client = OpenAI()

def monitor_price(url: str, product: str) -> dict:
    """Retrieve and analyze a product's price."""
    resp = httpx.get(f"https://r.jina.ai/{url}")

    extraction = client.chat.completions.create(
        model="gpt-4o-mini",
        # ... (rest of the code remains the same)