Smart Scraping: Enrich Your Data with Hunter, Phantombuster & AI

Automation 🔴 Advanced ⏱️ 10 min read 📅 2026-05-05

🔄 Introduction: From Raw File to Clean Pipeline

You exported 2,000 contacts from LinkedIn, Hunter, and Phantombuster. The problem? Data comes in 3 different formats, columns don't match, duplicates are everywhere, and about 30% of emails are probably invalid.

Without cleaning, this file is worthless. A dirty sales lead file costs more than it earns — bouncing emails, wasted time on bad contacts, polluted CRM.

This guide shows you how to use the Hunter.io API, the Phantombuster API, and AI to turn raw data into a clean, enriched, and actionable file.


📋 Table of Contents

  1. The Problem with Raw Data
  2. Architecture of a Cleaning Pipeline
  3. Hunter.io API: Enrichment & Verification
  4. Phantombuster API: Automated Extraction
  5. Clean & Normalize with Python
  6. Enrich with AI
  7. Automate the Full Pipeline

1. The Problem with Raw Data

What Can Go Wrong

When you scrape data from multiple sources, here are the common issues:

Problem | Example | Impact
Duplicates | Same contact 3 times | Inflated volume, spam
Invalid emails | jean.dupont@gmail (missing extension) | Bounce, destroyed email reputation
Inconsistent formats | Jean, jean, Mr. Jean | Impossible to match
Missing data | No job title, no company | Hard to personalize
Outdated data | Contact changed jobs | Irrelevant message
Broken encoding | Jean Dupont \xa0 Paris | CSV errors, display issues
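
The encoding and format issues in the table can often be handled with the standard library alone. Here is a minimal sketch; the function names and the list of honorifics are illustrative assumptions, not part of any tool mentioned in this guide.

```python
import re
import unicodedata

# Hypothetical list of honorifics to strip; extend it for your own data.
HONORIFICS = re.compile(r"^(mr|mrs|ms|dr|m|mme)\.?\s+", re.IGNORECASE)

def clean_text(value: str) -> str:
    """Normalize Unicode and collapse runs of whitespace."""
    # NFKC turns the non-breaking space (\xa0) into a regular space
    text = unicodedata.normalize("NFKC", value)
    return re.sub(r"\s+", " ", text).strip()

def clean_first_name(value: str) -> str:
    """Drop a leading honorific and apply consistent capitalization."""
    return HONORIFICS.sub("", clean_text(value)).title()

print(clean_text("Jean Dupont \xa0 Paris"))
print([clean_first_name(v) for v in ["Jean", "jean", "Mr. Jean"]])
```

With this in place, the three spellings "Jean", "jean", and "Mr. Jean" collapse to a single value that can be matched across sources.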

The Real Cost of a Dirty File

  • Bounce rate > 2%: your email domain risks being blacklisted
  • 50% duplicates: you think you have 1,000 leads, you really have 500
  • Unnormalized data: impossible to run scoring or segmentation

💡 Rule: invest 30% of your prospecting time in data cleaning. The remaining 70% (outreach) will be 3x more effective.


2. Architecture of a Cleaning Pipeline

A robust pipeline follows these steps in order:

EXTRACTION → DEDUPLICATION → NORMALIZATION → VERIFICATION → ENRICHMENT → EXPORT

Each step depends on the previous one: no enrichment before normalization, and no verification before deduplication, since every duplicate you verify consumes an API credit for nothing.
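
One way to enforce that ordering is to model each step as a function and run them from an ordered list. This is a rough sketch, assuming records are plain dictionaries with an `email` key; `deduplicate`, `normalize`, and `run_pipeline` are hypothetical names, and the remaining steps are left out for brevity.

```python
def deduplicate(records: list) -> list:
    """Keep the first record seen for each email (compared case-insensitively)."""
    seen, unique = set(), []
    for record in records:
        key = record["email"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def normalize(records: list) -> list:
    """Return copies of the records with a lowercased, trimmed email."""
    return [{**r, "email": r["email"].strip().lower()} for r in records]

def run_pipeline(records: list, steps: list) -> list:
    """Apply each step in order; the output of one step feeds the next."""
    for step in steps:
        records = step(records)
    return records

raw = [{"email": "A@x.com "}, {"email": "a@x.com"}, {"email": "b@y.com"}]
clean = run_pipeline(raw, [deduplicate, normalize])
print(len(raw), "->", len(clean))
```

Keeping the order in one list makes it hard to run verification or enrichment on data that has not been deduplicated yet.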

Tools per Step

Step | Tool | Automatable?
Extraction | Phantombuster, Apify, Prospeo | ✅ API + scheduling
Deduplication | Python (pandas) | ✅ Script
Normalization | Python + regex | ✅ Script
Email Verification | Hunter.io, Prospeo | ✅ API
AI Enrichment | ChatGPT API, Claude API | ✅ API
Export | CSV, Google Sheets, CRM | ✅ Script

3. Hunter.io API: Enrichment & Verification

Hunter.io exposes a complete REST API to automate email search, verification, and enrichment.

Get an API Key

# Create a Hunter.io account
# Go to Dashboard → API → Create API Key
# Store in .env
echo "HUNTER_API_KEY=***" >> .env

Domain Search: Find Company Emails

import requests

API_KEY="***"
domain = "stripe.com"

response = requests.get(
    "https://api.hunter.io/v2/domain-search",
    params={
        "domain": domain,
        "api_key": API_KEY,
        "limit": 50
    }
)

data = response.json()
for email in data.get("data", {}).get("emails", []):
    print(f"{email['value']} - {email['first_name']} {email['last_name']} - {email['position']}")
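
To feed these results into the cleaning step later on, it helps to flatten the response into rows with the same column names used in the rest of this guide. The sketch below assumes the response fields printed above (`value`, `first_name`, `last_name`, `position`); the helper name and the sample payload are made up for illustration.

```python
def flatten_domain_search(payload: dict) -> list:
    """Turn a Domain Search response into rows with the pipeline's column names."""
    rows = []
    for item in payload.get("data", {}).get("emails", []):
        rows.append({
            "email": (item.get("value") or "").strip().lower(),
            # Fields can be null in the response, so fall back to an empty string
            "firstName": item.get("first_name") or "",
            "lastName": item.get("last_name") or "",
            "jobTitle": item.get("position") or "",
        })
    return rows

# Hypothetical payload, shaped like the fields used in the loop above
sample = {"data": {"emails": [
    {"value": "Jane.Doe@example.com", "first_name": "Jane",
     "last_name": "Doe", "position": None},
]}}
print(flatten_domain_search(sample))
```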

Email Verifier: Verify an Email

response = requests.get(
    "https://api.hunter.io/v2/email-verifier",
    params={
        "email": "someone@example.com",  # placeholder address
        "api_key": API_KEY
    }
)

result = response.json()["data"]
print(f"Result: {result['result']}")
print(f"Score: {result['score']}/100")
print(f"Format: {result.get('format', 'N/A')}")
print(f"SMTP: {result.get('smtp_server', 'N/A')}")

Possible results: deliverable, risky, undeliverable, unknown.
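
In practice you will want to turn each result into an action. Here is one possible mapping, as a sketch: the `triage` function, the three action names, and the score threshold of 70 are assumptions to tune against your own bounce data, not Hunter recommendations.

```python
def triage(result: str, score: int, min_score: int = 70) -> str:
    """Map a verifier result to an action: 'send', 'review', or 'drop'."""
    if result == "undeliverable":
        return "drop"
    if result == "deliverable" and score >= min_score:
        return "send"
    # risky, unknown, or low-score deliverable addresses need a second look
    return "review"

for result, score in [("deliverable", 95), ("risky", 80), ("undeliverable", 10)]:
    print(result, score, "->", triage(result, score))
```

Sending only to the `send` bucket and holding back `review` contacts is a conservative way to stay under the 2% bounce threshold mentioned earlier.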

Company Enrichment: Enrich a Company

response = requests.get(
    # Company Enrichment endpoint; confirm the path in Hunter's current API reference
    "https://api.hunter.io/v2/companies/find",
    params={
        "domain": "stripe.com",
        "api_key": API_KEY
    }
)

company = response.json()["data"]
print(f"Name: {company['name']}")
print(f"Industry: {company.get('industry', 'N/A')}")
print(f"Employees: {company.get('employees', 'N/A')}")
print(f"Revenue: {company.get('annual_revenue', 'N/A')}")

API Costs

The Hunter.io API is included in all paid plans (starting at $34/month, pricing as of May 2026 — check hunter.io). The free plan gives 50 searches/month but does not include API access.

⚠️ Note: each API call consumes credits. Domain Search = 1 credit, Email Verifier = 1 credit, Company Enrichment = 1 credit.
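
Because every call costs a credit, it is worth estimating consumption before launching a batch. The sketch below uses the one-credit-per-call figures from the note above; the monthly quota of 500 is a hypothetical value, so substitute the quota of your own plan.

```python
import math

def credits_needed(domains: int, emails: int, companies: int = 0) -> int:
    """One credit per Domain Search, Email Verifier, or Company Enrichment call."""
    return domains + emails + companies

def months_to_process(total_credits: int, monthly_quota: int) -> int:
    """Number of billing months needed to run the whole batch."""
    return math.ceil(total_credits / monthly_quota)

total = credits_needed(domains=100, emails=2000)
print(total, "credits,", months_to_process(total, monthly_quota=500), "months")
```

This is also a practical argument for deduplicating before verification: each duplicate removed is one credit saved.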


4. Phantombuster API: Automated Extraction

Phantombuster offers a REST API to launch, monitor, and retrieve results from your Phantoms (automation scripts).

Get an API Key

# Dashboard → Settings → API
echo "PHANTOMBUSTER_API_KEY=***" >> .env

Launch a Phantom

import requests, time

API_KEY="***"
PHANTOM_ID = "your_phantom_id"  # Phantom ID in Phantombuster

# Launch the Phantom
response = requests.post(
    # Phantoms are "agents" in API v2, so launching goes through agents/launch
    "https://api.phantombuster.com/api/v2/agents/launch",
    headers={"X-Phantombuster-Key": API_KEY},
    json={
        "id": PHANTOM_ID,
        "argument": {
            "searchUrl": "https://www.linkedin.com/search/results/people/?keywords=CTO%20SaaS%20France",
            "numberOfPages": 3
        }
    }
)

container_id = response.json()["containerId"]
print(f"Phantom launched: container {container_id}")

# Wait for completion (poll every 30 seconds)
while True:
    status = requests.get(
        # The container ID is passed as a query parameter
        "https://api.phantombuster.com/api/v2/containers/fetch",
        params={"id": container_id},
        headers={"X-Phantombuster-Key": API_KEY}
    ).json()

    if status.get("status") == "completed":
        print("Scraping complete!")
        result_url = status["resultObject"]["output"]
        print(f"Results: {result_url}")
        break
    elif status.get("status") == "failed":
        print(f"Error: {status.get('error', 'unknown')}")
        break

    print("In progress... waiting 30s")
    time.sleep(30)
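
The loop above waits forever if the Phantom never reaches a final state. A minimal sketch of the same polling idea with a timeout follows. `poll_until_done` is a hypothetical helper, and the terminal status names are a parameter whose default mirrors the loop above; adjust them to whatever the API actually returns for your Phantom.

```python
import time
from typing import Callable

def poll_until_done(fetch_status: Callable[[], dict],
                    interval: float = 30.0,
                    timeout: float = 1800.0,
                    terminal: tuple = ("completed", "failed"),
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> dict:
    """Call fetch_status until it reports a terminal status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"no terminal status after {timeout} seconds")
        sleep(interval)
```

Passing the HTTP call in as `fetch_status` (for example a lambda wrapping the `requests.get(...).json()` call) keeps the waiting logic separate from the API details and makes it easy to test without network access.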

Retrieve Results

import json

# Results are in JSON or CSV
results = requests.get(result_url).json()

# Typical LinkedIn Profile Scraper result
for profile in results[:5]:
    print(f"Name: {profile.get('firstName', 'N/A')} {profile.get('lastName', 'N/A')}")
    print(f"Title: {profile.get('jobTitle', 'N/A')}")
    print(f"Company: {profile.get('companyName', 'N/A')}")
    print(f"LinkedIn: {profile.get('profileUrl', 'N/A')}")
    print("---")

Chained Workflow

Phantombuster lets you chain multiple Phantoms via the API:

Phantom 1: LinkedIn Search Export (extraction)
    ↓ (output → input)
Phantom 2: LinkedIn Profile Scraper (enrichment)
    ↓ (output → input)
Phantom 3: AI Enricher (AI enrichment)
    ↓ (output → input)
Phantom 4: Export to Google Sheets

Each Phantom launches automatically when the previous one finishes.


5. Clean & Normalize with Python

Complete Cleaning Script

import pandas as pd
import re

# 1. Load data from each source
prospeo = pd.read_csv('prospeo_export.csv')
hunter = pd.read_csv('hunter_export.csv')
phantombuster = pd.read_json('phantombuster_export.json')  # JSON export, so read_json rather than read_csv

# 2. Standardize columns
column_mapping = {
    'first_name': 'firstName',
    'prenom': 'firstName',
    'last_name': 'lastName',
    'nom': 'lastName',
    'email_address': 'email',
    'Email': 'email',
    'company_name': 'company',
    'Entreprise': 'company',
}

for df in [prospeo, hunter, phantombuster]:
    df.rename(columns=column_mapping, inplace=True)

# 3. Merge
all_data = pd.concat([prospeo, hunter, phantombuster], ignore_index=True)

# 4. Deduplication
all_data = all_data[all_data['email'].notna()].copy()  # Remove empty emails
# Normalize the email first so "Jean@X.com " and "jean@x.com" count as one contact
all_data['email'] = all_data['email'].str.lower().str.strip()
all_data = all_data.drop_duplicates(subset='email')

# 5. Normalization
all_data['firstName'] = all_data['firstName'].str.title().str.strip()
all_data['lastName'] = all_data['lastName'].str.title().str.strip()
all_data['company'] = all_data['company'].str.title().str.strip()

# 6. Basic email validation
email_regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
all_data['email_valid'] = all_data['email'].str.match(email_regex)

# 7. Remove invalid ones
all_data = all_data[all_data['email_valid']].drop(columns=['email_valid'])

# 8. Export
all_data.to_csv('leads_clean.csv', index=False)
print(f"Clean leads: {len(all_data)} (out of {len(prospeo) + len(hunter) + len(phantombuster)} raw)")

Typical Result

Clean leads: 847 (out of 1,250 raw)
  - 283 duplicates removed
  - 89 invalid emails removed
  - 31 empty emails removed
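
The script above only prints the final total. To get a per-reason breakdown like this one, each rejection has to be counted as it happens. Below is a standard-library sketch of that idea, working on a list of dictionaries rather than a DataFrame; `clean_with_report` is a hypothetical helper that reuses the email regex from the script.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def clean_with_report(rows: list) -> tuple:
    """Return (kept_rows, report), where report counts each removal reason."""
    report, seen, kept = Counter(), set(), []
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        if not email:
            report["empty"] += 1
        elif email in seen:
            report["duplicate"] += 1
        elif not EMAIL_RE.match(email):
            report["invalid"] += 1
        else:
            seen.add(email)
            kept.append({**row, "email": email})
    return kept, report

rows = [{"email": "Jean@X.com "}, {"email": "jean@x.com"},
        {"email": "jean.dupont@gmail"}, {"email": ""}]
kept, report = clean_with_report(rows)
print(f"Clean leads: {len(kept)} (out of {len(rows)} raw)")
for reason, count in report.items():
    print(f"  - {count} {reason} removed")
```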

6. Enrich with AI

Once the data is clean, AI adds value:

Batch Enrichment

import json
import math
import time

import pandas as pd
from openai import OpenAI

API_KEY = "***"
client = OpenAI(api_key=API_KEY)  # client interface of openai>=1.0

leads = pd.read_csv('leads_clean.csv')

def enrich_lead(first_name, job_title, company):
    prompt = f"""Analyze this B2B lead and return a JSON with:
    - "icp_score" (1-10): ICP fit score
    - "intent_signal" (str): detected intent signal or "none"
    - "personalization_hook" (str): a personalization angle in 1 sentence

    Lead: {first_name}, {job_title}, {company}
    ICP: SMB SaaS, 20-200 employees, France"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

# Process in batches of 50
batch_size = 50
total_batches = math.ceil(len(leads) / batch_size)
for i in range(0, len(leads), batch_size):
    batch = leads.iloc[i:i+batch_size]
    for idx, row in batch.iterrows():
        try:
            enrichment = enrich_lead(
                row['firstName'], row.get('jobTitle', ''), row['company']
            )
            leads.at[idx, 'icp_score'] = enrichment['icp_score']
            leads.at[idx, 'intent_signal'] = enrichment['intent_signal']
            leads.at[idx, 'personalization_hook'] = enrichment['personalization_hook']
        except Exception as e:
            print(f"Error: {row['email']} - {e}")

    print(f"Batch {i//batch_size + 1}/{total_batches} processed")
    # Pause to respect rate limits
    time.sleep(5)

leads.to_csv('leads_enriched.csv', index=False)
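
Model output is not guaranteed to contain every key or to keep the score in range, so it is safer to validate each response before writing it to the file. The following is a sketch of such a check, assuming the three keys requested in the prompt; `parse_enrichment` is a hypothetical helper, not part of the OpenAI library.

```python
import json

REQUIRED_KEYS = ("icp_score", "intent_signal", "personalization_hook")

def parse_enrichment(raw: str) -> dict:
    """Parse the model's JSON answer and reject incomplete or out-of-range data."""
    data = json.loads(raw)  # invalid JSON raises a ValueError subclass
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    score = int(data["icp_score"])
    if not 1 <= score <= 10:
        raise ValueError(f"icp_score out of range: {score}")
    return {
        "icp_score": score,
        "intent_signal": str(data["intent_signal"]),
        "personalization_hook": str(data["personalization_hook"]),
    }
```

Calling this in place of the bare `json.loads` means a malformed answer lands in the existing `except` branch instead of silently polluting the enriched file.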

Enrichment Cost

With GPT-4o-mini (~$0.15 / 1M input tokens, pricing as of May 2026 — check openai.com/pricing):
- 1,000 leads: about $0.30-$0.50
- 10,000 leads: about $3-$5

Negligible cost compared to the value of enriched leads.
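
To check these figures against your own prompts, the arithmetic can be written out as below. The token counts per lead are placeholders, and the output-token price of $0.60 per million is an assumption not stated above, so replace both with measured values and current pricing.

```python
def estimate_cost(leads: int,
                  input_tokens_per_lead: int,
                  output_tokens_per_lead: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Estimated cost in dollars for enriching `leads` contacts."""
    per_lead = (input_tokens_per_lead * input_price_per_million
                + output_tokens_per_lead * output_price_per_million) / 1_000_000
    return leads * per_lead

# Placeholder token counts; measure a few real calls before relying on them
cost = estimate_cost(1000, input_tokens_per_lead=200, output_tokens_per_lead=100,
                     input_price_per_million=0.15, output_price_per_million=0.60)
print(f"${cost:.2f} per 1,000 leads")
```

The cost scales linearly with both the number of leads and the prompt length, so trimming the prompt is the simplest lever if the bill grows.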


7. Automate the Full Pipeline

Final Architecture

PHANTOMBUSTER API          HUNTER API           PROSPEO API
        │                       │                    │
        ▼                       ▼                    ▼
raw_linkedin.json        raw_hunter.csv       raw_prospeo.csv
        │                       │                    │
        └───────────────────────┼────────────────────┘
                                ▼
                         [CLEAN SCRIPT]
                                ▼
                         leads_clean.csv
                                ▼
                      [HUNTER API VERIFIER]
                                ▼
                       leads_verified.csv
                                ▼
                  [AI ENRICHMENT - GPT-4o-mini]
                                ▼
                       leads_enriched.csv
                                ▼
                     [SEGMENTATION - Python]
                                │
                    ┌───────────┼───────────┐
                    ▼           ▼           ▼
                 hot.csv    warm.csv    cold.csv
                    ▼           ▼           ▼
                  [CRM]     [Nurture]    [Later]
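
The segmentation step in this diagram can be as simple as thresholds on the ICP score. Here is a minimal sketch, assuming each lead carries the `icp_score` produced by the AI enrichment; the cut-offs (8 and above for hot, 5 and above for warm) and the function names are assumptions to adapt to your own funnel.

```python
def segment(icp_score: int, hot_min: int = 8, warm_min: int = 5) -> str:
    """Classify a lead as 'hot', 'warm', or 'cold' from its ICP score."""
    if icp_score >= hot_min:
        return "hot"
    if icp_score >= warm_min:
        return "warm"
    return "cold"

def split_by_segment(leads: list) -> dict:
    """Group leads into the three buckets that feed hot.csv, warm.csv, cold.csv."""
    buckets = {"hot": [], "warm": [], "cold": []}
    for lead in leads:
        # Scores read back from CSV may be strings or floats
        score = int(float(lead["icp_score"]))
        buckets[segment(score)].append(lead)
    return buckets

leads = [{"email": "a@x.com", "icp_score": 9},
         {"email": "b@y.com", "icp_score": "6"},
         {"email": "c@z.com", "icp_score": 2}]
for name, bucket in split_by_segment(leads).items():
    print(name, [lead["email"] for lead in bucket])
```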

Automation Script with Scheduling

"""
pipeline.py - Run the full cleaning pipeline.
Schedule via crontab or systemd timer.
"""
import subprocess, datetime

def run_step(name, command):
    print(f"\n{'='*50}")
    print(f"{datetime.datetime.now()} - {name}")
    print(f"{'='*50}")
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"ERROR: {result.stderr}")
        return False
    print(result.stdout)
    return True

# Pipeline
steps = [
    ("Phantombuster Extraction", "python3 extract_phantombuster.py"),
    ("Hunter Extraction", "python3 extract_hunter.py"),
    ("Prospeo Extraction", "python3 extract_prospeo.py"),
    ("Cleaning", "python3 clean_data.py"),
    ("Email Verification", "python3 verify_emails.py"),
    ("AI Enrichment", "python3 enrich_with_ai.py"),
    ("Segmentation", "python3 segment_leads.py"),
    ("CRM Export", "python3 export_to_crm.py"),
]

for name, command in steps:
    if not run_step(name, command):
        print(f"Pipeline stopped at step: {name}")
        break
else:
    print("\n✅ Pipeline completed successfully!")

Scheduling

# Run pipeline every Monday at 6 AM
crontab -e
# Add:
0 6 * * 1 cd /path/to/pipeline && python3 pipeline.py >> pipeline.log 2>&1

Result: every Monday morning, your lead file is clean, verified, enriched, and segmented — with zero manual intervention.



✅ Conclusion

Dirty data is money wasted. A bounce rate above 2% destroys your email reputation, duplicates skew your KPIs, and unnormalized data makes any serious automation impossible.

The 5-step pipeline:
1. Extraction: Phantombuster + Hunter.io + Prospeo.io
2. Cleaning: deduplication + pandas normalization
3. Verification: Hunter.io Email Verifier API
4. AI Enrichment: ICP scoring + intent signals via GPT-4o-mini
5. Segmentation: hot / warm / cold → CRM or nurture

Estimated cost: under a dollar to enrich 1,000 leads via AI (about $0.30-$0.50 with GPT-4o-mini), negligible compared to the value of a clean pipeline.

Next step: your data is clean and enriched. Now, use your ICP to identify the best prospects, then reach out with personalized messages.


One verified and enriched lead is worth 10 raw leads. Invest in quality, not volume.