Smart scraping: enrich your data with Hunter, Phantombuster and AI

Automation 🔴 Advanced ⏱️ 10 min read 📅 2026-05-05

🔄 Introduction: from raw file to clean pipeline

You exported 2,000 contacts from LinkedIn, Hunter, and Phantombuster. The problem? The data is in 3 different formats, the columns don't match, there are duplicates everywhere, and 30% of the emails are probably invalid.

Without cleaning, this file is worthless. A dirty lead file costs more than it earns: bounced emails, time wasted on bad contacts, and a polluted CRM.

This guide shows you how to use the Hunter.io API, the Phantombuster API, and AI to transform raw data into a clean, enriched, and actionable file. To understand the basics of data extraction, check out our guide on AI smart scraping.

📋 The essentials

A cleaning pipeline follows 6 steps: extraction, deduplication, normalization, verification, enrichment, and export.
The Hunter.io API allows you to verify email deliverability and enrich company data (industry, headcount, revenue).
The Phantombuster API is used to automatically extract LinkedIn profiles and chain multiple automation scripts.
AI enrichment (GPT-4o-mini) costs about $0.30 to $0.50 per 1,000 leads and adds ICP scoring as well as intent signals.
Full automation via crontab allows you to get a clean, verified, and segmented file every week without manual intervention.

1. The problem with raw data

What can go wrong

When you scrape data from multiple sources, here are the common problems:

Problem	Example	Impact
Duplicates	Same contact 3 times	Inflates volume, spam
Invalid emails	`jean.dupont@gmail` (missing extension)	Bounce, destroyed email reputation
Inconsistent formats	`Jean`, `jean`, `M. Jean`	Impossible to match
Missing data	No job title, no company	Hard to personalize
Outdated data	Contact changed jobs	Irrelevant message
Broken encoding	`Jean Dupont \xa0 Paris`	CSV errors, display issues

The real cost of a dirty file

Bounce rate > 2%: your email domain risks being blacklisted
50% duplicates: you think you have 1,000 leads, you actually have 500
Unnormalized data: impossible to do scoring or segmentation
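
To see how dirty a file is before you clean it, a quick audit of the email column is usually enough. The sketch below is a minimal example: the `audit_emails` helper and the deliberately simple syntax pattern are illustrative assumptions, not a full RFC-compliant validator.

```python
import re

# Simple syntax check (something@domain.tld); an assumption, not full RFC 5322.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$")

def audit_emails(emails):
    """Return basic quality metrics for a list of raw email strings."""
    # Trim and lowercase so case or whitespace variants count as duplicates.
    cleaned = [e.strip().lower() for e in emails if e and e.strip()]
    unique = set(cleaned)
    invalid = [e for e in unique if not EMAIL_RE.match(e)]
    return {
        "total_rows": len(emails),
        "empty": len(emails) - len(cleaned),
        "unique": len(unique),
        "duplicate_rate": 1 - len(unique) / len(cleaned) if cleaned else 0.0,
        "invalid_syntax": len(invalid),
    }

sample = [
    "jean.dupont@gmail",        # missing top-level domain
    "Jean.Dupont@acme.com",
    "jean.dupont@acme.com ",    # duplicate once trimmed and lowercased
    "",
    "ana@acme.com",
]
report = audit_emails(sample)
```

On this sample, one of the four non-empty rows is a duplicate and one address fails the syntax check, which is the kind of signal that tells you the file needs cleaning before any outreach.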

💡 Rule: invest 30% of your prospecting time in data cleaning. The remaining 70% (outreach) will be 3x more effective.

2. Architecture of a cleaning pipeline

A robust pipeline follows these steps in order:

EXTRACTION → DEDUPLICATION → NORMALIZATION → VERIFICATION → ENRICHMENT → EXPORT

Each step depends on the previous one. No enrichment before normalization, no verification before deduplication.
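
One way to enforce that order in code is to hold the step names in a fixed list and refuse to run if a step is missing. This is a minimal sketch: `PIPELINE_ORDER` and `run_pipeline` are hypothetical names, and each step is assumed to be a function that takes the data and returns the transformed data.

```python
PIPELINE_ORDER = [
    "extraction",
    "deduplication",
    "normalization",
    "verification",
    "enrichment",
    "export",
]

def run_pipeline(steps, data=None):
    """Run the steps in the fixed order, whatever order they were registered in.

    `steps` maps a step name to a callable taking and returning the data.
    """
    missing = [name for name in PIPELINE_ORDER if name not in steps]
    if missing:
        raise ValueError(f"missing pipeline steps: {missing}")
    for name in PIPELINE_ORDER:
        data = steps[name](data)
    return data

# Demo with placeholder steps that only record the order they ran in.
trace = []

def make_step(name):
    def step(data):
        trace.append(name)
        return data
    return step

# Registered in reverse on purpose: the runner still applies the right order.
run_pipeline({name: make_step(name) for name in reversed(PIPELINE_ORDER)}, data=[])
```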

Tools by step

Step	Tool	Automatable?
Extraction	Phantombuster, Apify, Prospeo	✅ API + scheduling
Deduplication	Python + pandas	✅ Script
Normalization	Python + regex	✅ Script
Email verification	Hunter.io, Prospeo	✅ API
AI enrichment	OpenAI (GPT-4o-mini), Claude API	✅ API
Export	CSV, Google Sheets, CRM	✅ Script

3. Hunter.io API: enrichment and verification

Hunter.io exposes a comprehensive REST API to automate email searching, verification, and enrichment.

Getting an API key

To get started, create an account on Hunter.io, then go to Dashboard > API > Create API Key. Store this key in a .env file at the root of your project to secure your access.
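
Reading the key back from the .env file can be sketched with the standard library alone. The `load_env_file` and `get_api_key` helpers and the `HUNTER_API_KEY` variable name are assumptions for illustration; many projects use the python-dotenv package for this instead.

```python
import os

def load_env_file(path=".env"):
    """Minimal KEY=VALUE parser: skips blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def get_api_key(name, env_path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    if name in os.environ:
        return os.environ[name]
    if os.path.exists(env_path):
        value = load_env_file(env_path).get(name)
        if value:
            return value
    raise RuntimeError(f"{name} is not set; add it to {env_path}")
```

A call such as `get_api_key("HUNTER_API_KEY")` then keeps the key out of your source code. Remember to add .env to your .gitignore.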

Domain Search: finding a company's emails

The Domain Search endpoint lists the public email addresses of a given domain. It returns the first name, last name, and job title for each email, making it easy to filter by role. A single API call is enough to retrieve up to 50 emails associated with a company like Stripe.
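
A Domain Search call could look like the sketch below. The `/v2/domain-search` path and the `value`, `first_name`, `last_name`, and `position` response fields reflect Hunter's public API as we understand it, so treat them as assumptions and confirm them against the official reference. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

HUNTER_BASE = "https://api.hunter.io/v2"

def domain_search_url(domain, api_key):
    """Build the Domain Search URL for one company domain."""
    query = urllib.parse.urlencode({"domain": domain, "api_key": api_key})
    return f"{HUNTER_BASE}/domain-search?{query}"

def extract_contacts(payload, wanted_roles=()):
    """Flatten a Domain Search response, optionally filtering by job title keyword."""
    contacts = []
    for item in payload.get("data", {}).get("emails", []):
        position = item.get("position") or ""
        if wanted_roles and not any(r.lower() in position.lower() for r in wanted_roles):
            continue
        contacts.append({
            "email": item.get("value"),
            "firstName": item.get("first_name"),
            "lastName": item.get("last_name"),
            "jobTitle": position,
        })
    return contacts

def domain_search(domain, api_key, wanted_roles=()):
    """Perform the network call (consumes API credits)."""
    with urllib.request.urlopen(domain_search_url(domain, api_key), timeout=30) as resp:
        return extract_contacts(json.load(resp), wanted_roles)
```

For example, `domain_search("stripe.com", key, wanted_roles=["sales"])` would keep only contacts whose job title mentions sales.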

Email Verifier: verifying an email

The Email Verifier analyzes an email address based on several criteria: syntax format, domain existence, and SMTP verification. It returns a deliverability score out of 100 and a status among deliverable, risky, undeliverable, or unknown. This tool is essential for eliminating emails that would destroy your sender reputation.
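
Once you have the verifier's answer, you still need a rule for what to do with it. The sketch below uses the status names described above and an 80-point threshold. Both the threshold and the `status` / `score` field names are assumptions to check against Hunter's API reference, which may use different labels.

```python
def classify_verification(payload, min_score=80):
    """Turn a verifier response into a keep / review / drop decision."""
    data = payload.get("data", {})
    status = data.get("status", "unknown")
    score = data.get("score") or 0
    if status == "undeliverable":
        return "drop"
    if status == "deliverable" and score >= min_score:
        return "keep"
    # Risky, unknown, or low-score addresses go to manual review.
    return "review"

decisions = [
    classify_verification({"data": {"status": "deliverable", "score": 96}}),
    classify_verification({"data": {"status": "risky", "score": 60}}),
    classify_verification({"data": {"status": "undeliverable", "score": 5}}),
]
```

Sending only the "keep" bucket is the conservative choice if you want to stay under the 2% bounce threshold.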

Company Enrichment: enriching a company

The Company Enrichment endpoint takes a domain as input and returns structured information about the company: name, industry, number of employees, and estimated annual revenue. This data allows you to quickly qualify a lead according to your ICP (ideal customer profile).
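
Qualifying against your ICP then comes down to a simple filter. In this sketch the `industry` and `employees` keys are hypothetical placeholders for whatever the enrichment response actually contains, and the thresholds are examples only.

```python
def matches_icp(company, industries, min_employees=50, max_employees=1000):
    """Return True when the company fits the target industry and size range."""
    industry = (company.get("industry") or "").lower()
    employees = company.get("employees")
    if industry not in {i.lower() for i in industries}:
        return False
    # Missing or non-numeric headcount is treated as "not qualified".
    return isinstance(employees, int) and min_employees <= employees <= max_employees

target_industries = ["Fintech", "SaaS"]
fits = matches_icp({"industry": "fintech", "employees": 320}, target_industries)
too_small = matches_icp({"industry": "SaaS", "employees": 12}, target_industries)
```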

API costs

The Hunter.io API is included in all paid plans (starting at $34/month, price as of May 2026 — check on hunter.io). The free plan gives 50 searches/month but does not include the API.

⚠️ Warning: each API call consumes credits. Domain Search = 1 credit, Email Verifier = 1 credit, Company Enrichment = 1 credit.
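
Because every call is billed, it helps to estimate credits before launching a run. The sketch below uses the one-credit-per-call figures from the warning above. The 2,000 raw rows and 30% duplicate rate are illustrative numbers, not measurements.

```python
CREDITS_PER_CALL = {
    "domain_search": 1,
    "email_verifier": 1,
    "company_enrichment": 1,
}

def estimate_credits(emails_to_verify, domains_to_search=0, companies_to_enrich=0):
    """Total credits for one pipeline run."""
    return (
        emails_to_verify * CREDITS_PER_CALL["email_verifier"]
        + domains_to_search * CREDITS_PER_CALL["domain_search"]
        + companies_to_enrich * CREDITS_PER_CALL["company_enrichment"]
    )

raw_rows = 2000
unique_rows = 1400  # assuming 30% duplicates are removed first
wasted = estimate_credits(raw_rows) - estimate_credits(unique_rows)
```

Here, verifying before deduplicating would burn 600 credits on addresses you already checked.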

4. Phantombuster API: automated extraction

Phantombuster offers a REST API to launch, monitor, and retrieve the results of your Phantoms (automation scripts).

Getting an API key

Go to the Phantombuster Dashboard, Settings > API section, to generate your key. As with Hunter, store it in your .env file.

Launching a Phantom

To launch a Phantom via the API, you send a POST request to the launch endpoint with the Phantom ID and the required arguments (for example, the LinkedIn search URL and the number of pages to scrape). The API returns a containerId that allows you to track the progress of the task. A polling system (checking every 30 seconds) is necessary to wait for the status to change to completed.
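
The launch-then-poll pattern can be sketched as follows. The `/api/v2/agents/launch` path, the `X-Phantombuster-Key` header, and the terminal status names are assumptions based on Phantombuster's public API and should be confirmed in its reference. `fetch_status` is a hypothetical callable you supply that returns the current status string for a container.

```python
import json
import time
import urllib.request

PB_BASE = "https://api.phantombuster.com/api/v2"

def launch_request(api_key, phantom_id, argument):
    """Build (but do not send) the POST request that launches a Phantom."""
    body = json.dumps({"id": phantom_id, "argument": argument}).encode("utf-8")
    return urllib.request.Request(
        f"{PB_BASE}/agents/launch",
        data=body,
        method="POST",
        headers={"X-Phantombuster-Key": api_key, "Content-Type": "application/json"},
    )

def wait_until_done(fetch_status, container_id, interval=30, timeout=1800,
                    done=("finished", "completed"), sleep=time.sleep):
    """Poll every `interval` seconds until the container reaches a terminal status."""
    waited = 0
    while True:
        status = fetch_status(container_id)
        if status in done:
            return status
        if waited >= timeout:
            raise TimeoutError(f"container {container_id} still '{status}' after {waited}s")
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter keeps the loop easy to test, and the timeout prevents a stuck Phantom from blocking the whole pipeline.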

Retrieving the results

Once the Phantom is finished, the API provides a URL pointing to a JSON or CSV file containing the extracted data. For a LinkedIn Profile Scraper, each entry typically includes the first name, last name, job title, company, and LinkedIn profile URL. Simply parse this file to integrate it into your pipeline.
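
Parsing the result file could look like this minimal sketch. The key names (`firstName`, `lastName`, `jobTitle`, `company`, `profileUrl`) and the `error` marker are assumptions, since the exact output depends on the Phantom you run. Adapt them to your own export.

```python
import json

FIELDS = ("firstName", "lastName", "jobTitle", "company", "profileUrl")

def parse_profiles(raw_json):
    """Convert a Phantom's JSON output into uniform lead dictionaries."""
    leads = []
    for entry in json.loads(raw_json):
        # Skip rows flagged as failed or without a profile URL to key on.
        if entry.get("error") or not entry.get("profileUrl"):
            continue
        leads.append({field: entry.get(field) for field in FIELDS})
    return leads

raw = json.dumps([
    {"firstName": "Ana", "lastName": "Silva", "jobTitle": "CTO",
     "company": "Acme", "profileUrl": "https://www.linkedin.com/in/ana-silva"},
    {"error": "Profile unavailable"},
])
profiles = parse_profiles(raw)
```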

Chained workflow

Phantombuster allows you to chain multiple Phantoms via the API to create a continuous extraction and enrichment workflow. The first Phantom extracts profiles from a LinkedIn search and passes the result to a second Phantom, which retrieves the details of each profile. A third Phantom enriches the data with AI, and a final Phantom exports everything to Google Sheets. Each step launches automatically when the previous one is finished.

5. Cleaning and normalizing with Python

Complete cleaning script

The cleaning process takes place in several successive steps using the pandas library. First, we load the exports from each source (Prospeo, Hunter, Phantombuster) and standardize the column names (prenom → firstName, Email → email, etc.) via a mapping dictionary. Next, we merge the three DataFrames, remove duplicates based on the email address, and drop rows without an email. Normalization consists of lowercasing emails, title-casing first and last names, and validating email formats with a regular expression. The final result is exported to CSV. On a typical file of 1,250 raw contacts, this script eliminates about 283 duplicates, 89 invalid emails, and 31 empty emails, yielding 847 clean leads.
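
The steps described above can be sketched with pandas as follows. The `nom` → `lastName` mapping and the email pattern are assumptions added for illustration. This version also lowercases emails before deduplicating so that case variants of the same address collapse into one row, a small deviation from the order described above.

```python
import pandas as pd

# Source column names mapped to the pipeline's standard names.
COLUMN_MAP = {"prenom": "firstName", "nom": "lastName", "Email": "email"}
# Simple syntax check (something@domain.tld), not a full RFC validator.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$"

def clean_leads(frames):
    """Merge raw exports, then deduplicate, normalize and validate them."""
    merged = pd.concat(
        [frame.rename(columns=COLUMN_MAP) for frame in frames],
        ignore_index=True,
    )
    # Drop rows without an email, then trim and lowercase the rest.
    merged = merged.dropna(subset=["email"]).copy()
    merged["email"] = merged["email"].astype(str).str.strip().str.lower()
    merged = merged[merged["email"] != ""]
    # One row per address, keeping the first occurrence.
    merged = merged.drop_duplicates(subset="email", keep="first")
    # Keep only syntactically valid addresses.
    merged = merged[merged["email"].str.match(EMAIL_PATTERN)].copy()
    # Title-case names: "DUPONT" becomes "Dupont".
    for col in ("firstName", "lastName"):
        if col in merged.columns:
            merged[col] = merged[col].astype("string").str.strip().str.title()
    return merged.reset_index(drop=True)
```

You would typically load each export with `pd.read_csv`, pass the list of DataFrames to `clean_leads`, and finish with `to_csv("leads_clean.csv", index=False)`.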

6. Enriching with AI

Once the data is clean, AI adds value:

Batch enrichment

Batch enrichment involves sending each lead to a model like GPT-4o-mini with a structured prompt requesting three pieces of information in JSON: an icp_score (1-10) measuring the match with your ideal customer, an intent_signal detecting a potential buying signal, and a personalization_hook generating a one-sentence personalization angle. Leads are processed in batches of 50 with a 5-second pause between each batch to respect the API rate limits. The results are added as new columns to the DataFrame before final export. If you want to go further in using AI to create content from these enriched data, discover how to automatically generate content with AI.
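
The batching logic can be sketched independently of any particular SDK. Here `call_model` is a hypothetical function you provide: it takes a prompt string and returns the model's raw text reply (for example, a thin wrapper around your OpenAI client). The prompt wording is only an example.

```python
import json
import time

EMPTY = {"icp_score": None, "intent_signal": None, "personalization_hook": None}

def build_prompt(lead):
    """Ask for the three fields as strict JSON."""
    return (
        "You qualify B2B leads. Reply with JSON only, using the keys "
        "icp_score (integer 1-10), intent_signal (string) and "
        "personalization_hook (one sentence).\n"
        f"Lead: {json.dumps(lead, ensure_ascii=False)}"
    )

def parse_enrichment(raw):
    """Validate the reply and clamp the score to the 1-10 range."""
    data = json.loads(raw)
    return {
        "icp_score": max(1, min(10, int(data["icp_score"]))),
        "intent_signal": str(data["intent_signal"]),
        "personalization_hook": str(data["personalization_hook"]),
    }

def enrich_in_batches(leads, call_model, batch_size=50, pause=5, sleep=time.sleep):
    """Enrich leads in batches, pausing between batches to respect rate limits."""
    enriched = []
    for start in range(0, len(leads), batch_size):
        for lead in leads[start:start + batch_size]:
            try:
                extra = parse_enrichment(call_model(build_prompt(lead)))
            except (ValueError, KeyError, TypeError):
                extra = dict(EMPTY)  # malformed reply: keep the lead, leave fields empty
            enriched.append({**lead, **extra})
        if start + batch_size < len(leads):
            sleep(pause)
    return enriched
```

Falling back to empty fields on a malformed reply keeps one bad response from aborting a run of several thousand leads.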

Enrichment cost

With GPT-4o-mini (~$0.15 / 1M input tokens, price as of May 2026 — check on openai.com/pricing):
- 1,000 leads: about $0.30-0.50
- 10,000 leads: about $3-5

A negligible cost compared to the value of the enriched leads.
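
To check these figures against your own prompts, you can run the arithmetic yourself. In this sketch the output price ($0.60 / 1M tokens) and the token counts per lead are assumptions. Only the input price comes from the figure quoted above.

```python
INPUT_PRICE_PER_M = 0.15    # USD per 1M input tokens (figure quoted above)
OUTPUT_PRICE_PER_M = 0.60   # USD per 1M output tokens (assumption, check pricing page)

def enrichment_cost(n_leads, input_tokens_per_lead=1000, output_tokens_per_lead=300):
    """Estimated cost in USD for enriching n_leads."""
    input_cost = n_leads * input_tokens_per_lead / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = n_leads * output_tokens_per_lead / 1_000_000 * OUTPUT_PRICE_PER_M
    return round(input_cost + output_cost, 2)

cost_1k = enrichment_cost(1_000)     # 0.15 + 0.18 = 0.33
cost_10k = enrichment_cost(10_000)   # 1.50 + 1.80 = 3.30
```

With roughly 1,000 input tokens and 300 output tokens per lead, both results land inside the ranges given above.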

7. Automate the complete pipeline

Final architecture

The architecture of the final pipeline revolves around three data sources (Phantombuster API, Hunter API, and Prospeo API), each feeding a raw file. These three files are then consolidated by a Python cleaning script that generates a leads_clean.csv file. This clean file goes through the Hunter API for email verification, and then through AI enrichment via GPT-4o-mini. Finally, a Python segmentation script divides the enriched leads into three separate files (hot, warm, cold) which are respectively sent to the CRM, a nurture sequence, or put on hold.
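
The segmentation step can be as simple as thresholds on the ICP score. The cut-offs below (8 and above for hot, 5 to 7 for warm) are illustrative assumptions to tune against your own conversion data.

```python
def segment(lead, hot_min=8, warm_min=5):
    """Assign a lead to hot, warm or cold based on its icp_score."""
    score = lead.get("icp_score")
    if score is None:
        return "cold"  # unscored leads are parked rather than contacted
    if score >= hot_min:
        return "hot"
    if score >= warm_min:
        return "warm"
    return "cold"

def split_segments(leads):
    """Group leads into the three output buckets."""
    buckets = {"hot": [], "warm": [], "cold": []}
    for lead in leads:
        buckets[segment(lead)].append(lead)
    return buckets

buckets = split_segments([
    {"email": "a@x.com", "icp_score": 9},
    {"email": "b@x.com", "icp_score": 6},
    {"email": "c@x.com", "icp_score": 2},
    {"email": "d@x.com", "icp_score": None},
])
```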

Automation script with scheduling

The main automation script (pipeline.py) chains the different steps of the pipeline via subprocess calls to individual Python scripts (Phantombuster extraction, Hunter extraction, cleaning, verification, AI enrichment, segmentation, CRM export). Each step is logged with its timestamp. If a step fails, the pipeline stops and displays the error, preventing corrupted data from propagating to subsequent steps. This script is designed to be launched without human intervention.
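
A minimal sketch of such an orchestrator follows. The `run_steps` helper and the script file names in the usage note are hypothetical. The important parts are the timestamped log lines and the early stop on the first failing step.

```python
import logging
import subprocess
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def run_steps(steps):
    """Run each (name, argv) step in order; stop at the first failure.

    Returns None when every step succeeds, or the name of the failed step.
    """
    for name, argv in steps:
        logging.info("start %s", name)
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            logging.error("%s failed (exit %s): %s",
                          name, result.returncode, result.stderr.strip())
            return name
        logging.info("done %s", name)
    return None
```

In pipeline.py you might call it with steps such as `("cleaning", [sys.executable, "clean_leads.py"])` and exit with a non-zero code when `run_steps` returns a step name, so cron or any other scheduler can detect the failure.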

Scheduling

To run this pipeline automatically, add a crontab line scheduling the launch of pipeline.py every Monday at 6 AM, redirecting the logs to a pipeline.log file. Result: every Monday morning, your lead file is clean, verified, enriched, and segmented — without manual intervention. To go further with this type of automation, check out our guide on Cron + AI: automating smart tasks 24/7.
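
As a reference for the crontab entry, the five scheduling fields are minute, hour, day of month, month, and day of week (1 = Monday). The small helper below just assembles the line. The `/opt/leads` directory and the interpreter path are hypothetical and must match your server.

```python
def cron_line(minute, hour, weekday, command, log_file):
    """Build a weekly crontab entry that appends stdout and stderr to a log file."""
    return f"{minute} {hour} * * {weekday} {command} >> {log_file} 2>&1"

MONDAY_6AM = cron_line(
    0, 6, 1,
    "cd /opt/leads && /usr/bin/python3 pipeline.py",
    "pipeline.log",
)
```

Paste the resulting line into `crontab -e`. The `2>&1` at the end ensures error messages land in the same log as normal output.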

❌ Common mistakes

Verifying before deduplicating: if you verify duplicate emails, you consume API credits unnecessarily. Always deduplicate first.
Enriching non-normalized data: AI will struggle to score a lead if the job title is written in 5 different formats (CTO, cto, C.T.O., Chief Technology Officer, Directeur technique).
Ignoring rate limits: Hunter.io and OpenAI APIs block requests that are too fast. Plan for pauses between batches.
Not logging errors: without logs, a pipeline that crashes in the middle of the night is impossible to debug.
Forgetting the .env file: storing your API keys directly in the source code is a major security risk, especially if the repo is public.
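
For the job title problem mentioned above, a small alias table applied before enrichment goes a long way. This is a minimal sketch: the `TITLE_ALIASES` table is an example you would extend with the variants found in your own data.

```python
import re

# Lowercase, dot-free variants mapped to one canonical label.
TITLE_ALIASES = {
    "cto": "CTO",
    "chief technology officer": "CTO",
    "directeur technique": "CTO",
}

def normalize_title(raw):
    """Map known variants of a job title to a single canonical form."""
    key = re.sub(r"\s+", " ", raw.replace(".", "")).strip().lower()
    # Unknown titles are returned trimmed but otherwise unchanged.
    return TITLE_ALIASES.get(key, raw.strip())

variants = ["CTO", "cto", "C.T.O.", "Chief Technology Officer", "Directeur technique"]
normalized = {normalize_title(v) for v in variants}
```

All five spellings collapse to a single value, so the scoring prompt always sees the same title.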

❓ FAQ

How long does a complete pipeline take for 2,000 leads?
Around 15 to 30 minutes depending on the speed of the Phantombuster API for LinkedIn extraction and the rate limits of the OpenAI API for enrichment.

Can Hunter.io be replaced with another verification tool?
Yes, alternatives like Prospeo, MillionVerifier, or NeverBounce offer equivalent APIs. The pipeline logic remains identical, only the endpoints change.

What to do if the Phantombuster Phantom fails often?
Check that your LinkedIn session is active (Phantoms require a valid session cookie). Renew it regularly via the Phantombuster dashboard.

Is Python absolutely necessary?
No, you can replicate the same logic with Google Sheets + Apps Script for more modest volumes (< 500 leads). Python becomes essential beyond that for managing DataFrames and automated chaining.

🛠️ Recommended tools

Tool	Usage	Link
Hunter.io	Email verification and enrichment	Official website
Phantombuster	Automated LinkedIn extraction	Official website
OpenAI (GPT-4o-mini)	AI lead enrichment	Official website
Python + pandas	Cleaning and normalization	Official website
Prospeo	Email scraping and verification	Official website
Hostinger	Hosting for your automation scripts	Official website

✅ Conclusion

Dirty data is wasted money. A bounce rate > 2% destroys your email reputation, duplicates skew your KPIs, and non-normalized data makes any serious automation impossible.

The pipeline, condensed into 5 stages:
1. Extraction: Phantombuster + Hunter.io + Prospeo
2. Cleaning: deduplication + pandas normalization
3. Verification: Hunter.io Email Verifier API
4. AI Enrichment: ICP scoring + intent signals via GPT-4o-mini
5. Segmentation: hot / warm / cold → CRM or nurture

Estimated cost: well under a dollar to enrich 1,000 leads via AI (about $0.30 to $0.50 with GPT-4o-mini), negligible compared to the value of a clean pipeline.

Next step: your data is clean and enriched. Now, use your ICP to identify the best prospects, then contact them with personalized messages thanks to smart scraping with AI. If your target is international, you can also translate your content automatically with AI to adapt your outreach sequences.

A verified and enriched lead is worth 10 raw leads. Invest in quality, not volume.

#Data Enrichment #Hunter.io #Python #api #automation

