
Smart scraping: enrich your data with Hunter, Phantombuster and AI

Automation 🔴 Advanced ⏱️ 10 min read 📅 2026-05-05

🔄 Introduction: from raw file to clean pipeline

You exported 2,000 contacts from LinkedIn, Hunter, and Phantombuster. The problem? The data is in 3 different formats, the columns don't match, there are duplicates everywhere, and 30% of the emails are probably invalid.

Without cleaning, this file is worthless. A dirty lead file costs more than it earns: bounced emails, time wasted on bad contacts, and a polluted CRM.

This guide shows you how to use the Hunter.io API, the Phantombuster API, and AI to transform raw data into a clean, enriched, and actionable file. To understand the basics of data extraction, check out our guide on AI smart scraping.


📋 The essentials

  • A cleaning pipeline follows 6 steps: extraction, deduplication, normalization, verification, enrichment, and export.
  • The Hunter.io API allows you to verify email deliverability and enrich company data (industry, headcount, revenue).
  • The Phantombuster API is used to automatically extract LinkedIn profiles and chain multiple automation scripts.
  • AI enrichment (GPT-4o-mini) costs about $0.30 to $0.50 per 1,000 leads and adds ICP scoring as well as intent signals.
  • Full automation via crontab allows you to get a clean, verified, and segmented file every week without manual intervention.

1. The problem with raw data

What can go wrong

When you scrape data from multiple sources, here are the common problems:

| Problem | Example | Impact |
| --- | --- | --- |
| Duplicates | Same contact 3 times | Inflates volume, spam |
| Invalid emails | jean.dupont@gmail (missing extension) | Bounce, destroyed email reputation |
| Inconsistent formats | Jean, jean, M. Jean | Impossible to match |
| Missing data | No job title, no company | Hard to personalize |
| Outdated data | Contact changed jobs | Irrelevant message |
| Broken encoding | Jean Dupont \xa0 Paris | CSV errors, display issues |

The real cost of a dirty file

  • Bounce rate > 2%: your email domain risks being blacklisted
  • 50% duplicates: you think you have 1,000 leads, you actually have 500
  • Unnormalized data: impossible to do scoring or segmentation

💡 Rule: invest 30% of your prospecting time in data cleaning. The remaining 70% (outreach) will be 3x more effective.


2. Architecture of a cleaning pipeline

A robust pipeline follows these steps in order:

EXTRACTION → DEDUPLICATION → NORMALIZATION → VERIFICATION → ENRICHMENT → EXPORT

Each step depends on the previous one. No enrichment before normalization, no verification before deduplication.

Tools by step

| Step | Tool | Automatable? |
| --- | --- | --- |
| Extraction | Phantombuster, Apify, Prospeo | ✅ API + scheduling |
| Deduplication | Python + pandas | ✅ Script |
| Normalization | Python + regex | ✅ Script |
| Email verification | Hunter.io, Prospeo | ✅ API |
| AI enrichment | OpenAI (GPT-4o-mini), Claude API | ✅ API |
| Export | CSV, Google Sheets, CRM | ✅ Script |

3. Hunter.io API: enrichment and verification

Hunter.io exposes a comprehensive REST API to automate email searching, verification, and enrichment.

Getting an API key

To get started, create an account on Hunter.io, then go to Dashboard > API > Create API Key. Store this key in a .env file at the root of your project to secure your access.
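
As a minimal sketch of the .env approach, the helper below reads KEY=VALUE lines using only the standard library. The variable names (HUNTER_API_KEY, PHANTOMBUSTER_API_KEY) are hypothetical, and the parser is deliberately simple; the third-party python-dotenv package is a common alternative.

```python
import os
from pathlib import Path


def load_env(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(name: str, env_path: str = ".env") -> str:
    """Look in the process environment first, then in the .env file."""
    value = os.environ.get(name) or load_env(env_path).get(name)
    if not value:
        raise RuntimeError(f"Missing API key: {name}")
    return value
```

Remember to add .env to your .gitignore so the file never reaches the repository.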

Domain Search: finding a company's emails

The Domain Search endpoint lists the public email addresses of a given domain. It returns the first name, last name, and job title for each email, making it easy to filter by role. A single API call is enough to retrieve up to 50 emails associated with a company like Stripe.
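
To illustrate the role filtering, here is a sketch that builds the request URL and filters a response by job title. The base URL, query parameter names, and response field names (data.emails, value, first_name, last_name, position) are assumptions to check against Hunter's API reference, and the sample payload is invented for the example. No network call is made.

```python
from urllib.parse import urlencode

HUNTER_BASE_URL = "https://api.hunter.io/v2"  # assumed; confirm in Hunter's API reference


def domain_search_url(domain: str, api_key: str, limit: int = 10) -> str:
    """Build the GET URL for a Domain Search call (parameter names are assumptions)."""
    query = urlencode({"domain": domain, "limit": limit, "api_key": api_key})
    return f"{HUNTER_BASE_URL}/domain-search?{query}"


def filter_by_role(payload: dict, keywords: list[str]) -> list[dict]:
    """Keep contacts whose job title contains any keyword, case-insensitively."""
    wanted = [k.lower() for k in keywords]
    contacts = []
    for entry in payload.get("data", {}).get("emails", []):
        title = (entry.get("position") or "").lower()
        if any(k in title for k in wanted):
            contacts.append({
                "email": entry.get("value"),
                "firstName": entry.get("first_name"),
                "lastName": entry.get("last_name"),
                "jobTitle": entry.get("position"),
            })
    return contacts


# Hypothetical response fragment, shaped like the fields described above.
sample = {"data": {"emails": [
    {"value": "a.martin@example.com", "first_name": "Alice",
     "last_name": "Martin", "position": "Head of Sales"},
    {"value": "b.durand@example.com", "first_name": "Bruno",
     "last_name": "Durand", "position": "Software Engineer"},
    {"value": "c.petit@example.com", "first_name": "Chloe",
     "last_name": "Petit", "position": None},
]}}

sales_contacts = filter_by_role(sample, ["sales", "revenue"])
```

In a real run you would fetch the URL with your HTTP client of choice and pass the decoded JSON to filter_by_role.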

Email Verifier: verifying an email

The Email Verifier analyzes an email address based on several criteria: syntax format, domain existence, and SMTP verification. It returns a deliverability score out of 100 and a status among deliverable, risky, undeliverable, or unknown. This tool is essential for eliminating emails that would destroy your sender reputation.
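
One possible way to turn the verifier result into a keep/drop decision is sketched below. The field names status and score are assumed from the description above, and the threshold of 80 for risky addresses is a hypothetical policy, not a Hunter recommendation.

```python
def keep_email(verification: dict, min_score: int = 80) -> bool:
    """Decide whether an address stays in the file, based on a verifier result."""
    status = verification.get("status")
    score = verification.get("score") or 0
    if status == "deliverable":
        return True
    if status == "risky":
        # Risky addresses are kept only above the chosen score threshold.
        return score >= min_score
    # undeliverable, unknown, or any unexpected status is dropped.
    return False
```

Dropping unknown addresses is the conservative choice for protecting sender reputation; you may prefer to queue them for a later re-check instead.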

Company Enrichment: enriching a company

The Company Enrichment endpoint takes a domain as input and returns structured information about the company: name, industry, number of employees, and estimated annual revenue. This data allows you to quickly qualify a lead according to your ICP (ideal customer profile).
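
The qualification step could look like the sketch below. The dictionary keys (industry, employees) and the criteria are hypothetical: map them to whatever the enrichment response actually contains.

```python
from dataclasses import dataclass


@dataclass
class IcpCriteria:
    industries: set[str]   # lowercase industry names you target
    min_employees: int
    max_employees: int


def matches_icp(company: dict, criteria: IcpCriteria) -> bool:
    """Return True when the company fits both the industry and headcount criteria."""
    industry = (company.get("industry") or "").strip().lower()
    employees = company.get("employees")
    if employees is None:
        return False  # without a headcount the lead cannot be qualified
    in_range = criteria.min_employees <= employees <= criteria.max_employees
    return industry in criteria.industries and in_range


criteria = IcpCriteria(industries={"software", "fintech"}, min_employees=50, max_employees=500)
```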

API costs

The Hunter.io API is included in all paid plans (starting at $34/month, price as of May 2026 — check on hunter.io). The free plan gives 50 searches/month but does not include the API.

⚠️ Warning: each API call consumes credits. Domain Search = 1 credit, Email Verifier = 1 credit, Company Enrichment = 1 credit.
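
A quick way to budget credits before a run, using the per-call costs listed in the warning above. The volumes come from the example file in section 5 (1,250 raw contacts, 847 after cleaning); the function name is illustrative.

```python
# Credits per call, as stated in the warning above.
CREDITS_PER_CALL = {"domain_search": 1, "email_verifier": 1, "company_enrichment": 1}


def credits_needed(calls: dict[str, int]) -> int:
    """Total credits for a planned run, given the number of calls per endpoint."""
    return sum(CREDITS_PER_CALL[name] * count for name, count in calls.items())


raw_run = credits_needed({"email_verifier": 1250})    # verifying the raw file
clean_run = credits_needed({"email_verifier": 847})   # verifying after cleaning
saved = raw_run - clean_run
```

With these figures, cleaning before verifying saves 403 credits on a single run.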


4. Phantombuster API: automated extraction

Phantombuster offers a REST API to launch, monitor, and retrieve the results of your Phantoms (automation scripts).

Getting an API key

Go to the Phantombuster Dashboard, Settings > API section, to generate your key. As with Hunter, store it in your .env file.

Launching a Phantom

To launch a Phantom via the API, you send a POST request to the launch endpoint with the Phantom ID and the required arguments (for example, the LinkedIn search URL and the number of pages to scrape). The API returns a containerId that allows you to track the progress of the task. A polling system (checking every 30 seconds) is necessary to wait for the status to change to completed.
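
The launch-then-poll pattern can be sketched as follows. The endpoint path, header name, and body keys are assumptions to verify against the Phantombuster API documentation, and the "completed" status string follows the wording above rather than a confirmed API value. The status lookup is passed in as a function so the polling logic stays independent of the HTTP client.

```python
import json
import time
from typing import Callable

PHANTOMBUSTER_BASE_URL = "https://api.phantombuster.com/api/v2"  # assumed; check the API docs


def build_launch_request(phantom_id: str, api_key: str, argument: dict) -> dict:
    """Describe the POST request that launches a Phantom (endpoint and header assumed)."""
    return {
        "method": "POST",
        "url": f"{PHANTOMBUSTER_BASE_URL}/agents/launch",
        "headers": {"X-Phantombuster-Key": api_key, "Content-Type": "application/json"},
        "body": json.dumps({"id": phantom_id, "argument": argument}),
    }


def wait_until_done(
    container_id: str,
    fetch_status: Callable[[str], str],
    done_status: str = "completed",
    interval: float = 30.0,
    max_checks: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_status every `interval` seconds; True once it reports done_status."""
    for attempt in range(max_checks):
        if fetch_status(container_id) == done_status:
            return True
        if attempt < max_checks - 1:
            sleep(interval)  # wait before the next check
    return False  # gave up after max_checks checks
```

Capping the number of checks matters: a Phantom that never finishes should fail the pipeline rather than block it indefinitely.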

Retrieving the results

Once the Phantom is finished, the API provides a URL pointing to a JSON or CSV file containing the extracted data. For a LinkedIn Profile Scraper, each entry typically includes the first name, last name, job title, company, and LinkedIn profile URL. Simply parse this file to integrate it into your pipeline.
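
Parsing the result file might look like this sketch. The source key names in FIELD_MAP are hypothetical; inspect a real export from your Phantom and adjust the mapping.

```python
import json

# Hypothetical export keys -> pipeline column names. Adjust after inspecting a real export.
FIELD_MAP = {
    "firstName": "firstName",
    "lastName": "lastName",
    "job": "jobTitle",
    "company": "company",
    "profileUrl": "linkedinUrl",
}


def parse_profiles(raw_json: str) -> list[dict]:
    """Convert a JSON array of scraped profiles into rows with a fixed set of columns."""
    rows = []
    for entry in json.loads(raw_json):
        row = {target: (entry.get(source) or "").strip() for source, target in FIELD_MAP.items()}
        if row["linkedinUrl"]:  # a row without a profile URL cannot be traced back
            rows.append(row)
    return rows
```

Forcing every row into the same columns at this stage keeps the later merge with Hunter and Prospeo exports simple.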

Chained workflow

Phantombuster allows you to chain multiple Phantoms via the API to create a continuous extraction and enrichment workflow: the first Phantom extracts profiles from a LinkedIn search, then passes the result to the second Phantom which retrieves the details of each profile, then a third Phantom enriches the data with AI, and finally the last Phantom exports everything to Google Sheets. Each step launches automatically when the previous one is finished.


5. Cleaning and normalizing with Python

Complete cleaning script

The cleaning process takes place in several successive steps using the pandas library. First, we load the exports from each source (Prospeo, Hunter, Phantombuster) and standardize the column names (prenom → firstName, Email → email, etc.) via a mapping dictionary. Next, we merge the three DataFrames, remove duplicates based on the email address, and drop rows without an email. Normalization consists of lowercasing emails, title-casing first and last names, and validating email formats with a regular expression. The final result is exported to CSV. On a typical file of 1,250 raw contacts, this script eliminates about 283 duplicates, 89 invalid emails, and 31 empty emails, yielding 847 clean leads.
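
The cleaning logic can be sketched without dependencies, on plain lists of dictionaries. This is not the article's pandas script: in pandas the same steps map to rename, concat, drop_duplicates(subset="email") and dropna(subset=["email"]). The "nom" mapping and the regular expression are assumptions. Note that this sketch lowercases emails before deduplicating, so that Jean.Dupont@Example.com and jean.dupont@example.com count as the same contact.

```python
import re

# Source column name -> standard name. "nom" is a hypothetical extra mapping.
COLUMN_MAP = {"prenom": "firstName", "nom": "lastName", "Email": "email"}

# Simple format check: something@domain.tld, with no spaces and a single @.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")


def rename_columns(row: dict) -> dict:
    """Apply the mapping dictionary; unknown columns keep their name."""
    return {COLUMN_MAP.get(key, key): value for key, value in row.items()}


def clean_leads(*sources: list[dict]) -> list[dict]:
    """Merge exports, then drop empty, malformed, and duplicate emails."""
    seen: set[str] = set()
    clean = []
    for source in sources:
        for row in source:
            lead = rename_columns(row)
            email = (lead.get("email") or "").strip().lower()
            if not email or not EMAIL_RE.match(email):
                continue  # empty or invalid format
            if email in seen:
                continue  # duplicate of an earlier row
            seen.add(email)
            lead["email"] = email
            for field in ("firstName", "lastName"):
                if lead.get(field):
                    lead[field] = lead[field].strip().title()
            clean.append(lead)
    return clean
```

When two sources contain the same email, the first one listed wins, so pass the most reliable export first.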


6. Enriching with AI

Once the data is clean, AI adds value:

Batch enrichment

Batch enrichment involves sending each lead to a model like GPT-4o-mini with a structured prompt requesting three pieces of information in JSON: an icp_score (1-10) measuring the match with your ideal customer profile, an intent_signal detecting a potential buying signal, and a personalization_hook generating a one-sentence personalization angle. Leads are processed in batches of 50 with a 5-second pause between batches to respect the API rate limits. The results are added as new columns to the DataFrame before final export. If you want to go further in using AI to create content from this enriched data, discover how to automatically generate content with AI.
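
A sketch of the batching loop, under stated assumptions: call_model is a hypothetical function you supply (it takes a prompt and returns the model's text reply), the prompt wording is illustrative rather than tuned, and results are returned as a list of dictionaries instead of DataFrame columns.

```python
import json
import time
from typing import Callable, Iterator


def build_prompt(lead: dict) -> str:
    """Ask for exactly three JSON keys; the wording is illustrative."""
    return (
        "You qualify B2B leads. Reply with a JSON object containing exactly "
        "icp_score (integer 1-10), intent_signal (string) and "
        "personalization_hook (one sentence).\n"
        f"Lead: {json.dumps(lead, ensure_ascii=False)}"
    )


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_enrichment(raw: str) -> dict:
    """Parse the model reply; fall back to empty values if it is not usable JSON."""
    empty = {"icp_score": None, "intent_signal": "", "personalization_hook": ""}
    try:
        data = json.loads(raw)
        score = int(data["icp_score"])
    except (ValueError, KeyError, TypeError):
        return empty
    return {
        "icp_score": min(10, max(1, score)),  # clamp to the 1-10 range
        "intent_signal": str(data.get("intent_signal", "")),
        "personalization_hook": str(data.get("personalization_hook", "")),
    }


def enrich_leads(
    leads: list[dict],
    call_model: Callable[[str], str],
    batch_size: int = 50,
    pause: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Enrich leads batch by batch, pausing between batches (not after the last one)."""
    batches = list(batched(leads, batch_size))
    enriched = []
    for index, batch in enumerate(batches):
        for lead in batch:
            reply = call_model(build_prompt(lead))
            enriched.append({**lead, **parse_enrichment(reply)})
        if index < len(batches) - 1:
            sleep(pause)
    return enriched
```

Validating the reply matters because a model can return malformed JSON or a score outside the requested range; here such replies produce empty fields rather than crashing the run.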

Enrichment cost

With GPT-4o-mini (~$0.15 / 1M input tokens, price as of May 2026 — check on openai.com/pricing):
- 1,000 leads: about $0.30-0.50
- 10,000 leads: about $3-5

A negligible cost compared to the value of the enriched leads.
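
To see where such figures can come from, here is a back-of-the-envelope estimate. Only the input price comes from the text above; the output price ($0.60 / 1M tokens) and the token counts per lead (1,500 in, 200 out) are assumptions chosen for illustration, so check current pricing and measure your own prompts.

```python
def enrichment_cost(
    leads: int,
    input_tokens_per_lead: int,
    output_tokens_per_lead: int,
    input_price_per_million: float = 0.15,   # from the pricing quoted above
    output_price_per_million: float = 0.60,  # assumed; not stated in this guide
) -> float:
    """Estimated cost in dollars for enriching `leads` leads."""
    input_cost = leads * input_tokens_per_lead / 1_000_000 * input_price_per_million
    output_cost = leads * output_tokens_per_lead / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Hypothetical prompt size: 1,500 input tokens and 200 output tokens per lead.
cost_1k = enrichment_cost(1_000, 1_500, 200)
cost_10k = enrichment_cost(10_000, 1_500, 200)
```

Under these assumptions the estimate is about $0.35 for 1,000 leads and about $3.45 for 10,000, which falls inside the ranges given above.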


7. Automate the complete pipeline

Final architecture

The architecture of the final pipeline revolves around three data sources (Phantombuster API, Hunter API, and Prospeo API), each feeding a raw file. These three files are then consolidated by a Python cleaning script that generates a leads_clean.csv file. This clean file goes through the Hunter API for email verification, and then through AI enrichment via GPT-4o-mini. Finally, a Python segmentation script divides the enriched leads into three separate files (hot, warm, cold) which are respectively sent to the CRM, a nurture sequence, or put on hold.
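
The segmentation step could be as simple as the sketch below. The thresholds (8 and above for hot, 5 and above for warm) are hypothetical; tune them on your own conversion data.

```python
def segment(lead: dict, hot_min: int = 8, warm_min: int = 5) -> str:
    """Classify a lead by its icp_score; leads without a score are treated as cold."""
    score = lead.get("icp_score")
    if score is None:
        return "cold"
    if score >= hot_min:
        return "hot"
    if score >= warm_min:
        return "warm"
    return "cold"


def split_by_segment(leads: list[dict]) -> dict[str, list[dict]]:
    """Group leads into the three buckets that become the three output files."""
    buckets: dict[str, list[dict]] = {"hot": [], "warm": [], "cold": []}
    for lead in leads:
        buckets[segment(lead)].append(lead)
    return buckets
```

Each bucket can then be written to its own CSV file, for example with csv.DictWriter from the standard library.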

Automation script with scheduling

The main automation script (pipeline.py) chains the different steps of the pipeline via subprocess calls to individual Python scripts (Phantombuster extraction, Hunter extraction, cleaning, verification, AI enrichment, segmentation, CRM export). Each step is logged with its timestamp. If a step fails, the pipeline stops and displays the error, preventing corrupted data from propagating to subsequent steps. This script is designed to be launched without human intervention.
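
A minimal sketch of such an orchestrator is shown below. The script names in STEPS are hypothetical placeholders for your own files; the stop-on-first-failure behavior follows the description above.

```python
import logging
import subprocess
import sys

# Hypothetical script names, one per step described above.
STEPS = [
    "extract_phantombuster.py",
    "extract_hunter.py",
    "clean.py",
    "verify.py",
    "enrich_ai.py",
    "segment.py",
    "export_crm.py",
]

log = logging.getLogger("pipeline")


def run_pipeline(commands: list[list[str]]) -> bool:
    """Run each command in order; stop at the first non-zero exit code."""
    for command in commands:
        log.info("Starting: %s", " ".join(command))
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            log.error("Failed (exit %d): %s", result.returncode, result.stderr.strip())
            return False  # later steps never see partial data
        log.info("Finished: %s", " ".join(command))
    return True


def main() -> int:
    # asctime puts a timestamp on every log line.
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    ok = run_pipeline([[sys.executable, script] for script in STEPS])
    return 0 if ok else 1
```

At the bottom of pipeline.py, call sys.exit(main()) so that a failed run also returns a non-zero exit code to the scheduler.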

Scheduling

To run this pipeline automatically, add a crontab line scheduling the launch of pipeline.py every Monday at 6 AM, redirecting the logs to a pipeline.log file. Result: every Monday morning, your lead file is clean, verified, enriched, and segmented — without manual intervention. To go further with this type of automation, check out our guide on Cron + AI: automating smart tasks 24/7.
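
For reference, a crontab entry matching this schedule could look like the following. The project path and interpreter location are placeholders to replace with your own.

```text
# minute hour day-of-month month day-of-week  command
0 6 * * 1 cd /path/to/project && /usr/bin/python3 pipeline.py >> pipeline.log 2>&1
```

The fifth field (1) means Monday, and 2>&1 sends error output to the same log file as standard output.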


❌ Common mistakes

  • Verifying before deduplicating: if you verify duplicate emails, you consume API credits unnecessarily. Always deduplicate first.
  • Enriching non-normalized data: AI will struggle to score a lead if the job title is written in 5 different formats (CTO, cto, C.T.O., Chief Technology Officer, Directeur technique).
  • Ignoring rate limits: Hunter.io and OpenAI APIs block requests that are too fast. Plan for pauses between batches.
  • Not logging errors: without logs, a pipeline that crashes in the middle of the night is impossible to debug.
  • Forgetting the .env file: storing your API keys directly in the source code is a major security risk, especially if the repo is public.
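
To make the normalization point concrete, here is a small sketch that maps the five job title variants mentioned above to one canonical value. The alias table is illustrative and would need to grow with your own data.

```python
import re

# Illustrative alias table: cleaned, lowercase variant -> canonical title.
TITLE_ALIASES = {
    "cto": "CTO",
    "chief technology officer": "CTO",
    "directeur technique": "CTO",
}


def normalize_title(raw: str) -> str:
    """Collapse whitespace, ignore dots and case, then look up the canonical title."""
    cleaned = re.sub(r"\s+", " ", raw).strip()
    key = cleaned.replace(".", "").lower()
    # Unknown titles are returned cleaned but otherwise unchanged.
    return TITLE_ALIASES.get(key, cleaned)


variants = ["CTO", "cto", "C.T.O.", "Chief Technology Officer", "Directeur technique"]
normalized = {normalize_title(v) for v in variants}
```

All five variants collapse to a single value, which is what lets the AI scoring step treat them as the same role.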

❓ FAQ

How long does a complete pipeline take for 2,000 leads?
Around 15 to 30 minutes depending on the speed of the Phantombuster API for LinkedIn extraction and the rate limits of the OpenAI API for enrichment.

Can Hunter.io be replaced with another verification tool?
Yes, alternatives like Prospeo, MillionVerifier, or NeverBounce offer equivalent APIs. The pipeline logic remains identical, only the endpoints change.

What to do if the Phantombuster Phantom fails often?
Check that your LinkedIn session is active (Phantoms require a valid session cookie). Renew it regularly via the Phantombuster dashboard.

Is Python absolutely necessary?
No, you can replicate the same logic with Google Sheets + Apps Script for more modest volumes (< 500 leads). Python becomes essential beyond that for managing DataFrames and automated chaining.


| Tool | Usage | Link |
| --- | --- | --- |
| Hunter.io | Email verification and enrichment | Official website |
| Phantombuster | Automated LinkedIn extraction | Official website |
| OpenAI (GPT-4o-mini) | AI lead enrichment | Official website |
| Python + pandas | Cleaning and normalization | Official website |
| Prospeo | Email scraping and verification | Official website |
| Hostinger | Hosting for your automation scripts | Official website |

✅ Conclusion

Dirty data is wasted money. A bounce rate > 2% destroys your email reputation, duplicates skew your KPIs, and non-normalized data makes any serious automation impossible.

The pipeline in 5 steps:
1. Extraction: Phantombuster + Hunter.io + Prospeo
2. Cleaning: deduplication + pandas normalization
3. Verification: Hunter.io Email Verifier API
4. AI Enrichment: ICP scoring + intent signals via GPT-4o-mini
5. Segmentation: hot / warm / cold → CRM or nurture

Estimated cost: about $0.30 to $0.50 to enrich 1,000 leads via AI, negligible compared to the value of a clean pipeline.

Next step: your data is clean and enriched. Now, use your ICP to identify the best prospects, then contact them with personalized messages thanks to smart scraping with AI. If your target is international, you can also translate your content automatically with AI to adapt your outreach sequences.


A verified and enriched lead is worth 10 raw leads. Invest in quality, not volume.