AI Vision: Analyzing Images with LLMs

LLM & Models 🟡 Intermediate ⏱️ 13 min read 📅 2026-02-24

LLMs are no longer limited to reading text. Multimodal models like Claude 3.5, GPT-4V, and Gemini Pro Vision can see and understand images. OCR, photo analysis, visual QA, design review... the use cases are vast.

In this guide, we explore the available vision models, their strengths, and provide concrete examples using APIs.


👁️ Vision Models in 2025

What is a Multimodal LLM?

A classic LLM takes text as input and produces text as output. A multimodal LLM also accepts images (and sometimes audio or video) as input.

Classic LLM:    Text → Text
Multimodal LLM:   Text + Image → Text

The model "sees" the image through a visual encoder (often a Vision Transformer, ViT) that converts the image into tokens understandable by the LLM.
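Those image tokens are billed like text tokens. For Claude, Anthropic documents the approximation tokens ≈ (width × height) / 750; OpenAI and Google bill images differently (OpenAI uses 512 px tiles), so treat this as a Claude-specific ballpark rather than an exact count:

```python
def estimate_image_tokens(width: int, height: int) -> int:
    """Rough token estimate for one image sent to Claude.

    Uses Anthropic's documented approximation: tokens ~ (w * h) / 750.
    Other providers bill images differently, so this is Claude-only.
    """
    return int(width * height / 750)

# A 1092x1092 image comes out around 1,590 tokens
print(estimate_image_tokens(1092, 1092))
```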

Main Models

| Model | Publisher | Max resolution | Strengths | Price (input) |
|---|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | 8000×8000 | Detailed analysis, reasoning | ~$3/M tokens |
| Claude 3.5 Haiku | Anthropic | 8000×8000 | Fast, good value | ~$0.80/M tokens |
| GPT-4o | OpenAI | 2048×2048 | Versatile, excellent OCR | ~$2.50/M tokens |
| GPT-4o-mini | OpenAI | 2048×2048 | Budget-friendly, good quality | ~$0.15/M tokens |
| Gemini 2.0 Flash | Google | Very high | Large context, fast | ~$0.10/M tokens |
| Gemini 1.5 Pro | Google | Very high | Native video, 2M-token context | ~$1.25/M tokens |
| Llama 3.2 Vision | Meta | 1120×1120 | Open source, runs locally | Free (self-hosted) |

How to Choose?

- Need precise OCR? → GPT-4o or Claude 3.5 Sonnet
- Need detailed analysis? → Claude 3.5 Sonnet
- Tight budget? → GPT-4o-mini or Gemini Flash
- Sensitive data (local)? → Llama 3.2 Vision
- Video analysis? → Gemini 1.5 Pro
- High volume? → Gemini Flash or GPT-4o-mini
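The decision list above can be captured as a small lookup table. The mapping and model identifiers below are illustrative (taken from the comparison table) and will drift as providers ship new versions, so treat this as a sketch, not a recommendation engine:

```python
# Illustrative use-case -> model mapping based on the decision list above.
# Model identifiers change over time; check each provider's current docs.
MODEL_BY_USE_CASE = {
    "ocr": "gpt-4o",
    "detailed_analysis": "claude-3-5-sonnet-20241022",
    "budget": "gpt-4o-mini",
    "local": "llama-3.2-vision",       # self-hosted, for sensitive data
    "video": "gemini-1.5-pro",
    "high_volume": "gemini-2.0-flash",
}

def pick_model(use_case: str) -> str:
    """Return a model for a use case, defaulting to the cheapest hosted option."""
    return MODEL_BY_USE_CASE.get(use_case, "gpt-4o-mini")

print(pick_model("video"))
```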

🔍 Concrete Use Cases

1. OCR — Extracting Text from Images

Classic OCR (Tesseract) is limited to well-formatted text. Vision LLMs understand context: they can read a receipt, a handwritten table, or a screenshot with complex layout.

Example: Extracting data from an invoice

import anthropic
import base64

client = anthropic.Anthropic()

# Load the image
with open("invoice.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Extract all information from this invoice "
                        "in JSON format: number, date, supplier, "
                        "lines (description, quantity, unit price, total), "
                        "total before tax, VAT, total including tax."
                    )
                }
            ],
        }
    ],
)

print(message.content[0].text)

Typical result:

{
  "invoice_number": "IN-2025-0142",
  "date": "2025-01-15",
  "supplier": "TechServ SARL",
  "lines": [
    {
      "description": "VPS Standard Hosting",
      "quantity": 1,
      "unit_price": 29.99,
      "total": 29.99
    },
    {
      "description": ".fr Domain Name",
      "quantity": 1,
      "unit_price": 12.00,
      "total": 12.00
    }
  ],
  "total_before_tax": 41.99,
  "vat": 8.40,
  "total_including_tax": 50.39
}
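Since the model returns free-form text, production code should parse this JSON defensively: models sometimes wrap the object in markdown fences or add a sentence before it. A minimal sketch (the fence-stripping heuristic reflects common model behavior, not an API guarantee):

```python
import json
import re

def parse_model_json(text: str) -> dict:
    """Extract the first JSON object from a model reply.

    Handles replies that wrap the JSON in ```json fences or add
    prose around it, by locating the outermost braces before parsing.
    """
    text = re.sub(r"```(?:json)?", "", text)  # strip optional markdown fences
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start : end + 1])

invoice = parse_model_json('Here it is:\n```json\n{"vat": 8.40}\n```')
print(invoice["vat"])
```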

Comparison of classic OCR vs. LLM Vision:

| Criterion | Tesseract (classic OCR) | LLM Vision |
|---|---|---|
| Simple printed text | ✅ Excellent | ✅ Excellent |
| Handwritten text | ❌ Poor | ✅ Good |
| Complex tables | ❌ Often fails | ✅ Understands structure |
| Semantic context | ❌ None | ✅ Understands meaning |
| Multilingual | ⚡ With configuration | ✅ Native |
| Cost | Free | Paid (API) |
| Speed | ⚡ Very fast | 🐢 Slower |

2. Photo Analysis — Understanding Visual Content

Vision LLMs don't just read text; they understand what they see: objects, people, scenes, emotions, style.

Example: Analyzing a product photo

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Analyze this product photo for an e-commerce site. "
                        "Provide:\n"
                        "1. Short description (1 sentence)\n"
                        "2. Long description (SEO, 100 words)\n"
                        "3. Tags (5-10 keywords)\n"
                        "4. Dominant colors\n"
                        "5. Suggestions for improving the photo"
                    )
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/product.jpg",
                        "detail": "high"  # high, low, or auto
                    }
                }
            ],
        }
    ],
    max_tokens=500
)

print(response.choices[0].message.content)
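High-resolution photos cost more tokens and upload time, so it often pays to downscale before encoding. Here is a sketch of the dimension math, assuming Claude's documented 1568 px long-edge cap (OpenAI tiles images differently, so adjust per provider); the actual resize would be done with a library such as Pillow:

```python
def fit_within(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Compute dimensions that preserve aspect ratio while capping the
    longest edge at max_edge. 1568 px is the size Claude scales images
    down to anyway; sending larger images just adds latency and cost."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough, no resize needed
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

print(fit_within(4000, 3000))  # → (1568, 1176)
```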

3. Design Review and Interface Analysis

A powerful use case: having a UI/UX design reviewed by a vision LLM.

import anthropic
import base64

client = anthropic.Anthropic()

with open("design.png", "rb") as f:
    img = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": img,
                    },
                },
                {
                    "type": "text",
                    "text": """Analyze this web interface design.

Rate out of 10:
1. Visual hierarchy
2. Readability
3. Spacing consistency
4. Accessibility (contrast, text size)
5. Responsive readiness

For each point < 8/10, suggest a concrete improvement.
Identify any potential UX issues."""
                }
            ],
        }
    ],
)

print(message.content[0].text)

Typical result:

## UX Analysis of the Design

### 1. Visual Hierarchy: 7/10
The main title is visible, but secondary CTAs have the same visual weight as the primary CTA.
→ **Suggestion**: reduce the size of secondary buttons, increase the contrast of the primary button.

### 2. Readability: 8/10
Good font size, correct line spacing.
Slight lack of contrast on light gray text (#999) on white background.
→ **Suggestion**: change gray to #666 minimum (4.5:1 ratio).
...

4. Visual QA — Detecting Interface Bugs

Compare a design mockup with a screenshot of the implementation:

import anthropic
import base64

client = anthropic.Anthropic()

def load_image(path):
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

design = load_image("figma_design.png")
implementation = load_image("prod_screenshot.png")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Image 1: the Figma design (reference)"
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": design,
                    },
                },
                {
                    "type": "text",
                    "text": "Image 2: the screenshot of the implementation"
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": implementation,
                    },
                },
                {
                    "type": "text",
                    "text": """Compare the design (image 1) and the implementation (image 2).

List ALL visual differences:
- Different colors
- Incorrect spacing
- Missing or extra elements
- Different fonts
- Misaligned elements
- Different sizes

Format: table with columns Element | Design | Implementation | Severity (high/medium/low)"""
                }
            ],
        }
    ],
)

print(message.content[0].text)

💻 Using Vision with APIs

Anthropic API (Claude)

import anthropic
import base64
import httpx

client = anthropic.Anthropic()

# Method 1: Image in base64
with open("image.png", "rb") as f:
    image_base64 = base64.standard_b64encode(f.read()).decode("utf-8")

# Method 2: Image fetched from a URL, then base64-encoded
# (the Messages API used here takes base64 data, not a remote URL)
image_url = "https://example.com/image.jpg"
image_data = base64.standard_b64encode(
    httpx.get(image_url).content
).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Describe this image in detail."
                }
            ],
        }
    ],
)

Formats supported by Claude: JPEG, PNG, GIF, WebP
Max size: 20 MB per image, up to 100 images per request
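To fail fast instead of burning an API call, you can validate format and size against these limits before building the request. A minimal helper (the 20 MB figure mirrors the limit quoted above; check the current Anthropic docs before relying on it):

```python
import base64
import os

# Extensions Claude accepts, mapped to their media types
MEDIA_TYPES = {
    ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
    ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp",
}
MAX_BYTES = 20 * 1024 * 1024  # per-image limit quoted above

def image_block(path: str) -> dict:
    """Build a Claude image content block, validating format and size first."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("Image exceeds the 20 MB per-image limit")
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": MEDIA_TYPES[ext], "data": data},
    }
```

The returned dict drops straight into the `content` list of a `messages.create` call, as in the examples above.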

OpenAI API (GPT-4V / GPT-4o)

from openai import OpenAI

client = OpenAI()

# Method 1: Direct URL
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg",
                        "detail": "high"  # low, high, auto
                    }
                },
            ],
        }
    ],
    max_tokens=300,
)

# Method 2: Base64
import base64

with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{b64}"
                    }
                },
            ],
        }
    ],
    max_tokens=300,
)