๐Ÿ“‘ Table of contents

AI Vision: Analyzing Images with LLMs

AI Vision: Analyzing Images with LLMs

LLM & Modรจles ๐ŸŸก Intermediate โฑ๏ธ 12 min read ๐Ÿ“… 2026-02-24

AI Vision: analyzing images with LLMs

LLMs are no longer limited to reading text. Multimodal models like Claude 3.5, GPT-4V, and Gemini Pro Vision can see and understand images. OCR, photo analysis, visual QA, design mockup reviews... the use cases are immense.

In this guide, we explore the available vision models, their strengths, and code concrete examples using the APIs.


The essentials

  • Multimodal LLMs combine text and images as input to produce text as output, thanks to a visual encoder (Vision Transformer).
  • The main vision models in 2025: Claude 3.5 Sonnet (detailed analysis), GPT-4o (versatile, excellent OCR), Gemini 2.0 Flash (fast, massive context), and Llama 3.2 Vision (open source, local).
  • Major use cases: intelligent OCR, product photo analysis, UI/UX mockup review, visual QA, accessibility audit, automatic classification, data extraction from charts.
  • Vision cost depends directly on the resolution of the sent image: optimizing the size before sending can divide the bill by 10.
  • A multi-model strategy (GPT-4o-mini for classification, Claude 3.5 Sonnet for fine analysis) allows you to control costs without sacrificing quality.

๐Ÿ‘๏ธ Vision models in 2025

What is a multimodal LLM?

A classic LLM takes text as input and produces text as output. A multimodal LLM also accepts images (and sometimes audio or video) as input.

Classic LLM:    Text โ†’ Text
Multimodal LLM:   Text + Image โ†’ Text

The model "sees" the image thanks to a visual encoder (often a Vision Transformer, ViT) that converts the image into tokens understandable by the LLM.

Main models

Model Publisher Max resolution Key strengths Price (input)
Claude 3.5 Sonnet Anthropic 8000ร—8000 Detailed analysis, reasoning ~$3/M tokens
Claude 3.5 Haiku Anthropic 8000ร—8000 Fast, good quality/price ratio ~$0.80/M tokens
GPT-4o OpenAI 2048ร—2048 Versatile, excellent OCR ~$2.50/M tokens
GPT-4o-mini OpenAI 2048ร—2048 Budget-friendly, decent ~$0.15/M tokens
Gemini 2.0 Flash Google Very high Huge context, fast ~$0.10/M tokens
Gemini 1.5 Pro Google Very high Native video, 2M tokens ~$1.25/M tokens
Llama 3.2 Vision Meta 1120ร—1120 Open source, local Free (self-hosted)

How to choose?

Need accurate OCR? โ†’ GPT-4o or Claude 3.5 Sonnet
Need detailed analysis? โ†’ Claude 3.5 Sonnet
Tight budget? โ†’ GPT-4o-mini or Gemini Flash
Sensitive data (local)? โ†’ Llama 3.2 Vision
Video analysis? โ†’ Gemini 1.5 Pro
High volume? โ†’ Gemini Flash or GPT-4o-mini

๐Ÿ” Concrete use cases

1. OCR โ€” Text extraction from images

Traditional OCR (Optical Character Recognition) (Tesseract) is limited to well-formatted texts. Vision LLMs understand the context: they can read a receipt, a handwritten table, or a screenshot with a complex layout.

Example: extracting data from an invoice

Via the Anthropic API, the image is sent in base64 along with a structured prompt requesting a JSON with the invoice number, date, supplier, detailed line items, and subtotal/VAT/total amounts. The model directly returns a usable JSON, ready to be injected into a database or an ERP.

Typical result:

{
  "numero_facture": "FA-2025-0142",
  "date": "2025-01-15",
  "fournisseur": "TechServ SARL",
  "lignes": [
    {
      "description": "Hรฉbergement VPS Standard",
      "quantite": 1,
      "prix_unitaire": 29.99,
      "total": 29.99
    },
    {
      "description": "Nom de domaine .fr",
      "quantite": 1,
      "prix_unitaire": 12.00,
      "total": 12.00
    }
  ],
  "total_ht": 41.99,
  "tva": 8.40,
  "total_ttc": 50.39
}

Comparison: Traditional OCR vs Vision LLM:

Criterion Tesseract (OCR) Vision LLM
Simple printed text โœ… Excellent โœ… Excellent
Handwritten text โŒ Poor โœ… Good
Complex tables โŒ Often fails โœ… Understands structure
Semantic context โŒ None โœ… Understands meaning
Multilingual โšก With config โœ… Native
Cost Free Paid (API)
Speed โšก Very fast ๐Ÿข Slower

2. Photo analysis โ€” Understanding visual content

Vision LLMs don't just read text. They understand what they see: objects, people, scenes, emotions, style.

Example: analyzing a product photo

The OpenAI API allows you to send an image via URL with the detail parameter (low, high, or auto). By combining this with a structured prompt, you get in a single API call: a short description, a long SEO description, tags, dominant colors, and suggestions for improving the photo. Ideal for automatically populating e-commerce product listings.

3. Reviewing mockups and interfaces

A powerful use case: having a vision LLM review a UI/UX mockup. Via the Anthropic API, the mockup is sent in base64 with a prompt asking for a numerical score (out of 10) on the visual hierarchy, readability, spacing consistency, accessibility, and responsive-readiness. The model identifies weak points and proposes concrete corrections.

Typical result:

## ๐ŸŽจ UX Analysis of the Mockup

### 1. Visual Hierarchy: 7/10
The main title is clearly visible, but the secondary CTAs
have the same visual weight as the primary CTA.
โ†’ **Suggestion**: reduce the size of the secondary buttons,
  increase the contrast of the primary button.

### 2. Readability: 8/10
Good font size, correct line spacing.
Slight lack of contrast on the light gray text (#999) on a white background.
โ†’ **Suggestion**: change the gray to #666 minimum (4.5:1 ratio).

### 3. Spacing Consistency: 6/10
The margins between sections vary (32px, 24px, 40px).
โ†’ **Suggestion**: standardize to 32px or use a spacing
  system (8px grid).
...

4. Visual QA โ€” Interface Bug Detection

Compare a mockup and a screenshot of the implementation by sending both images to Claude 3.5 Sonnet. The prompt requests a structured table with the columns Element | Mockup | Implementation | Severity (high/medium/low). The model detects color differences, spacing issues, missing or extra elements, different fonts, and misaligned elements.

5. Accessibility โ€” Automated Web Image Audit

The Anthropic API can automatically audit a web image: the image is sent along with its current alt text, and the model returns a JSON containing the alt text quality (good, medium, or poor), a suggested alt text, whether the image is decorative, whether it contains embedded text, contrast issues, and a list of recommendations. Good alt text is concise, descriptive, and conveys the essential information of the image.

6. Automated Photo Classification and Sorting

Claude 3.5 Haiku is sufficient for simple and fast classification. The principle: send each image with a prompt asking to respond with a single word from a list of predefined categories (landscape, portrait, food, animal, architecture, document, screenshot, product, other). A Python script scans a source folder, classifies each image, and copies it into the corresponding subfolder, with a statistical summary at the end of the process.

7. Data Extraction from Charts

Vision LLMs can read charts and extract the underlying data. Via the Anthropic API, the chart is sent with a prompt requesting a structured JSON containing the chart type, title, axes, extracted data, and 2-3 key observations. To understand the cost implications associated with this type of processing, check out our guide on LLM billing (tokens, context, costs).

8. Visual Monitoring and Anomaly Detection

Combine AI vision with a camera or screenshots to detect changes. By sending two images (before/after) to Claude 3.5 Sonnet with a structured prompt, the model returns a JSON indicating whether there are changes, their severity (none/low/medium/high/critical), a detailed list of modifications (addition, deletion, modification), and a summary. Ideal for automated visual monitoring of web pages.


๐Ÿ’ป Using vision with APIs

Anthropic API (Claude)

The Anthropic API accepts base64 images directly in the message's content field. Supported formats: JPEG, PNG, GIF, WebP. Max size: 20 MB per image, up to 100 images per request. You can load an image from a local file or from a URL (via httpx for example).

OpenAI API (GPT-4V / GPT-4o)

The OpenAI API offers two methods: sending via direct URL (the simplest) or in base64 (via a data URI). The detail parameter controls the precision:

Value Tokens consumed Use case
low ~85 fixed tokens Quick preview, classification
high ~85 + 170ร—tiles Detailed analysis, OCR
auto Chosen by the model Default

Google API (Gemini)

The google-generativeai SDK allows you to pass a PIL.Image object directly to generate_content, from a local file or from a URL. Gemini's key advantage: a 2M token context, which allows for analyzing entire videos frame by frame.

Via OpenRouter (all models)

If you use OpenRouter, you access all these models via a single API compatible with the OpenAI format. Simply change the base_url and the model name (anthropic/claude-3.5-sonnet, google/gemini-2.0-flash, etc.) to switch providers without modifying the rest of the code.


โšก Optimizing performance and costs

Reducing image size

High-resolution images consume a lot of tokens. Before sending, resize with PIL: if the largest dimension exceeds a threshold (e.g., 1024 px), apply a downscaling ratio with Image.LANCZOS, then convert to JPEG with a configurable quality (85 by default). This simple optimization can divide token consumption by 4 to 10.

Vision cost calculation

Claude (Anthropic): tokens are calculated using the formula Tokens โ‰ˆ (width ร— height) / 750. A 1000ร—1000 image consumes ~1334 tokens, a 4000ร—4000 one around ~21334 tokens.

GPT-4o (OpenAI): in low mode, it's a fixed 85 tokens. In high mode, it's 85 + 170 ร— the number of 512ร—512 tiles. Therefore, a 1024ร—1024 consumes ~765 tokens, a 2048ร—2048 ~2805 tokens.

Multi-model strategy

Adapt the model to the task: simple classification โ†’ GPT-4o-mini in low mode (~$0.001/image), OCR โ†’ GPT-4o in high mode (~$0.01/image), detailed analysis โ†’ Claude 3.5 Sonnet (~$0.02/image), bulk processing โ†’ Gemini Flash (~$0.001/image). This approach can divide costs by 20 while maintaining quality where it matters.


๐Ÿ› ๏ธ Practical project: automatic image analyzer

A complete script for automatic folder analysis works like this: it scans a directory, encodes each image in base64 with automatic MIME type detection, sends each image to Claude 3.5 Sonnet with a prompt requesting structured JSON (description, category, detected objects, contained text, dominant colors, mood, quality score, suggestions). The results are aggregated and saved in a rapport_images.json file. As output, you get for each image its short description, its category, and a quality score โ€” all without manual intervention.


๐Ÿ”ฎ The future of AI vision

  • Native video: Gemini already analyzes videos, others will follow
  • Real-time vision: camera stream analysis with < 1s latency
  • Generation + understanding: models that see AND create images
  • Visual agents: agents that navigate interfaces by "looking" at the screen
  • Plummeting costs: vision will become virtually free by the end of 2025

Vision + AI Agents

The most powerful combination: an AI agent that sees its environment. OpenClaw already uses vision to:

  • Analyze screenshots (browser automation)
  • Read images sent by the user
  • Visually verify results (QA)

This is the next frontier of autonomous AI: agents that understand the visual world as well as text. To go further on choosing models that integrate these capabilities, check out our article on DeepSeek V4 : deux nouveaux modรจles โ€” Pro et Flash โ€” changent la donne.


Common mistakes

  • Sending images that are too large: a 4K photo can consume 20,000+ tokens. Systematically resize before sending.
  • Using the most expensive model for everything: a GPT-4o-mini is sufficient for classification, no need to pay for Claude 3.5 Sonnet.
  • Forgetting the detail parameter with OpenAI: without specifying it, the model defaults to auto and may over-consume tokens.
  • Not structuring the output prompt: without asking for a specific JSON, the model returns free text that needs to be parsed manually.
  • Ignoring the image format: JPEG is 5 to 10ร— lighter than PNG for photos. Prefer it unless you need transparency.


FAQ

Can multiple images be sent in a single request?
Yes. Claude accepts up to 100 images per request, and OpenAI and Gemini also accept several. This is useful for comparison (mockup vs. implementation) or analyzing multi-page documents.

What resolution should be used?
It depends on the task. For classification, 512ร—512 is sufficient. For OCR or detailed analysis, aim for 1024ร—1024 to 1568ร—1568. Beyond that, the quality gain is marginal but the cost skyrockets.

Are open source models viable for vision?
Llama 3.2 Vision (1120ร—1120 max) is usable for simple tasks locally, but it falls far short of proprietary models on complex OCR or fine analysis. For local deployment with sensitive data, however, it is the best option. To learn more about recent open source alternatives, check out our article on Qwen3.6 : Alibaba dรฉbarque avec une nouvelle famille de modรจles LLM.

How should images containing sensitive text be handled?
If the data is sensitive (invoices, legal documents, medical data), prioritize a local model (Llama 3.2 Vision) or ensure that the API provider does not retain the data. Anthropic and OpenAI offer non-retention options for enterprise plans.

Does vision replace traditional OCR?
Not always. Tesseract is free, instantaneous, and works offline. For simple and well-formatted documents, it remains relevant. Vision LLMs, however, excel on complex, handwritten documents, or when it is necessary to understand the semantic context.


Conclusion

AI vision with LLMs is no longer an experiment: it is a ready-to-use production tool. Whether to automate OCR, audit interfaces, classify images, or extract data from charts, models like Claude 3.5 Sonnet, GPT-4o and Gemini 2.0 Flash cover most needs.

The key lies in three practices: choosing the right model based on the task (don't overpay), systematically optimizing image sizes before sending, and structuring output prompts to get directly actionable results. By applying these principles, AI vision becomes a concrete and controlled productivity lever.
```