AI Vision: analyzing images with LLMs
LLMs are no longer limited to reading text. Multimodal models like Claude 3.5, GPT-4V, and Gemini Pro Vision can see and understand images. OCR, photo analysis, visual QA, design mockup reviews... the use cases are immense.
In this guide, we explore the available vision models, their strengths, and code concrete examples using the APIs.
The essentials
- Multimodal LLMs combine text and images as input to produce text as output, thanks to a visual encoder (Vision Transformer).
- The main vision models in 2025: Claude 3.5 Sonnet (detailed analysis), GPT-4o (versatile, excellent OCR), Gemini 2.0 Flash (fast, massive context), and Llama 3.2 Vision (open source, local).
- Major use cases: intelligent OCR, product photo analysis, UI/UX mockup review, visual QA, accessibility audit, automatic classification, data extraction from charts.
- Vision cost depends directly on the resolution of the sent image: optimizing the size before sending can divide the bill by 10.
- A multi-model strategy (GPT-4o-mini for classification, Claude 3.5 Sonnet for fine analysis) allows you to control costs without sacrificing quality.
๐๏ธ Vision models in 2025
What is a multimodal LLM?
A classic LLM takes text as input and produces text as output. A multimodal LLM also accepts images (and sometimes audio or video) as input.
Classic LLM: Text โ Text
Multimodal LLM: Text + Image โ Text
The model "sees" the image thanks to a visual encoder (often a Vision Transformer, ViT) that converts the image into tokens understandable by the LLM.
Main models
| Model | Publisher | Max resolution | Key strengths | Price (input) |
|---|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | 8000ร8000 | Detailed analysis, reasoning | ~$3/M tokens |
| Claude 3.5 Haiku | Anthropic | 8000ร8000 | Fast, good quality/price ratio | ~$0.80/M tokens |
| GPT-4o | OpenAI | 2048ร2048 | Versatile, excellent OCR | ~$2.50/M tokens |
| GPT-4o-mini | OpenAI | 2048ร2048 | Budget-friendly, decent | ~$0.15/M tokens |
| Gemini 2.0 Flash | Very high | Huge context, fast | ~$0.10/M tokens | |
| Gemini 1.5 Pro | Very high | Native video, 2M tokens | ~$1.25/M tokens | |
| Llama 3.2 Vision | Meta | 1120ร1120 | Open source, local | Free (self-hosted) |
How to choose?
Need accurate OCR? โ GPT-4o or Claude 3.5 Sonnet
Need detailed analysis? โ Claude 3.5 Sonnet
Tight budget? โ GPT-4o-mini or Gemini Flash
Sensitive data (local)? โ Llama 3.2 Vision
Video analysis? โ Gemini 1.5 Pro
High volume? โ Gemini Flash or GPT-4o-mini
๐ Concrete use cases
1. OCR โ Text extraction from images
Traditional OCR (Optical Character Recognition) (Tesseract) is limited to well-formatted texts. Vision LLMs understand the context: they can read a receipt, a handwritten table, or a screenshot with a complex layout.
Example: extracting data from an invoice
Via the Anthropic API, the image is sent in base64 along with a structured prompt requesting a JSON with the invoice number, date, supplier, detailed line items, and subtotal/VAT/total amounts. The model directly returns a usable JSON, ready to be injected into a database or an ERP.
Typical result:
{
"numero_facture": "FA-2025-0142",
"date": "2025-01-15",
"fournisseur": "TechServ SARL",
"lignes": [
{
"description": "Hรฉbergement VPS Standard",
"quantite": 1,
"prix_unitaire": 29.99,
"total": 29.99
},
{
"description": "Nom de domaine .fr",
"quantite": 1,
"prix_unitaire": 12.00,
"total": 12.00
}
],
"total_ht": 41.99,
"tva": 8.40,
"total_ttc": 50.39
}
Comparison: Traditional OCR vs Vision LLM:
| Criterion | Tesseract (OCR) | Vision LLM |
|---|---|---|
| Simple printed text | โ Excellent | โ Excellent |
| Handwritten text | โ Poor | โ Good |
| Complex tables | โ Often fails | โ Understands structure |
| Semantic context | โ None | โ Understands meaning |
| Multilingual | โก With config | โ Native |
| Cost | Free | Paid (API) |
| Speed | โก Very fast | ๐ข Slower |
2. Photo analysis โ Understanding visual content
Vision LLMs don't just read text. They understand what they see: objects, people, scenes, emotions, style.
Example: analyzing a product photo
The OpenAI API allows you to send an image via URL with the detail parameter (low, high, or auto). By combining this with a structured prompt, you get in a single API call: a short description, a long SEO description, tags, dominant colors, and suggestions for improving the photo. Ideal for automatically populating e-commerce product listings.
3. Reviewing mockups and interfaces
A powerful use case: having a vision LLM review a UI/UX mockup. Via the Anthropic API, the mockup is sent in base64 with a prompt asking for a numerical score (out of 10) on the visual hierarchy, readability, spacing consistency, accessibility, and responsive-readiness. The model identifies weak points and proposes concrete corrections.
Typical result:
## ๐จ UX Analysis of the Mockup
### 1. Visual Hierarchy: 7/10
The main title is clearly visible, but the secondary CTAs
have the same visual weight as the primary CTA.
โ **Suggestion**: reduce the size of the secondary buttons,
increase the contrast of the primary button.
### 2. Readability: 8/10
Good font size, correct line spacing.
Slight lack of contrast on the light gray text (#999) on a white background.
โ **Suggestion**: change the gray to #666 minimum (4.5:1 ratio).
### 3. Spacing Consistency: 6/10
The margins between sections vary (32px, 24px, 40px).
โ **Suggestion**: standardize to 32px or use a spacing
system (8px grid).
...
4. Visual QA โ Interface Bug Detection
Compare a mockup and a screenshot of the implementation by sending both images to Claude 3.5 Sonnet. The prompt requests a structured table with the columns Element | Mockup | Implementation | Severity (high/medium/low). The model detects color differences, spacing issues, missing or extra elements, different fonts, and misaligned elements.
5. Accessibility โ Automated Web Image Audit
The Anthropic API can automatically audit a web image: the image is sent along with its current alt text, and the model returns a JSON containing the alt text quality (good, medium, or poor), a suggested alt text, whether the image is decorative, whether it contains embedded text, contrast issues, and a list of recommendations. Good alt text is concise, descriptive, and conveys the essential information of the image.
6. Automated Photo Classification and Sorting
Claude 3.5 Haiku is sufficient for simple and fast classification. The principle: send each image with a prompt asking to respond with a single word from a list of predefined categories (landscape, portrait, food, animal, architecture, document, screenshot, product, other). A Python script scans a source folder, classifies each image, and copies it into the corresponding subfolder, with a statistical summary at the end of the process.
7. Data Extraction from Charts
Vision LLMs can read charts and extract the underlying data. Via the Anthropic API, the chart is sent with a prompt requesting a structured JSON containing the chart type, title, axes, extracted data, and 2-3 key observations. To understand the cost implications associated with this type of processing, check out our guide on LLM billing (tokens, context, costs).
8. Visual Monitoring and Anomaly Detection
Combine AI vision with a camera or screenshots to detect changes. By sending two images (before/after) to Claude 3.5 Sonnet with a structured prompt, the model returns a JSON indicating whether there are changes, their severity (none/low/medium/high/critical), a detailed list of modifications (addition, deletion, modification), and a summary. Ideal for automated visual monitoring of web pages.
๐ป Using vision with APIs
Anthropic API (Claude)
The Anthropic API accepts base64 images directly in the message's content field. Supported formats: JPEG, PNG, GIF, WebP. Max size: 20 MB per image, up to 100 images per request. You can load an image from a local file or from a URL (via httpx for example).
OpenAI API (GPT-4V / GPT-4o)
The OpenAI API offers two methods: sending via direct URL (the simplest) or in base64 (via a data URI). The detail parameter controls the precision:
| Value | Tokens consumed | Use case |
|---|---|---|
low |
~85 fixed tokens | Quick preview, classification |
high |
~85 + 170รtiles | Detailed analysis, OCR |
auto |
Chosen by the model | Default |
Google API (Gemini)
The google-generativeai SDK allows you to pass a PIL.Image object directly to generate_content, from a local file or from a URL. Gemini's key advantage: a 2M token context, which allows for analyzing entire videos frame by frame.
Via OpenRouter (all models)
If you use OpenRouter, you access all these models via a single API compatible with the OpenAI format. Simply change the base_url and the model name (anthropic/claude-3.5-sonnet, google/gemini-2.0-flash, etc.) to switch providers without modifying the rest of the code.
โก Optimizing performance and costs
Reducing image size
High-resolution images consume a lot of tokens. Before sending, resize with PIL: if the largest dimension exceeds a threshold (e.g., 1024 px), apply a downscaling ratio with Image.LANCZOS, then convert to JPEG with a configurable quality (85 by default). This simple optimization can divide token consumption by 4 to 10.
Vision cost calculation
Claude (Anthropic): tokens are calculated using the formula Tokens โ (width ร height) / 750. A 1000ร1000 image consumes ~1334 tokens, a 4000ร4000 one around ~21334 tokens.
GPT-4o (OpenAI): in low mode, it's a fixed 85 tokens. In high mode, it's 85 + 170 ร the number of 512ร512 tiles. Therefore, a 1024ร1024 consumes ~765 tokens, a 2048ร2048 ~2805 tokens.
Multi-model strategy
Adapt the model to the task: simple classification โ GPT-4o-mini in low mode (~$0.001/image), OCR โ GPT-4o in high mode (~$0.01/image), detailed analysis โ Claude 3.5 Sonnet (~$0.02/image), bulk processing โ Gemini Flash (~$0.001/image). This approach can divide costs by 20 while maintaining quality where it matters.
๐ ๏ธ Practical project: automatic image analyzer
A complete script for automatic folder analysis works like this: it scans a directory, encodes each image in base64 with automatic MIME type detection, sends each image to Claude 3.5 Sonnet with a prompt requesting structured JSON (description, category, detected objects, contained text, dominant colors, mood, quality score, suggestions). The results are aggregated and saved in a rapport_images.json file. As output, you get for each image its short description, its category, and a quality score โ all without manual intervention.
๐ฎ The future of AI vision
2025-2026 trends
- Native video: Gemini already analyzes videos, others will follow
- Real-time vision: camera stream analysis with < 1s latency
- Generation + understanding: models that see AND create images
- Visual agents: agents that navigate interfaces by "looking" at the screen
- Plummeting costs: vision will become virtually free by the end of 2025
Vision + AI Agents
The most powerful combination: an AI agent that sees its environment. OpenClaw already uses vision to:
- Analyze screenshots (browser automation)
- Read images sent by the user
- Visually verify results (QA)
This is the next frontier of autonomous AI: agents that understand the visual world as well as text. To go further on choosing models that integrate these capabilities, check out our article on DeepSeek V4 : deux nouveaux modรจles โ Pro et Flash โ changent la donne.
Common mistakes
- Sending images that are too large: a 4K photo can consume 20,000+ tokens. Systematically resize before sending.
- Using the most expensive model for everything: a GPT-4o-mini is sufficient for classification, no need to pay for Claude 3.5 Sonnet.
- Forgetting the
detailparameter with OpenAI: without specifying it, the model defaults toautoand may over-consume tokens. - Not structuring the output prompt: without asking for a specific JSON, the model returns free text that needs to be parsed manually.
- Ignoring the image format: JPEG is 5 to 10ร lighter than PNG for photos. Prefer it unless you need transparency.
Recommended tools
- OpenRouter โ Access all vision models via a single API
- Claude Anthropic โ The most accurate vision model for detailed analysis
- What is OpenClaw? โ The agent that natively integrates vision
- Automate your life with AI โ Combine vision and automation
- Configure OpenClaw โ Enable your agent's vision capabilities
FAQ
Can multiple images be sent in a single request?
Yes. Claude accepts up to 100 images per request, and OpenAI and Gemini also accept several. This is useful for comparison (mockup vs. implementation) or analyzing multi-page documents.
What resolution should be used?
It depends on the task. For classification, 512ร512 is sufficient. For OCR or detailed analysis, aim for 1024ร1024 to 1568ร1568. Beyond that, the quality gain is marginal but the cost skyrockets.
Are open source models viable for vision?
Llama 3.2 Vision (1120ร1120 max) is usable for simple tasks locally, but it falls far short of proprietary models on complex OCR or fine analysis. For local deployment with sensitive data, however, it is the best option. To learn more about recent open source alternatives, check out our article on Qwen3.6 : Alibaba dรฉbarque avec une nouvelle famille de modรจles LLM.
How should images containing sensitive text be handled?
If the data is sensitive (invoices, legal documents, medical data), prioritize a local model (Llama 3.2 Vision) or ensure that the API provider does not retain the data. Anthropic and OpenAI offer non-retention options for enterprise plans.
Does vision replace traditional OCR?
Not always. Tesseract is free, instantaneous, and works offline. For simple and well-formatted documents, it remains relevant. Vision LLMs, however, excel on complex, handwritten documents, or when it is necessary to understand the semantic context.
Conclusion
AI vision with LLMs is no longer an experiment: it is a ready-to-use production tool. Whether to automate OCR, audit interfaces, classify images, or extract data from charts, models like Claude 3.5 Sonnet, GPT-4o and Gemini 2.0 Flash cover most needs.
The key lies in three practices: choosing the right model based on the task (don't overpay), systematically optimizing image sizes before sending, and structuring output prompts to get directly actionable results. By applying these principles, AI vision becomes a concrete and controlled productivity lever.
```