AI Vision: Analyzing Images with LLMs
LLMs are no longer limited to reading text. Multimodal models such as Claude 3.5 Sonnet, GPT-4o, and Gemini can see and understand images: OCR, photo analysis, visual question answering, design review. The use cases are vast.
In this guide, we explore the available vision models, their strengths, and provide concrete examples using APIs.
👁️ Vision Models in 2025
What is a Multimodal LLM?
A classic LLM takes text as input and produces text as output. A multimodal LLM also accepts images (and sometimes audio or video) as input.
Classic LLM: Text → Text
Multimodal LLM: Text + Image → Text
The model "sees" the image through a visual encoder (often a Vision Transformer, ViT) that converts the image into tokens understandable by the LLM.
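Because the image becomes tokens, it has a token cost just like text. Anthropic documents the approximation tokens ≈ (width × height) / 750 for images within the size limits; a quick sketch (treat the constant as model-specific and subject to change):

```python
def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate image token cost for Claude models.

    Anthropic's docs give tokens ~= (width * height) / 750 for images
    within the supported resolution limits.
    """
    return (width * height) // 750
```

A 1000×1000 image therefore costs on the order of 1,333 tokens of input.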
Main Models
| Model | Publisher | Max Resolution | Strengths | Price (input) |
|---|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | 8000×8000 | Detailed analysis, reasoning | ~$3/M tokens |
| Claude 3.5 Haiku | Anthropic | 8000×8000 | Fast, good value | ~$0.80/M tokens |
| GPT-4o | OpenAI | 2048×2048 | Versatile, excellent OCR | ~$2.50/M tokens |
| GPT-4o-mini | OpenAI | 2048×2048 | Budget-friendly, good | ~$0.15/M tokens |
| Gemini 2.0 Flash | Google | Very high | Large context, fast | ~$0.10/M tokens |
| Gemini 1.5 Pro | Google | Very high | Native video, 2M tokens | ~$1.25/M tokens |
| Llama 3.2 Vision | Meta | 1120×1120 | Open source, local | Free (self-hosted) |
How to Choose?
- Need precise OCR? → GPT-4o or Claude 3.5 Sonnet
- Need detailed analysis? → Claude 3.5 Sonnet
- Tight budget? → GPT-4o-mini or Gemini Flash
- Sensitive data (local)? → Llama 3.2 Vision
- Video analysis? → Gemini 1.5 Pro
- High volume? → Gemini Flash or GPT-4o-mini
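This decision list can be encoded as a tiny routing table. The mapping below simply mirrors the guidance above; the model identifiers are illustrative, not an official recommendation:

```python
# Illustrative routing table mirroring the guidance above.
MODEL_BY_NEED = {
    "ocr": "gpt-4o",
    "detailed_analysis": "claude-3-5-sonnet-20241022",
    "budget": "gpt-4o-mini",
    "local": "llama-3.2-vision",
    "video": "gemini-1.5-pro",
    "high_volume": "gemini-2.0-flash",
}

def pick_model(need: str) -> str:
    # Fall back to a versatile default when the need is unknown.
    return MODEL_BY_NEED.get(need, "gpt-4o")
```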
🔍 Concrete Use Cases
1. OCR — Extracting Text from Images
Classic OCR engines like Tesseract work best on clean, printed text. Vision LLMs understand context: they can read a receipt, a handwritten table, or a screenshot with a complex layout.
Example: Extracting data from an invoice
import anthropic
import base64
client = anthropic.Anthropic()
# Load the image
with open("invoice.png", "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": image_data,
},
},
{
"type": "text",
"text": (
"Extract all information from this invoice "
"in JSON format: number, date, supplier, "
"lines (description, quantity, unit price, total), "
"total before tax, VAT, total including tax."
)
}
],
}
],
)
print(message.content[0].text)
Typical result:
{
"invoice_number": "IN-2025-0142",
"date": "2025-01-15",
"supplier": "TechServ SARL",
"lines": [
{
"description": "VPS Standard Hosting",
"quantity": 1,
"unit_price": 29.99,
"total": 29.99
},
{
"description": ".fr Domain Name",
"quantity": 1,
"unit_price": 12.00,
"total": 12.00
}
],
"total_before_tax": 41.99,
"vat": 8.40,
"total_including_tax": 50.39
}
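In practice the model sometimes wraps its JSON in a Markdown fence, so it pays to parse defensively before feeding the result to downstream code. A minimal sketch (the fence-stripping heuristic is an assumption, not part of any SDK):

```python
import json

def parse_model_json(text: str) -> dict:
    """Extract a JSON object from a model reply, tolerating Markdown fences."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (possibly "```json") and the closing fence.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)
```

For stricter guarantees, validate the parsed dict against a schema before using it.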
Comparison of classic OCR vs. LLM Vision:
| Criterion | Tesseract (OCR) | LLM Vision |
|---|---|---|
| Simple printed text | ✅ Excellent | ✅ Excellent |
| Handwritten text | ❌ Poor | ✅ Good |
| Complex tables | ❌ Often fails | ✅ Understands structure |
| Semantic context | ❌ None | ✅ Understands meaning |
| Multilingual | ⚡ With config | ✅ Native |
| Cost | Free | Paid (API) |
| Speed | ⚡ Very fast | 🐢 Slower |
2. Photo Analysis — Understanding Visual Content
Vision LLMs don't just read text; they understand what they see: objects, people, scenes, emotions, style.
Example: Analyzing a product photo
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": (
"Analyze this product photo for an e-commerce site. "
"Provide:\n"
"1. Short description (1 sentence)\n"
"2. Long description (SEO, 100 words)\n"
"3. Tags (5-10 keywords)\n"
"4. Dominant colors\n"
"5. Suggestions for improving the photo"
)
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/product.jpg",
"detail": "high" # high, low, or auto
}
}
],
}
],
max_tokens=500
)
print(response.choices[0].message.content)
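The `detail` parameter directly affects cost. Per OpenAI's published accounting for GPT-4o-class models, `low` costs a flat 85 tokens, while `high` scales the image to fit 2048×2048, shrinks the shortest side to at most 768 px, then charges 170 tokens per 512-px tile plus a base 85. A sketch (the constants come from OpenAI's docs but may change between models):

```python
import math

def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o image token cost from the documented tiling rules."""
    if detail == "low":
        return 85  # flat cost: the image is processed at low resolution
    # 1. Scale the image to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2. Scale again so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3. Charge 170 tokens per 512-px tile, plus a base cost of 85.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 1024×1024 image in `high` detail is scaled to 768×768, which covers 4 tiles: 85 + 4 × 170 = 765 tokens.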
3. Design Review and Interface Analysis
A powerful use case: having a UI/UX design reviewed by a vision LLM.
import anthropic
import base64
client = anthropic.Anthropic()
with open("design.png", "rb") as f:
img = base64.standard_b64encode(f.read()).decode("utf-8")
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=2048,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": img,
},
},
{
"type": "text",
"text": """Analyze this web interface design.
Rate out of 10:
1. Visual hierarchy
2. Readability
3. Spacing consistency
4. Accessibility (contrast, text size)
5. Responsive readiness
For each point < 8/10, suggest a concrete improvement.
Identify any potential UX issues."""
}
],
}
],
)
print(message.content[0].text)
Typical result:
## UX Analysis of the Design
### 1. Visual Hierarchy: 7/10
The main title is visible, but secondary CTAs have the same visual weight as the primary CTA.
→ **Suggestion**: reduce the size of secondary buttons, increase the contrast of the primary button.
### 2. Readability: 8/10
Good font size, correct line spacing.
Slight lack of contrast on light gray text (#999) on white background.
→ **Suggestion**: change gray to #666 minimum (4.5:1 ratio).
...
4. Visual QA — Detecting Interface Bugs
Compare a design mockup with a screenshot of the implementation:
import anthropic
import base64
client = anthropic.Anthropic()
def load_image(path):
with open(path, "rb") as f:
return base64.standard_b64encode(f.read()).decode("utf-8")
design = load_image("figma_design.png")
implementation = load_image("prod_screenshot.png")
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=2048,
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Image 1: the Figma design (reference)"
},
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": design,
},
},
{
"type": "text",
"text": "Image 2: the screenshot of the implementation"
},
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": implementation,
},
},
{
"type": "text",
"text": """Compare the design (image 1) and the implementation (image 2).
List ALL visual differences:
- Different colors
- Incorrect spacing
- Missing or extra elements
- Different fonts
- Misaligned elements
- Different sizes
Format: table with columns Element | Design | Implementation | Severity (high/medium/low)"""
}
],
}
],
)
print(message.content[0].text)
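The interleaved caption/image pattern above generalizes to any number of screenshots. A small helper (hypothetical, not part of the Anthropic SDK; it assumes PNG input) can build the `content` list:

```python
import base64

def labeled_image_content(labeled_images):
    """Build an Anthropic `content` list alternating captions and PNG images.

    `labeled_images` is a list of (caption, png_bytes) tuples.
    """
    content = []
    for caption, png_bytes in labeled_images:
        content.append({"type": "text", "text": caption})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.standard_b64encode(png_bytes).decode("utf-8"),
            },
        })
    return content
```

The final comparison prompt is then appended as one more `{"type": "text", ...}` item.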
💻 Using Vision with APIs
Anthropic API (Claude)
import anthropic
import base64
import httpx
client = anthropic.Anthropic()
# Method 1: Image in base64
with open("image.png", "rb") as f:
image_base64 = base64.standard_b64encode(f.read()).decode("utf-8")
# Method 2: Image from URL
image_url = "https://example.com/image.jpg"
image_data = base64.standard_b64encode(
httpx.get(image_url).content
).decode("utf-8")
message = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": image_data,
},
},
{
"type": "text",
"text": "Describe this image in detail."
}
],
}
],
)
Formats supported by Claude: JPEG, PNG, GIF, WebP
Max size: 5 MB per image via the API, up to 100 images per request
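Validating images client-side against these limits avoids wasted API calls. A sketch that works on raw bytes (the extension-based media-type mapping is a simplification that trusts the file name, and the byte limit is passed in rather than hard-coded since it may change):

```python
import os

# Claude-supported image media types, keyed by file extension.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def validate_image(filename: str, data: bytes, max_bytes: int) -> str:
    """Return the media type for an image, or raise if it is unusable."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext}")
    if len(data) > max_bytes:
        raise ValueError(f"Image is {len(data)} bytes, limit is {max_bytes}")
    return MEDIA_TYPES[ext]
```

The returned media type can be passed straight into the `source` block of the request.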
OpenAI API (GPT-4V / GPT-4o)
from openai import OpenAI
client = OpenAI()
# Method 1: Direct URL
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What do you see?"},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg",
"detail": "high" # low, high, auto
}
},
],
}
],
max_tokens=300,
)
# Method 2: Base64
import base64
with open("image.png", "rb") as f:
b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Analyze this image."},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{b64}"
}
},
],
}
],
max_tokens=300,
)
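The base64 method amounts to building a data URL. A small helper (hypothetical) makes that reusable across image formats using the standard-library `mimetypes` module:

```python
import base64
import mimetypes

def to_data_url(filename: str, data: bytes) -> str:
    """Encode raw image bytes as a data URL for the OpenAI vision API."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {filename}")
    b64 = base64.b64encode(data).decode()
    return f"data:{mime};base64,{b64}"
```

The result plugs directly into `{"type": "image_url", "image_url": {"url": to_data_url(...)}}`.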