Working with Multimodal AI: Images, Code, and Documents in Claude

Ever wanted to drop a screenshot into an AI and have it actually understand what you're looking at? Or paste a messy document and get instant analysis? Welcome to multimodal AI, where Claude doesn't just read your words: it sees your images, processes your documents, and reasons about your code, all in one go.
This isn't science fiction. It's Claude's bread and butter. But like most powerful tools, there are gotchas, limits, and tricks that separate "it works" from "it works well." Let me walk you through the real story of working with images, code, and documents in Claude.
Table of Contents
- What Is Multimodal AI (And Why Should You Care)?
- Supported Image Formats and Basic Limits
- The Hidden Cost: Token Calculation for Images
- The Image Cost Reference Table
- Sending Images via API: The Code
- Processing Documents and PDFs
- Option 1: Extract Text from PDFs
- Option 2: Convert PDF Pages to Images
- Working with Code in Prompts
- Multi-Image Analysis Patterns
- Understanding Vision Limitations (The Gotchas)
- Best Practices: Getting the Most Out of Multimodal
- Putting It Together: A Real-World Example
- Summary and Next Steps
What Is Multimodal AI (And Why Should You Care)?
Multimodal means Claude can process multiple types of input simultaneously: text, images, PDFs, and code. Instead of describing a chart or transcribing a screenshot, you just send it. Claude's vision model analyzes the image directly and returns insights.
The practical upside? You save time describing things, reduce errors from manual transcription, and unlock analysis patterns that text-only systems can't handle. A screenshot of a UI bug? Claude can see it. A messy PDF form? Claude extracts the data. A code snippet with weird formatting? Claude parses it.
The catch? Images cost tokens. You need to know the limits. And not everything works perfectly. Claude's vision has blind spots, just like yours.
Supported Image Formats and Basic Limits
Claude accepts four image formats: JPEG, PNG, GIF, and WebP. No SVG, no TIFF, no exotic formats. Keep it simple.
Here's what you need to know about limits:
- API requests: Up to 100 images per request
- claude.ai interface: Up to 20 images per message
- Image size: Any dimension works, but bigger images cost more tokens (and very large images may be scaled down automatically before analysis)
- File size: Roughly 5 MB per image via the API; oversized files are rejected, and large files take longer to process
The format choice matters less than you think. Use PNG for diagrams (lossless), JPEG for photos (smaller file), and WebP if you want modern compression. GIF works but rarely beats PNG/JPEG.
If you're thinking "100 images per API request sounds like a lot," it is. Most real-world use cases hit the claude.ai limit (20 images) or just process a handful at a time for clarity.
The Hidden Cost: Token Calculation for Images
Here's where most people stumble. Every image costs tokens. The formula is dead simple but unintuitive:
Image tokens = (width × height) / 750
Let me show you what this means in practice.
A 1000×1000 image costs:
- (1000 × 1000) / 750 = 1,333 tokens
A 2000×1500 image costs:
- (2000 × 1500) / 750 = 4,000 tokens
A screenshot (say, 1920×1080) costs:
- (1920 × 1080) / 750 = 2,765 tokens
These aren't text tokens. They're vision tokens. But they count toward your rate limits and your costs. A single large image can cost as much as several hundred words of text.
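The formula is easy to wrap in a small helper so you can estimate costs before uploading anything (a quick sketch; the function name is mine):

```python
def image_token_estimate(width: int, height: int) -> int:
    """Estimate vision tokens for an image: (width * height) / 750."""
    return round(width * height / 750)

# The examples above:
print(image_token_estimate(1000, 1000))  # 1333
print(image_token_estimate(2000, 1500))  # 4000
print(image_token_estimate(1920, 1080))  # 2765
```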
The Image Cost Reference Table
| Dimensions | Aspect Ratio | Tokens | Use Case |
|---|---|---|---|
| 800×600 | 4:3 | 640 | Small screenshot |
| 1024×768 | 4:3 | 1,049 | Medium diagram |
| 1280×720 | 16:9 | 1,229 | HD preview |
| 1920×1080 | 16:9 | 2,765 | Full screenshot |
| 2048×1536 | 4:3 | 4,194 | High-res image |
| 4096×3072 | 4:3 | 16,777 | Professional photo |
The pattern is obvious: resize before you upload. Crop out irrelevant parts. A 1280×720 screenshot instead of 1920×1080 saves you over 1,500 tokens per image. Process 10 images? You just saved 15,000 tokens.
This is the hidden knowledge that experienced practitioners know. Beginners send massive screenshots. Smart users downscale.
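Downscaling takes only a few lines with Pillow (a sketch, assuming Pillow is installed; the helper name is mine):

```python
from PIL import Image

def downscale(path: str, out_path: str, max_size=(1280, 720)) -> tuple:
    """Shrink an image to fit within max_size, preserving aspect ratio."""
    img = Image.open(path)
    img.thumbnail(max_size)  # resizes in place, never upscales
    img.save(out_path)
    return img.size
```

For example, `downscale("screenshot.png", "screenshot_small.png")` turns a 1920×1080 capture into 1280×720, cutting the token cost from ~2,765 to ~1,229.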
Sending Images via API: The Code
Let's see how this actually works. Here's the basic pattern for sending an image to Claude:
```python
import anthropic
import base64

client = anthropic.Anthropic()

# Read image file and encode to base64
with open("screenshot.png", "rb") as image_file:
    image_data = base64.standard_b64encode(image_file.read()).decode("utf-8")

# Send to Claude with vision
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "What's in this image? Describe what you see."
                },
            ],
        }
    ],
)

print(message.content[0].text)
```

Let's break down what's happening here:
- Read and encode: Load the image file and base64-encode it. This is required for the API.
- Build the message: Create a message with multiple content blocks (an image block and a text block).
- Specify media type: Tell Claude whether it's PNG, JPEG, GIF, or WebP.
- Send the request: Pass it to Claude like any other message.
The response comes back as text in message.content[0].text. Claude sees the image and responds to your prompt about it.
What's important to understand: You're not uploading the image to a server somewhere. The entire flow happens in your request: base64 encode locally, send to API, get response. It's fast.
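If you send images often, it helps to build those content blocks with a helper that also picks the right `media_type` from the file extension (a sketch; `MEDIA_TYPES` and `image_block` are my own names):

```python
import base64
from pathlib import Path

# Map file extensions to the four media types Claude accepts
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def image_block(path: str) -> dict:
    """Build an image content block for the Messages API from a file path."""
    ext = Path(path).suffix.lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": MEDIA_TYPES[ext], "data": data},
    }
```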
Processing Documents and PDFs
Now, what about PDFs and other documents? This is where people get confused.
Historically, Claude couldn't accept PDF files directly: there was no "type": "pdf" in the message format. (Newer API versions add a "document" content block for base64-encoded PDFs, but the two approaches below work everywhere and give you more control.) So what do you do?
You have two options:
Option 1: Extract Text from PDFs
For PDFs that are mostly text, extract the text and send it as a regular text message:
```python
import anthropic
import PyPDF2

# Extract text from PDF
with open("document.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": f"Here's a PDF I converted to text:\n\n{text}\n\nSummarize the key points."
        }
    ],
)

print(message.content[0].text)
```

This works great for text-heavy PDFs: reports, contracts, manuals, anything with clean text extraction. Claude can analyze thousands of words of text without breaking a sweat.
Option 2: Convert PDF Pages to Images
For PDFs with layouts, images, or scanned documents, convert pages to images first:
```python
import anthropic
import base64
from pdf2image import convert_from_path

client = anthropic.Anthropic()

# Convert PDF to images (one per page)
images = convert_from_path("document.pdf")

for page_num, image in enumerate(images):
    # Save each page to disk as a PNG
    image_path = f"page_{page_num}.png"
    image.save(image_path)

    # Encode to base64
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    # Send to Claude
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": f"This is page {page_num + 1}. Extract all text and describe the layout."
                    },
                ],
            }
        ],
    )

    print(f"Page {page_num + 1}:\n{message.content[0].text}\n")
```

This handles scanned documents, fancy layouts, or PDFs with embedded images. You convert each page to an image, send it to Claude, and let the vision model do the heavy lifting.
Which approach? Text extraction is cheaper (fewer tokens). Image conversion is more accurate for complex layouts. Use text extraction first, switch to images if Claude misses something.
Working with Code in Prompts
Here's something people don't always realize: Claude's multimodal abilities extend to how you present code. You can send a screenshot of code, paste raw code, or even send a code file.
Raw code in a text block is usually best:
````python
import anthropic

client = anthropic.Anthropic()

# Claude can reason about code directly
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": """Review this function for bugs:

```python
def calculate_discount(price, discount_percent):
    return price * (1 - discount_percent)
```

Is this correct?"""
        }
    ],
)

print(message.content[0].text)
````

Claude will parse the code block and analyze it. No image needed. This is the fastest approach.
But if you have a screenshot of code (maybe from an IDE or a GitHub page), you can send that too:
```python
# Claude can also read code from screenshots
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,  # your base64-encoded screenshot
                    },
                },
                {
                    "type": "text",
                    "text": "What does this code do? Are there any issues?"
                },
            ],
        }
    ],
)
```

When should you use a screenshot instead of raw code? When the visual formatting matters, like analyzing IDE output, debugging build errors with color highlighting, or reviewing code with git diff colors. Otherwise, paste raw code. It's cheaper and clearer.
Multi-Image Analysis Patterns
The real power of multimodal AI unlocks when you process multiple images together. This is where you compare diagrams, analyze sequences, or correlate visual evidence.
Here's the pattern:
```python
import anthropic
import base64

client = anthropic.Anthropic()

# Prepare multiple images
images_data = []
image_files = ["before.png", "after.png", "diff.png"]

for file in image_files:
    with open(file, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    images_data.append(image_data)

# Build message with multiple images
content = []
for img_data in images_data:
    content.append({
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": img_data,
        },
    })

# Add text after all images
content.append({
    "type": "text",
    "text": """Compare these three images:

1. before.png - Original state
2. after.png - After changes
3. diff.png - Highlighted differences

What changed? Are the changes correct?"""
})

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": content}]
)

print(message.content[0].text)
```

Notice the structure: build a content array with multiple image blocks, then add your text question. Claude processes all images in context and can compare them, correlate them, and reason about their relationships.
This pattern works for:
- Before/after analysis: UI changes, design iterations, bug fixes
- Multi-step sequences: Screenshots from a workflow or tutorial
- Comparative analysis: Two competing designs, different implementations
- Evidence correlation: Multiple views of the same problem
The limit is 100 images per request (API) or 20 (claude.ai). In practice, 3-5 images is the sweet spot. More than that and your prompt becomes expensive and Claude's response gets unfocused.
Understanding Vision Limitations (The Gotchas)
Claude's vision is powerful but imperfect. Here are the real limitations you'll hit:
No face identification: Claude explicitly cannot identify or name people in images. It can describe that a person is present, their approximate position, their clothing, but not who they are. This is intentional and non-negotiable.
Small text is unreliable: If text is smaller than about 12 pixels, Claude struggles to read it. Screenshots of dense code or tiny UI text will cause problems. Zoom in or increase font size before sending.
Spatial reasoning is fuzzy: Claude can describe what's in an image, but exact measurements, precise coordinates, or complex spatial relationships are hit-or-miss. It's not a replacement for image analysis libraries like OpenCV.
Optical character recognition (OCR) limitations: Claude can read printed text, but handwriting, cursive, and ornate fonts often fail. Poor-quality scans struggle too.
No color precision: Claude can describe colors generally ("red", "dark blue") but won't nail exact hex values or subtle color differences. Use a color picker tool for precision work.
Image quality matters: Blurry, low-contrast, or heavily compressed images are hard to analyze. Use original quality when possible.
No video: Claude processes static images only. No video files, no animated GIF analysis (though you can send GIFs, they're treated as static).
The pattern here? Claude's vision is great for understanding meaning ("what is this?", "what's wrong?", "does this look right?") but not for measuring or identifying. Use vision for analysis. Use traditional tools for precision.
Best Practices: Getting the Most Out of Multimodal
Now that you understand the mechanics and limits, here's how to actually use this effectively:
Resize images before sending: A 1920×1080 screenshot costs 2,765 tokens. Scale it down to 1280×720 and you're at 1,229 tokens, a saving of more than half on large batches.
Be specific in your prompt: "What's in this image?" wastes Claude's analysis. "What are the error messages in this screenshot?" focuses the response. Specific prompts = better results.
Include context in text: Don't rely on Claude to infer. If you're asking Claude to review a UI, tell it what you're testing. "This is a login form. The email field is not validating properly. Can you spot the issue?"
Group related images: Send 3-5 related images in one request rather than single images in separate requests. It's cheaper and Claude can correlate them.
Use text extraction for PDFs first: Try text extraction before converting to images. 80% of the time, text extraction works and costs way less.
Test on small batches first: If you're processing 50 images, test with 2-3 first. See what quality you get, what Claude understands. Then scale up.
Monitor token usage: The vision tokens add up fast. A batch of 20 high-res images can cost 50,000+ tokens. Know your budget.
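One way to monitor ahead of time is to estimate a batch's cost before sending it: read each file's dimensions with Pillow and apply the (width × height) / 750 formula (a sketch; assumes Pillow is installed, and the helper name is mine):

```python
from PIL import Image

def batch_token_estimate(paths: list) -> int:
    """Sum the estimated vision tokens for a batch of image files."""
    total = 0
    for path in paths:
        with Image.open(path) as img:
            width, height = img.size
        total += round(width * height / 750)
    return total
```

Run it over a directory of screenshots before a big job and you'll know instantly whether you're about to spend 5,000 tokens or 50,000.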
Putting It Together: A Real-World Example
Let's imagine you're auditing a website for accessibility issues. You take 5 screenshots of different pages, resize them to 1280×720 (which, per the table above, cuts the token cost of a full-HD capture by more than half), and want Claude to spot problems.
```python
import anthropic
import base64

client = anthropic.Anthropic()

# Your 5 screenshots, pre-resized
screenshots = ["page1.png", "page2.png", "page3.png", "page4.png", "page5.png"]

content = []

# Add each screenshot
for filename in screenshots:
    with open(filename, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    content.append({
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": image_data,
        },
    })

# Add the analysis request
content.append({
    "type": "text",
    "text": """Review these 5 website screenshots for accessibility issues:

Focus on:
- Color contrast ratios
- Text sizing and readability
- Button/link sizing for mobile
- Missing alt text indicators
- Form field labeling

For each page, list specific issues and severity (critical/major/minor)."""
})

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{"role": "user", "content": content}]
)

print(message.content[0].text)
```

That's the whole template. You're sending 5 images (about 6,145 tokens) plus your prompt. Claude analyzes all of them in context and gives you a comprehensive audit. Repeat across 50 pages? That's 10 requests at roughly 6,500 tokens each. Less than a full novel of text.
Compare that to manually auditing 50 pages or hiring an accessibility consultant. Multimodal AI pays for itself fast.
Summary and Next Steps
Claude's multimodal capabilities let you process images, documents, and code alongside text. The key takeaways:
- Images cost tokens: (width × height) / 750. Resize before sending.
- Supported formats: JPEG, PNG, GIF, WebP. No PDFs directly. Extract text or convert to images.
- API limits: 100 images per request. claude.ai: 20 images per message.
- Vision limitations: No face identification, small text struggles, spatial reasoning is fuzzy.
- Best practice: Resize, be specific in prompts, batch related images, test first.
The hidden knowledge? Most people send full-resolution screenshots and waste tokens. Most miss that text extraction is cheaper than image conversion. Most don't realize you can send 5 images in one request and let Claude correlate them. Now you know better.