Working with Multimodal AI: Images, Code, and Documents in Claude

Ever wanted to drop a screenshot into an AI and have it actually understand what you're looking at? Or paste a messy document and get instant analysis? Welcome to multimodal AI, where Claude doesn't just read your words: it sees your images, processes your documents, and reasons about your code, all in one go.
This isn't science fiction. It's Claude's bread and butter. But like most powerful tools, there are gotchas, limits, and tricks that separate "it works" from "it works well." Let me walk you through the real story of working with images, code, and documents in Claude.
Table of Contents
- What Is Multimodal AI (And Why Should You Care)?
- Supported Image Formats and Basic Limits
- The Hidden Cost: Token Calculation for Images
- The Image Cost Reference Table
- Sending Images via API: The Code
- Processing Documents and PDFs
- Option 1: Extract Text from PDFs
- Option 2: Convert PDF Pages to Images
- Working with Code in Prompts
- Multi-Image Analysis Patterns
- Understanding Vision Limitations (The Gotchas)
- Best Practices: Getting the Most Out of Multimodal
- Putting It Together: A Real-World Example
- Summary and Next Steps
What Is Multimodal AI (And Why Should You Care)?
Multimodal means Claude can process multiple types of input simultaneously: text, images, PDFs, and code. Instead of describing a chart or transcribing a screenshot, you just send it. Claude's vision model analyzes the image directly and returns insights.
The practical upside? You save time describing things, reduce errors from manual transcription, and unlock analysis patterns that text-only systems can't handle. A screenshot of a UI bug? Claude can see it. A messy PDF form? Claude extracts the data. A code snippet with weird formatting? Claude parses it.
The catch? Images cost tokens. You need to know the limits. And not everything works perfectly. Claude's vision has blind spots, just like yours.
Supported Image Formats and Basic Limits
Claude accepts four image formats: JPEG, PNG, GIF, and WebP. No SVG, no TIFF, no exotic formats. Keep it simple.
Here's what you need to know about limits:
- API requests: Up to 100 images per request
- claude.ai interface: Up to 20 images per message
- Image size: Any dimension works, but bigger images cost more tokens (and very large images may be scaled down automatically before analysis)
- File size: Roughly 5 MB per image via the API; oversized files are rejected, and large files take longer to process
The format choice matters less than you think. Use PNG for diagrams (lossless), JPEG for photos (smaller file), and WebP if you want modern compression. GIF works but rarely beats PNG/JPEG.
If you're thinking "100 images per API request sounds like a lot," it is. Most real-world use cases hit the claude.ai limit (20 images) or just process a handful at a time for clarity.
The Hidden Cost: Token Calculation for Images
Here's where most people stumble. Every image costs tokens. The formula is dead simple but unintuitive:
Image tokens = (width × height) / 750
Let me show you what this means in practice.
A 1000×1000 image costs:
- (1000 × 1000) / 750 = 1,333 tokens
A 2000×1500 image costs:
- (2000 × 1500) / 750 = 4,000 tokens
A screenshot (say, 1920×1080) costs:
- (1920 × 1080) / 750 = 2,765 tokens
These aren't text tokens. They're vision tokens. But they count toward your rate limits and your costs. A single large image can cost as much as several hundred words of text.
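The formula is easy to wrap in a small helper so you can estimate costs before uploading anything (a quick sketch; the function name is mine):

```python
def image_token_estimate(width: int, height: int) -> int:
    """Estimate vision tokens for an image: (width * height) / 750."""
    return round(width * height / 750)

# The examples above:
print(image_token_estimate(1000, 1000))  # 1333
print(image_token_estimate(2000, 1500))  # 4000
print(image_token_estimate(1920, 1080))  # 2765
```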
The Image Cost Reference Table
| Dimensions | Aspect Ratio | Tokens | Use Case |
|---|---|---|---|
| 800×600 | 4:3 | 640 | Small screenshot |
| 1024×768 | 4:3 | 1,049 | Medium diagram |
| 1280×720 | 16:9 | 1,229 | HD preview |
| 1920×1080 | 16:9 | 2,765 | Full screenshot |
| 2048×1536 | 4:3 | 4,194 | High-res image |
| 4096×3072 | 4:3 | 16,777 | Professional photo |
The pattern is obvious: resize before you upload. Crop out irrelevant parts. A 1280×720 screenshot instead of 1920×1080 saves you over 1,500 tokens per image. Process 10 images? You just saved 15,000 tokens.
This is the hidden knowledge that experienced practitioners know. Beginners send massive screenshots. Smart users downscale.
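Downscaling takes only a few lines with Pillow (a sketch, assuming Pillow is installed; the helper name is mine):

```python
from PIL import Image

def downscale(path: str, out_path: str, max_size=(1280, 720)) -> tuple:
    """Shrink an image to fit within max_size, preserving aspect ratio."""
    img = Image.open(path)
    img.thumbnail(max_size)  # resizes in place, never upscales
    img.save(out_path)
    return img.size
```

For example, `downscale("screenshot.png", "screenshot_small.png")` turns a 1920×1080 capture into 1280×720, cutting the token cost from ~2,765 to ~1,229.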
Sending Images via API: The Code
Let's see how this actually works. Here's the basic pattern for sending an image to Claude:
```python
import anthropic
import base64

client = anthropic.Anthropic()

# Read image file and encode to base64
with open("screenshot.png", "rb") as image_file:
    image_data = base64.standard_b64encode(image_file.read()).decode("utf-8")

# Send to Claude with vision
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "What's in this image? Describe what you see."
                },
            ],
        }
    ],
)

print(message.content[0].text)
```

Let's break down what's happening here:
- Read and encode: Load the image file and base64-encode it. This is required for the API.
- Build the message: Create a message with multiple content blocks (an image block and a text block).
- Specify media type: Tell Claude whether it's PNG, JPEG, GIF, or WebP.
- Send the request: Pass it to Claude like any other message.
The response comes back as text in message.content[0].text. Claude sees the image and responds to your prompt about it.
What's important to understand: You're not uploading the image to a server somewhere. The entire flow happens in your request: base64 encode locally, send to API, get response. It's fast.
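If you send images often, it helps to build those content blocks with a helper that also picks the right `media_type` from the file extension (a sketch; `MEDIA_TYPES` and `image_block` are my own names):

```python
import base64
from pathlib import Path

# Map file extensions to the four media types Claude accepts
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

def image_block(path: str) -> dict:
    """Build an image content block for the Messages API from a file path."""
    ext = Path(path).suffix.lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext}")
    data = base64.standard_b64encode(Path(path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": MEDIA_TYPES[ext], "data": data},
    }
```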
Processing Documents and PDFs
Now, what about PDFs and other documents? This is where people get confused.
Historically, Claude couldn't accept PDF files directly: there was no "type": "pdf" in the message format. (Newer API versions add a "document" content block for base64-encoded PDFs, but the two approaches below work everywhere and give you more control.) So what do you do?
You have two options:
Option 1: Extract Text from PDFs
For PDFs that are mostly text, extract the text and send it as a regular text message:
```python
import anthropic
import PyPDF2

# Extract text from PDF
with open("document.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[
        {
            "role": "user",
            "content": f"Here's a PDF I converted to text:\n\n{text}\n\nSummarize the key points."
        }
    ],
)

print(message.content[0].text)
```

This works great for text-heavy PDFs: reports, contracts, manuals, anything with clean text extraction. Claude can analyze thousands of words of text without breaking a sweat.
Option 2: Convert PDF Pages to Images
For PDFs with layouts, images, or scanned documents, convert pages to images first:
```python
import anthropic
import base64
from pdf2image import convert_from_path

client = anthropic.Anthropic()

# Convert PDF to images (one per page)
images = convert_from_path("document.pdf")

for page_num, image in enumerate(images):
    # Save each page to disk as a PNG
    image_path = f"page_{page_num}.png"
    image.save(image_path)

    # Encode to base64
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    # Send to Claude
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": f"This is page {page_num + 1}. Extract all text and describe the layout."
                    },
                ],
            }
        ],
    )

    print(f"Page {page_num + 1}:\n{message.content[0].text}\n")
```

This handles scanned documents, fancy layouts, or PDFs with embedded images. You convert each page to an image, send it to Claude, and let the vision model do the heavy lifting.
Which approach? Text extraction is cheaper (fewer tokens). Image conversion is more accurate for complex layouts. Use text extraction first, switch to images if Claude misses something.
Working with Code in Prompts
Here's something people don't always realize: Claude's multimodal abilities extend to how you present code. You can send a screenshot of code, paste raw code, or even send a code file.
Raw code in a text block is usually best:
````python
import anthropic

client = anthropic.Anthropic()

# Claude can reason about code directly
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": """Review this function for bugs:

```python
def calculate_discount(price, discount_percent):
    return price * (1 - discount_percent)
```

Is this correct?"""
        }
    ],
)

print(message.content[0].text)
````

Claude will parse the code block and analyze it. No image needed. This is the fastest approach.
But if you have a screenshot of code (maybe from an IDE or a GitHub page), you can send that too:
```python
# Claude can also read code from screenshots
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,  # your base64-encoded screenshot
                    },
                },
                {
                    "type": "text",
                    "text": "What does this code do? Are there any issues?"
                },
            ],
        }
    ],
)
```

When should you use a screenshot instead of raw code? When the visual formatting matters, like analyzing IDE output, debugging build errors with color highlighting, or reviewing code with git diff colors. Otherwise, paste raw code. It's cheaper and clearer.
Multi-Image Analysis Patterns
The real power of multimodal AI unlocks when you process multiple images together. This is where you compare diagrams, analyze sequences, or correlate visual evidence.
Here's the pattern:
```python
import anthropic
import base64

client = anthropic.Anthropic()

# Prepare multiple images
images_data = []
image_files = ["before.png", "after.png", "diff.png"]

for file in image_files:
    with open(file, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    images_data.append(image_data)

# Build message with multiple images
content = []
for img_data in images_data:
    content.append({
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": img_data,
        },
    })

# Add text after all images
content.append({
    "type": "text",
    "text": """Compare these three images:

1. before.png - Original state
2. after.png - After changes
3. diff.png - Highlighted differences

What changed? Are the changes correct?"""
})

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": content}]
)

print(message.content[0].text)
```

Notice the structure: build a content array with multiple image blocks, then add your text question. Claude processes all images in context and can compare them, correlate them, and reason about their relationships.
This pattern works for:
- Before/after analysis: UI changes, design iterations, bug fixes
- Multi-step sequences: Screenshots from a workflow or tutorial
- Comparative analysis: Two competing designs, different implementations
- Evidence correlation: Multiple views of the same problem
The limit is 100 images per request (API) or 20 (claude.ai). In practice, 3-5 images is the sweet spot. More than that and your prompt becomes expensive and Claude's response gets unfocused.
Understanding Vision Limitations (The Gotchas)
Claude's vision is powerful but imperfect. Here are the real limitations you'll hit:
No face identification: Claude explicitly cannot identify or name people in images. It can describe that a person is present, their approximate position, their clothing, but not who they are. This is intentional and non-negotiable.
Small text is unreliable: If text is smaller than about 12 pixels, Claude struggles to read it. Screenshots of dense code or tiny UI text will cause problems. Zoom in or increase font size before sending.
Spatial reasoning is fuzzy: Claude can describe what's in an image, but exact measurements, precise coordinates, or complex spatial relationships are hit-or-miss. It's not a replacement for image analysis libraries like OpenCV.
Optical character recognition (OCR) limitations: Claude can read printed text, but handwriting, cursive, and ornate fonts often fail. Poor-quality scans struggle too.
No color precision: Claude can describe colors generally ("red", "dark blue") but won't nail exact hex values or subtle color differences. Use a color picker tool for precision work.
Image quality matters: Blurry, low-contrast, or heavily compressed images are hard to analyze. Use original quality when possible.
No video: Claude processes static images only. No video files, no animated GIF analysis (though you can send GIFs, they're treated as static).
The pattern here? Claude's vision is great for understanding meaning ("what is this?", "what's wrong?", "does this look right?") but not for measuring or identifying. Use vision for analysis. Use traditional tools for precision.
Best Practices: Getting the Most Out of Multimodal
Now that you understand the mechanics and limits, here's how to actually use this effectively:
Resize images before sending: A 1920×1080 screenshot costs 2,765 tokens. Scale it down to 1280×720 and you're at 1,229 tokens, a saving of more than half on large batches.
Be specific in your prompt: "What's in this image?" wastes Claude's analysis. "What are the error messages in this screenshot?" focuses the response. Specific prompts = better results.
Include context in text: Don't rely on Claude to infer. If you're asking Claude to review a UI, tell it what you're testing. "This is a login form. The email field is not validating properly. Can you spot the issue?"
Group related images: Send 3-5 related images in one request rather than single images in separate requests. It's cheaper and Claude can correlate them.
Use text extraction for PDFs first: Try text extraction before converting to images. 80% of the time, text extraction works and costs way less.
Test on small batches first: If you're processing 50 images, test with 2-3 first. See what quality you get, what Claude understands. Then scale up.
Monitor token usage: The vision tokens add up fast. A batch of 20 high-res images can cost 50,000+ tokens. Know your budget.
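One way to monitor ahead of time is to estimate a batch's cost before sending it: read each file's dimensions with Pillow and apply the (width × height) / 750 formula (a sketch; assumes Pillow is installed, and the helper name is mine):

```python
from PIL import Image

def batch_token_estimate(paths: list) -> int:
    """Sum the estimated vision tokens for a batch of image files."""
    total = 0
    for path in paths:
        with Image.open(path) as img:
            width, height = img.size
        total += round(width * height / 750)
    return total
```

Run it over a directory of screenshots before a big job and you'll know instantly whether you're about to spend 5,000 tokens or 50,000.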
Putting It Together: A Real-World Example
Let's imagine you're auditing a website for accessibility issues. You take 5 screenshots of different pages, resize them to 1280×720 (which, per the table above, cuts the token cost of a full-HD capture by more than half), and want Claude to spot problems.
```python
import anthropic
import base64

client = anthropic.Anthropic()

# Your 5 screenshots, pre-resized
screenshots = ["page1.png", "page2.png", "page3.png", "page4.png", "page5.png"]

content = []

# Add each screenshot
for filename in screenshots:
    with open(filename, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")
    content.append({
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": image_data,
        },
    })

# Add the analysis request
content.append({
    "type": "text",
    "text": """Review these 5 website screenshots for accessibility issues:

Focus on:
- Color contrast ratios
- Text sizing and readability
- Button/link sizing for mobile
- Missing alt text indicators
- Form field labeling

For each page, list specific issues and severity (critical/major/minor)."""
})

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{"role": "user", "content": content}]
)

print(message.content[0].text)
```

That's the whole template. You're sending 5 images (about 6,145 tokens) plus your prompt. Claude analyzes all of them in context and gives you a comprehensive audit. Repeat across 50 pages? That's 10 requests at roughly 6,500 tokens each. Less than a full novel of text.
Compare that to manually auditing 50 pages or hiring an accessibility consultant. Multimodal AI pays for itself fast.
Summary and Next Steps
Claude's multimodal capabilities let you process images, documents, and code alongside text. The key takeaways:
- Images cost tokens: (width × height) / 750. Resize before sending.
- Supported formats: JPEG, PNG, GIF, WebP. No PDFs directly. Extract text or convert to images.
- API limits: 100 images per request. claude.ai: 20 images per message.
- Vision limitations: No face identification, small text struggles, spatial reasoning is fuzzy.
- Best practice: Resize, be specific in prompts, batch related images, test first.
The hidden knowledge? Most people send full-resolution screenshots and waste tokens. Most miss that text extraction is cheaper than image conversion. Most don't realize you can send 5 images in one request and let Claude correlate them. Now you know better.