LLM Integration: OpenAI and Anthropic APIs from Python

We are living through one of the most significant technology shifts in decades. Large language models have gone from academic curiosities to production infrastructure in just a few years, and the developers who understand how to integrate them effectively are in enormous demand right now. Whether you think of LLMs as a fancy autocomplete or as a genuine reasoning engine, the commercial reality is clear: these models are already embedded in customer-facing products, internal tools, and automated pipelines at companies of every size. The question is no longer whether your organization will use them, but how well your team can build with them.
The LLM revolution did not happen overnight, and it is worth understanding what changed. Earlier generations of NLP required massive labeled datasets, custom training pipelines, and deep ML expertise just to build a basic classifier. Then transformer-based models scaled up, and suddenly you could prompt your way to capabilities that would have taken months of specialized engineering just a few years ago. The release of GPT-3 in 2020 was a turning point. Developers realized they could describe a task in plain English and get usable output without writing a single training loop. By the time GPT-4 and Claude arrived, the API-first model had won: you pay for inference, you send prompts, you get completions. No GPUs required, no model weights to manage.
What this means for you as a Python developer is that LLM integration is now a core skill, right alongside database queries and REST API consumption. You need to know how to authenticate with these services, how to structure requests, how to handle errors gracefully, and how to think about costs before they spiral. You also need to understand the meaningful differences between providers, because OpenAI and Anthropic are not interchangeable. They have different API designs, different model personalities, different pricing models, and different strengths. Building a production application means making real choices about which provider fits which part of your system.
You've got a killer idea for an app powered by a large language model. Maybe it's a customer support chatbot, a content generator, or a reasoning engine that helps your team make better decisions. The problem? You need to actually integrate an LLM API into your Python application, and you're not sure whether to reach for OpenAI or Anthropic, or how to handle streaming, cost estimation, and error handling when things inevitably break.
This article walks you through both APIs side-by-side, from initial setup through streaming, structured output, tool calling, and production hardening, showing you how to build integrations that handle the real complexities: rate limiting, retry logic, and keeping your API keys safe. By the end, you will have working patterns for both providers and enough context to make informed architectural decisions for your own projects.
Table of Contents
- Why Both APIs Matter
- Getting Started: API Keys and Environment Setup
- Installing the Official Clients
- OpenAI: Chat Completions Basics
- Anthropic: Messages API
- API Design Differences
- Streaming: Real-Time Token Delivery
- Prompt Engineering Basics
- Prompt Engineering: System Prompts and Few-Shot Learning
- Structured Output: JSON and Tool Calling
- Streaming and Token Management
- Vision and Multimodal Inputs
- Async API Calls for Throughput
- Rate Limiting, Retries, and Error Handling
- Token Counting and Cost Estimation
- Common LLM Integration Mistakes
- Building a Wrapper for Easy Switching
- Practical Tips for Production
- Migration Checklist: OpenAI → Anthropic (or vice versa)
- Summary
Why Both APIs Matter
Here's the thing: OpenAI and Anthropic have different strengths. OpenAI's GPT-4 excels at instruction-following and multimodal tasks. Anthropic's Claude emphasizes safety and reasoning with Constitutional AI. For serious work, you might use both: different tasks call for different tools.
Both offer REST APIs and Python clients, both support streaming, and both charge by tokens. But the implementation details? They're different enough that we need to look at concrete code.
Getting Started: API Keys and Environment Setup
Before any integration work, you need credentials. The good news is that both providers have streamlined their signup processes, and you can have keys in hand within minutes. The important thing at this stage is to establish good security habits from day one: secrets that get committed to version control have a way of being discovered and abused, and rotating compromised keys is a painful process you want to avoid entirely.
For OpenAI:
- Create an account at openai.com
- Navigate to API keys
- Generate a new API key
- Store it in your environment
For Anthropic:
- Create an account at anthropic.com
- Navigate to the console
- Generate an API key
- Store it in your environment
Both providers recommend using environment variables instead of hardcoding keys. Here's the pattern:
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
if not OPENAI_API_KEY or not ANTHROPIC_API_KEY:
    raise ValueError("Missing API keys in environment")
This pattern loads your .env file at application startup and then reads the keys from the environment. The validation check at the bottom is important: you want your app to fail fast with a clear error at startup rather than discovering a missing key during a production request three hours later.
Use a .env file (never commit it):
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
And add .env to your .gitignore. Non-negotiable.
Installing the Official Clients
Both providers maintain official Python packages. Install them:
pip install openai anthropic python-dotenv
The openai package gives you the OpenAI client. The anthropic package gives you the Anthropic client. Both handle HTTP under the hood; you get clean, Pythonic interfaces.
Both clients are actively maintained and released frequently. In production, pin your versions in requirements.txt to prevent unexpected breaking changes from upstream updates; both providers do occasionally make changes to their clients that require code updates on your end.
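A pinned requirements.txt might look like this (version numbers are illustrative; check PyPI for the current releases before pinning):

```
openai==1.35.0
anthropic==0.30.0
python-dotenv==1.0.1
```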
OpenAI: Chat Completions Basics
Let's start simple. The Chat Completions endpoint is OpenAI's primary interface for interacting with GPT models, and understanding its structure will make every subsequent pattern easier to follow. The key insight is that OpenAI models are designed around a conversation metaphor: you provide a list of messages with roles, and the model responds as if it is continuing that conversation.
Here's how you make a basic API call to OpenAI:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    temperature=0.7,
    max_tokens=150
)
print(response.choices[0].message.content)
Key parameters:
- model: Which GPT version (gpt-4, gpt-4-turbo, gpt-3.5-turbo)
- messages: List of dicts with role and content keys
- temperature: Controls randomness (0.0 = deterministic, 1.0 = creative)
- max_tokens: Hard cap on output length
The response object has a structured format: response.choices[0].message.content gives you the text. The choices list exists because OpenAI supports generating multiple candidate responses in a single call using the n parameter, though for most applications you will request exactly one choice and index directly into choices[0].
Anthropic: Messages API
Anthropic's approach is similar but structured differently. The design philosophy is slightly different: Anthropic treats the system prompt as a first-class concept that lives outside the conversation turn structure, which makes the role of system instructions more explicit and easier to reason about. Here's the equivalent:
from anthropic import Anthropic
client = Anthropic(api_key=ANTHROPIC_API_KEY)
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=150,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ]
)
print(message.content[0].text)
Notice the differences:
- system is a top-level parameter, not part of messages
- The Messages API uses client.messages.create() instead of chat.completions.create()
- Response structure: message.content[0].text instead of response.choices[0].message.content
Why? Anthropic's Messages API is simpler and more explicit. The system prompt is conceptually separate from conversation history, so it lives at the top level. The content field on the response is a list because Anthropic's API supports multiple content blocks in a response: text, tool-use requests, and other types can all appear together. That is why you access the first item with [0] and then read its .text attribute.
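Because content is a list of blocks, grabbing content[0].text blindly can fail once tool-use blocks enter the picture. A safer habit is to join only the text blocks. This is a sketch: collect_text is a hypothetical helper, and the Block dataclass stands in for the SDK's duck-typed block objects (which also expose .type and .text):

```python
from dataclasses import dataclass

@dataclass
class Block:
    """Stand-in for an SDK content block (duck-typed: .type and .text)."""
    type: str
    text: str = ""

def collect_text(blocks):
    """Join the text of all text-type blocks, skipping tool_use and others."""
    return "".join(b.text for b in blocks if b.type == "text")

# Works the same on a real response: collect_text(message.content)
blocks = [Block("text", "Hello, "), Block("tool_use"), Block("text", "world.")]
print(collect_text(blocks))  # Hello, world.
```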
API Design Differences
Understanding why these two APIs feel different will help you use both of them more effectively and avoid subtle bugs when switching between providers. The differences are not arbitrary; they reflect distinct design philosophies about how humans and models should interact.
OpenAI's Chat Completions API was designed around the metaphor of a chat interface, which is why everything, including the system prompt, lives inside the messages list. This makes it easy to construct few-shot examples inline and gives you a uniform data structure to reason about. The response uses choices (plural) to support the n parameter, which can generate multiple candidate completions in one call. The entire response structure is optimized for flexibility and composability.
Anthropic's Messages API takes a cleaner separation-of-concerns approach. The system parameter is intentionally top-level because system instructions are not part of the conversation; they are meta-instructions that configure the model's behavior for the entire session. Keeping it separate makes the API contract clearer and reduces the risk of accidental system prompt injection from user-controlled content. The content field being a list of blocks rather than a single string is forward-looking design that handles multi-modal inputs and outputs cleanly.
In practice, these differences show up most clearly in three places: how you handle system prompts (top-level vs. first message), how you parse responses (message.content[0].text vs. response.choices[0].message.content), and how you format tool results when doing function calling. The wrapper pattern we cover later in this article abstracts over these differences so your application code does not have to care which provider is underneath.
Streaming: Real-Time Token Delivery
When you're building user-facing apps, streaming matters. Nobody wants to stare at a loading spinner while 2,000 tokens are generated. Both APIs support streaming: tokens arrive as the model produces them. The user experience difference between streaming and non-streaming is dramatic: with streaming, the interface feels responsive and alive; without it, every request feels like a page load.
OpenAI streaming:
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Write a short story about a robot learning to cook."}
    ],
    stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
You get a generator of chunk objects. Each chunk's .delta.content contains the new tokens. The flush=True ensures real-time display. Note the guard on chunk.choices[0].delta.content: the first and last chunks often have None content, so you need to check before printing to avoid writing the string "None" to your output.
Anthropic streaming:
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[
        {"role": "user", "content": "Write a short story about a robot learning to cook."}
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Anthropic wraps the stream in a context manager and gives you .text_stream, which is simpler and more Pythonic. The SDK accumulates the full message as it streams; you can also call stream.get_final_message() to retrieve the complete response after the stream finishes, which is useful when you need both the streamed output for the UI and the full metadata (token counts, stop reason) for logging.
Prompt Engineering Basics
The quality of what you get from an LLM is almost entirely determined by the quality of what you send it. Prompt engineering is the discipline of crafting inputs that reliably produce the outputs you want, and it is genuinely a skill that compounds over time. The good news is that the fundamentals are straightforward, and applying them consistently will dramatically improve your results.
The most important principle is specificity. Vague prompts produce vague outputs. If you want a two-paragraph explanation, say "in two paragraphs." If you want bullet points, say "as a bulleted list." If you want the model to maintain a particular tone or persona, describe it explicitly in your system prompt. The model is not reading your mind; it is pattern-matching against your instructions, so the clearer your instructions, the better the match.
Role prompting is one of the most effective techniques in your toolkit. When you tell the model "You are an expert Python developer with 15 years of experience reviewing production code," you prime it to access the patterns and vocabulary associated with that role. This is not magic: the model has seen enormous amounts of text written by experts, and the role prompt activates those patterns. The difference in output quality between a generic prompt and a well-crafted role prompt can be striking.
Chain-of-thought prompting is another essential technique, especially for reasoning-heavy tasks. Adding "Think step by step before answering" or "Work through this problem carefully, showing your reasoning" causes the model to produce intermediate reasoning steps before arriving at an answer. This dramatically improves accuracy on math problems, logical reasoning, and multi-step planning tasks. The model's extended thinking process effectively becomes part of the output, and you can either display it to the user or strip it out before presenting the final answer.
Few-shot examples work on both APIs the same way: you show the model a few input-output pairs that demonstrate the pattern you want, then give it a new input to complete. This is particularly effective for classification, extraction, and formatting tasks where showing is more efficient than telling. The conversation structure of both APIs makes few-shot prompting natural, you just alternate between user and assistant messages to build up your examples before presenting the real input.
Prompt Engineering: System Prompts and Few-Shot Learning
Both APIs let you craft system prompts: instructions that shape the model's behavior. Here's how few-shot learning works with both. Few-shot examples are especially valuable when you need consistent formatting or when the task is easier to demonstrate than to describe verbally.
OpenAI example (classification task):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Respond with only 'positive', 'negative', or 'neutral'."
        },
        {"role": "user", "content": "I love this product!"},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "This is terrible."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "It's okay, nothing special."},
        {"role": "assistant", "content": "neutral"},
        {"role": "user", "content": "Best purchase I've ever made."}
    ],
    temperature=0.0
)
print(response.choices[0].message.content)
Anthropic equivalent:
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=10,
    temperature=0.0,
    system="You are a sentiment classifier. Respond with only 'positive', 'negative', or 'neutral'.",
    messages=[
        {"role": "user", "content": "I love this product!"},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "This is terrible."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "It's okay, nothing special."},
        {"role": "assistant", "content": "neutral"},
        {"role": "user", "content": "Best purchase I've ever made."}
    ]
)
print(message.content[0].text)
The structure mirrors the API design: OpenAI mixes system and examples in messages; Anthropic keeps system separate. Both work. Few-shot examples help the model learn patterns without retraining. Notice also that temperature=0.0 and a tight max_tokens budget enforce consistency for classification tasks: you want deterministic outputs here, not creative variation.
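If you reuse few-shot prompts across tasks, the alternating structure is easy to factor into a small helper so the examples live as plain data. This is a sketch; build_few_shot is a hypothetical name, not part of either SDK:

```python
def build_few_shot(examples, query):
    """Turn (input, output) pairs into alternating user/assistant messages,
    ending with the real query as the final user turn."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

examples = [
    ("I love this product!", "positive"),
    ("This is terrible.", "negative"),
]
msgs = build_few_shot(examples, "Best purchase I've ever made.")
# Pass msgs to either client; supply the system prompt per provider convention.
print(len(msgs))  # 5
```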
Structured Output: JSON and Tool Calling
Sometimes you need the model to return structured data: JSON, database records, whatever. Both APIs support this. Structured outputs are critical when you're building workflows where the LLM feeds data into other systems with no human validation, just automated processing. If a model returns something you cannot parse, your pipeline breaks, so the ability to enforce output format is not a nice-to-have; it is a reliability requirement.
OpenAI with JSON mode:
import json
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "user", "content": "Extract the name, age, and city as JSON from: 'Alice is 28 and lives in Portland.'"}
    ],
    response_format={"type": "json_object"}
)
output = json.loads(response.choices[0].message.content)
print(output)  # {"name": "Alice", "age": 28, "city": "Portland"}
OpenAI's JSON mode guarantees syntactically valid JSON output (as long as the response is not cut off by max_tokens). This is huge for reliability: no more parsing errors from slightly-malformed responses. Note that JSON mode requires the word "JSON" to appear somewhere in your messages; the API rejects the request otherwise, which is why the prompt above asks for JSON explicitly.
Anthropic with structured output via prefill:
import json
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[
        {"role": "user", "content": "Extract the name, age, and city from: 'Alice is 28 and lives in Portland.' Return valid JSON."},
        {"role": "assistant", "content": "{"}
    ]
)
# Claude continues the JSON since you started it
full_json = "{" + message.content[0].text
output = json.loads(full_json)
print(output)
Anthropic's approach: you prefill the assistant's message to guide the format. It's more flexible but requires more finesse. You're essentially saying "start the JSON; now you finish it." Claude respects this constraint and builds on your prefix. This technique works because Claude is trained to continue patterns you establish: an opening brace signals clearly that you want a JSON object.
A better Anthropic pattern using system prompts:
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system="You are a data extractor. Always respond with valid JSON only, no markdown, no explanations.",
    messages=[
        {"role": "user", "content": "Extract name, age, city from: 'Alice is 28 and lives in Portland.'"}
    ]
)
output = json.loads(message.content[0].text)
print(output)
This pattern is cleaner. The system prompt sets expectations, and Claude delivers JSON directly. No prefilling tricks needed. The key phrase "no markdown, no explanations" is important: without it, Claude will sometimes wrap its JSON in a code block, which breaks json.loads().
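If you want belt-and-suspenders parsing on top of the system prompt, a small helper can strip a stray markdown fence before handing the text to json.loads(). This is a sketch; parse_json_reply is a hypothetical helper, not an SDK feature:

```python
import json
import re

def parse_json_reply(text):
    """Parse model output as JSON, tolerating a ```json ... ``` wrapper."""
    cleaned = text.strip()
    # Strip an optional markdown fence like ```json ... ``` or ``` ... ```
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", cleaned, re.DOTALL)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)

print(parse_json_reply('```json\n{"name": "Alice", "age": 28}\n```'))
# {'name': 'Alice', 'age': 28}
```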
Tool calling (function calling):
Tool calling is where LLMs become truly agentic. You define functions, describe their purpose, and the model decides whether and when to call them. This enables multi-step reasoning and external tool interaction: the LLM is no longer just generating text; it's orchestrating workflows. The mental model shift here is significant: instead of treating the LLM as a text transformer, you are treating it as a reasoning agent that can decide to take actions in the world.
OpenAI tool calling:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name or coordinates"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_flights",
            "description": "Search for flights between two cities",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string"},
                    "destination": {"type": "string"},
                    "date": {"type": "string", "description": "YYYY-MM-DD format"}
                },
                "required": ["origin", "destination", "date"]
            }
        }
    }
]
messages = [{"role": "user", "content": "I want to visit Boston next week. Is the weather good? And find me flights from NYC."}]
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)
# Keep the assistant's tool-call turn in the history
messages.append(response.choices[0].message)
# Process tool calls
for tool_call in response.choices[0].message.tool_calls:
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)
    print(f"Model wants to call: {name}({args})")
    # Execute the function (your implementation)
    if name == "get_weather":
        result = get_weather_impl(args["location"], args.get("unit", "fahrenheit"))
    elif name == "search_flights":
        result = search_flights_impl(args["origin"], args["destination"], args["date"])
    # Send the result back as a tool message for further reasoning
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": str(result)
    })
Anthropic's equivalent (Messages API with tools):
tools = [
    {
        "name": "get_weather",
        "description": "Get the weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name or coordinates"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
]
messages = [
    {"role": "user", "content": "I want to visit Boston next week. Is the weather good?"}
]
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=messages
)
# Check for tool use in the response
for content_block in response.content:
    if content_block.type == "tool_use":
        tool_name = content_block.name
        tool_input = content_block.input
        print(f"Claude wants to call: {tool_name}({tool_input})")
        # Execute your function
        result = get_weather_impl(tool_input["location"], tool_input.get("unit", "fahrenheit"))
        # Send result back: echo the assistant turn, then a tool_result block
        messages.append({"role": "assistant", "content": response.content})
        messages.append({
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": content_block.id,
                    "content": str(result)
                }
            ]
        })
Key differences:
- OpenAI uses a tool_calls attribute; Anthropic uses content_block.type == "tool_use"
- Anthropic schemas use input_schema; OpenAI uses parameters
- Anthropic requires tool results wrapped in tool_result blocks with the tool's ID
- Both models iterate until they decide they have enough information to answer
This pattern enables complex workflows: the LLM reasons about what it needs, calls tools, gets results, and continues reasoning.
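As a sketch, that iterate-until-done control flow looks like this. The model is stubbed out with a fake call_model function so the loop itself is visible; in real code the stub is your chat.completions.create or messages.create call, and the stop condition comes from the response's stop reason. All names here are hypothetical:

```python
def run_tool_loop(call_model, execute_tool, messages, max_turns=5):
    """Generic agent loop: call the model, run any requested tool,
    feed the result back, and stop when the model gives a final answer."""
    for _ in range(max_turns):
        reply = call_model(messages)  # model decides: answer or tool call
        if reply["type"] == "final":
            return reply["text"]
        # Model asked for a tool: execute it and append the result
        result = execute_tool(reply["tool"], reply["args"])
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    raise RuntimeError("Agent did not finish within max_turns")

# Fake model: asks for weather once, then answers.
def fake_model(messages):
    if any("Tool result" in m["content"] for m in messages):
        return {"type": "final", "text": "It's sunny in Boston."}
    return {"type": "tool", "tool": "get_weather", "args": {"location": "Boston"}}

def fake_tool(name, args):
    return "sunny, 72F"

answer = run_tool_loop(fake_model, fake_tool, [{"role": "user", "content": "Weather in Boston?"}])
print(answer)  # It's sunny in Boston.
```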
Streaming and Token Management
Streaming is more than a UX enhancement; it is also a tool for managing the economics of LLM API calls. When you stream, you can implement early stopping if the model starts going in the wrong direction, you can enforce length limits by counting tokens on the fly, and you can provide real-time feedback to users instead of making them wait for a complete response. Understanding streaming deeply makes your integrations both more responsive and more cost-efficient.
On the token management side, both providers meter your usage in input tokens (what you send) and output tokens (what the model generates). Output tokens are typically two to four times more expensive than input tokens across both OpenAI and Anthropic's pricing, which means the most effective cost optimization is usually to tighten your max_tokens setting. You are billed only for tokens actually generated, but a loose cap lets a runaway generation get expensive, and the providers may count your requested max_tokens against rate limits. If you are building a task where the answer should be at most 200 words, set max_tokens=300 rather than max_tokens=4096.
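As a rough sketch, per-call cost is just a multiply-and-add over the usage numbers. The prices below are placeholders, not real rates; look up your model's current pricing before relying on anything like this:

```python
# Placeholder prices in USD per 1M tokens -- check your provider's pricing page.
PRICES = {
    "example-model": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the dollar cost of one call from its token usage."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10,000 input tokens + 500 output tokens at the placeholder rates:
print(round(estimate_cost("example-model", 10_000, 500), 4))  # 0.0375
```

Feed it the usage object from a response (input_tokens/output_tokens for Anthropic, prompt_tokens/completion_tokens for OpenAI) and accumulate per request to get a running spend total.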
Streaming also interacts with token counting in a useful way: most providers return usage information at the end of the stream. With OpenAI, the final chunk contains a usage object if you pass stream_options={"include_usage": True}. With Anthropic, calling stream.get_final_message() after the stream closes returns the full message object including usage.input_tokens and usage.output_tokens. Building cost tracking that works with streaming requires capturing this final metadata rather than trying to count tokens yourself.
For conversational applications, you also need to manage context window size. Every time you continue a conversation, you are sending the full history as input tokens. A conversation that has been going for 30 turns can easily consume 10,000+ input tokens per request. Two strategies help here: first, periodically summarize older turns into a compact summary and drop the raw history; second, set a maximum history length and truncate from the oldest end, keeping recent context and the system prompt intact. Both approaches trade off some coherence for reduced cost and latency.
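The truncation strategy can be a few lines. This sketch (truncate_history is a hypothetical helper) works on an OpenAI-style list where the system prompt is the first message; with Anthropic you would keep the separate system parameter and truncate only messages:

```python
def truncate_history(messages, max_turns=10):
    """Keep the system message (if present) plus the last max_turns messages."""
    if messages and messages[0]["role"] == "system":
        system, rest = messages[:1], messages[1:]
    else:
        system, rest = [], messages
    return system + rest[-max_turns:]

history = [{"role": "system", "content": "Be brief."}]
for i in range(30):
    history.append({"role": "user", "content": f"turn {i}"})
    history.append({"role": "assistant", "content": f"reply {i}"})

trimmed = truncate_history(history, max_turns=10)
print(len(trimmed))  # 11: system prompt + the 10 most recent messages
```

One caveat: truncate on message boundaries so you never drop a user turn while keeping its assistant reply, which can confuse the model.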
Vision and Multimodal Inputs
Both APIs support images. You can pass images to the model for analysis, OCR, diagram understanding, anything visual. Multimodal capability opens up an entirely new category of application: automated screenshot analysis, document processing, chart interpretation, and visual QA systems that would have required specialized computer vision pipelines just a few years ago.
OpenAI with images:
import base64
# Load image from file
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
    model="gpt-4o",  # a vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)
print(response.choices[0].message.content)
You can also pass URLs directly: "url": "https://example.com/image.jpg". URL-based image passing is more efficient than base64 when the images are already hosted: you save the bandwidth of encoding and transmitting the full image in your request body.
Anthropic with images:
import base64
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "What do you see in this image?"
                }
            ]
        }
    ]
)
print(message.content[0].text)
Anthropic also supports URLs: "source": {"type": "url", "url": "https://example.com/image.jpg"}. Note that this example places the image before the text in the content list; while the order may not always matter, placing context before questions is a general best practice that tends to improve response quality.
The underlying pattern is the same: mixed-content messages with text and images. Both models handle diagrams, screenshots, charts, and photos. GPT-4 Vision excels at detailed visual analysis; Claude 3.5 handles it well too.
Async API Calls for Throughput
Processing batches of requests? Use async to make calls in parallel. Both clients support async. This is critical when latency adds up: if each request takes 2 seconds and you have 100 requests, sequential processing is 200 seconds. Async brings it down to maybe 3–5 seconds depending on your rate limits.
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
async def process_request(prompt):
    response = await async_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    prompts = ["What is AI?", "Define machine learning.", "Explain deep learning."]
    # Run all requests concurrently
    results = await asyncio.gather(
        *[process_request(p) for p in prompts],
        return_exceptions=True  # Don't fail if one request errors
    )
    for prompt, result in zip(prompts, results):
        if isinstance(result, Exception):
            print(f"Error for '{prompt}': {result}")
        else:
            print(f"{prompt}: {result}")

asyncio.run(main())
The return_exceptions=True flag ensures one failed request doesn't crash the entire batch. This is especially important when processing hundreds of items: you do not want a single rate limit error to discard all the work that succeeded before it.
Anthropic async:
from anthropic import AsyncAnthropic
async_client = AsyncAnthropic(api_key=ANTHROPIC_API_KEY)
async def process_request(prompt):
    message = await async_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

async def main():
    prompts = ["What is AI?", "Define machine learning.", "Explain deep learning."]
    results = await asyncio.gather(
        *[process_request(p) for p in prompts],
        return_exceptions=True
    )
    for result in results:
        print(result)

asyncio.run(main())
Practical tip: Batch processing with semaphores
If you're processing thousands of requests, rate limits become real. Use a semaphore to limit concurrent requests:
async def batch_process_with_limit(prompts, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_request(prompt):
        async with semaphore:
            return await process_request(prompt)

    return await asyncio.gather(
        *[bounded_request(p) for p in prompts],
        return_exceptions=True
    )

# Respects rate limits: max 5 concurrent requests
prompts = [f"Summarize document {i}" for i in range(1000)]
results = asyncio.run(batch_process_with_limit(prompts))
This prevents overwhelming the API and keeps you under rate limits. Tune max_concurrent based on your tier limits: both OpenAI and Anthropic publish rate limit tables by API tier, and you can request limit increases once your usage justifies it.
Rate Limiting, Retries, and Error Handling
APIs fail. Networks hiccup. Models hit rate limits. Production code handles all of this gracefully, or you'll wake up to a paging alarm. The difference between code that handles failures gracefully and code that crashes is the difference between a system that recovers automatically and one that requires manual intervention at 3 AM.
import time
from openai import OpenAI, RateLimitError, APIError, APIConnectionError, APITimeoutError
client = OpenAI(api_key=OPENAI_API_KEY)
def call_with_retry(prompt, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=150
            )
            return response.choices[0].message.content
        except RateLimitError as e:
            # Hit rate limit: back off and retry
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limited (attempt {attempt + 1}). Waiting {delay}s...")
                time.sleep(delay)
            else:
                print(f"Max retries exceeded. Rate limit error: {e}")
                raise
        except APITimeoutError as e:
            # Request timed out: might be worth retrying.
            # (Catch this before APIConnectionError, its parent class.)
            if attempt < max_retries - 1:
                print(f"Timeout (attempt {attempt + 1}). Retrying...")
                time.sleep(base_delay * (2 ** attempt))
            else:
                print(f"Timeout after {max_retries} retries: {e}")
                raise
        except APIConnectionError as e:
            # Network issue: usually transient, retry
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)
                print(f"Connection error. Retrying in {delay}s...")
                time.sleep(delay)
            else:
                print(f"Connection failed after {max_retries} retries: {e}")
                raise
        except APIError as e:
            # Generic API error: usually not retryable
            print(f"Unrecoverable API error: {e}")
            raise
    # Should never reach here, but just in case
    raise RuntimeError("Unexpected retry loop exit")
result = call_with_retry("What is quantum computing?", max_retries=3)
print(result)This implementation distinguishes between different error types:
- RateLimitError: You've exceeded your rate limit. Back off and retry.
- APIConnectionError: Network issue. Transient. Retry.
- APITimeoutError: Request took too long. Might succeed on retry.
- APIError: Catch-all. Usually not retryable (auth failures, invalid requests, etc.).
Exponential backoff (1s, 2s, 4s, ...) prevents hammering the API while it recovers. Check both providers' HTTP response codes if you want to handle specific error types more precisely: a 429 is always a rate limit, a 500 may be a transient server error worth retrying, and a 400 is almost always a bug in your request that no amount of retrying will fix.
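That status-code logic can be captured in a small helper. This is an illustrative sketch, not an SDK utility; `is_retryable` is a name invented here, and you would feed it the status code attached to the SDK's status errors.

```python
def is_retryable(status_code: int) -> bool:
    """Decide whether an HTTP status code is worth retrying."""
    if status_code == 429:        # rate limited: back off and retry
        return True
    if 500 <= status_code < 600:  # server error: often transient
        return True
    return False                  # other 4xx: fix the request instead

# Usage
print(is_retryable(429))  # True
print(is_retryable(503))  # True
print(is_retryable(400))  # False
```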
Anthropic error handling (similar pattern):
```python
import time
from anthropic import Anthropic, RateLimitError, APIError, APIConnectionError

client = Anthropic(api_key=ANTHROPIC_API_KEY)

def call_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=150,
                messages=[{"role": "user", "content": prompt}]
            )
            return message.content[0].text
        except RateLimitError:
            if attempt < max_retries - 1:
                delay = 2 ** attempt
                time.sleep(delay)
            else:
                raise
        except (APIConnectionError, APIError):
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                raise
```

Production-grade retry with jitter:
For serious deployments, add jitter to prevent thundering herd (all clients retrying simultaneously):
```python
import random
import time
from openai import OpenAI, RateLimitError

client = OpenAI(api_key=OPENAI_API_KEY)

def call_with_jitter_retry(prompt, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=150
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt < max_retries - 1:
                # Exponential backoff + random jitter
                delay = base_delay * (2 ** attempt)
                jitter = random.uniform(0, delay * 0.1)  # 0–10% jitter
                time.sleep(delay + jitter)
            else:
                raise
```

The jitter prevents all retries from happening at the exact same moment. This matters most in high-concurrency scenarios where dozens of requests might hit a rate limit simultaneously and then all retry at the same second, causing a second wave of rate limit errors.
Token Counting and Cost Estimation
You're paying per token. There is no free lunch and no unlimited budget: tokens add up fast. Count them before spending money, and track costs in production. A single customer asking for a 10,000-word response can cost you dollars. Scale that to thousands of users and you've got real money on the line. Cost management is not an afterthought; it belongs in your architecture from day one.
OpenAI token counting:
```python
from openai import OpenAI
import tiktoken

client = OpenAI(api_key=OPENAI_API_KEY)
encoding = tiktoken.encoding_for_model("gpt-4")

# Estimate tokens before calling the API
prompts = [
    "Explain quantum computing in detail.",
    "Write a 500-word essay on climate change.",
    "Summarize the history of Python in 100 words."
]
for prompt in prompts:
    tokens = encoding.encode(prompt)
    print(f"Prompt: '{prompt[:50]}...' → {len(tokens)} tokens")

# Actual API call with cost tracking
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing in detail."}],
    max_tokens=500
)
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens
print(f"Input: {input_tokens}, Output: {output_tokens}, Total: {total_tokens}")

# Cost calculation (rates change; check OpenAI's pricing page)
# GPT-4: $0.03 per 1K input, $0.06 per 1K output
input_cost = (input_tokens / 1000) * 0.03
output_cost = (output_tokens / 1000) * 0.06
total_cost = input_cost + output_cost
print(f"Input cost: ${input_cost:.6f}")
print(f"Output cost: ${output_cost:.6f}")
print(f"Total cost: ${total_cost:.6f}")
```

Anthropic token counting:
Anthropic doesn't ship a local tokenizer the way OpenAI ships tiktoken, but you can estimate before the call and verify after it. For pre-call estimation, a rough rule of thumb is 4 characters per token, which gives you a ballpark figure without an API call. For exact numbers, rely on the usage metadata that comes back with every response.
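A minimal estimator built on that rule of thumb. The name `estimate_tokens` and the 4-characters-per-token constant are an approximation for budgeting, not anything the provider publishes:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 characters-per-token heuristic."""
    return max(1, round(len(text) / chars_per_token))

prompt = "Explain quantum computing in detail."
print(estimate_tokens(prompt))  # 9 (36 characters / 4)
```

Treat the result as a ballpark for cost planning; the usage metadata on the response is the ground truth.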
```python
from anthropic import Anthropic

client = Anthropic(api_key=ANTHROPIC_API_KEY)

# Make the call; exact token counts come back in the usage metadata
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[{"role": "user", "content": "Explain quantum computing in detail."}]
)
input_tokens = message.usage.input_tokens
output_tokens = message.usage.output_tokens
print(f"Input: {input_tokens}, Output: {output_tokens}")

# Anthropic pricing (subject to change, check their docs)
# Claude 3.5 Sonnet: $3 per 1M input, $15 per 1M output
input_cost = (input_tokens / 1_000_000) * 3
output_cost = (output_tokens / 1_000_000) * 15
total_cost = input_cost + output_cost
print(f"Input cost: ${input_cost:.8f}")
print(f"Output cost: ${output_cost:.8f}")
print(f"Total cost: ${total_cost:.8f}")
```

Cost tracking in production:
```python
import json
from datetime import datetime, timezone

def track_cost(model, input_tokens, output_tokens, prompt_summary=""):
    # Pricing per model (rates change; keep this table current)
    pricing = {
        "gpt-4": {"input": 0.03, "output": 0.06},              # per 1K tokens
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},  # per 1K tokens
        "claude-3-5-sonnet": {"input": 3, "output": 15}        # per 1M tokens
    }
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    rates = pricing[model]
    if "claude" in model:
        # Anthropic pricing is per million
        input_cost = (input_tokens / 1_000_000) * rates["input"]
        output_cost = (output_tokens / 1_000_000) * rates["output"]
    else:
        # OpenAI pricing is per thousand
        input_cost = (input_tokens / 1000) * rates["input"]
        output_cost = (output_tokens / 1000) * rates["output"]
    total_cost = input_cost + output_cost
    # Log for billing/analytics
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        "cost_usd": round(total_cost, 6),
        "prompt_summary": prompt_summary
    }
    print(json.dumps(log_entry))
    return total_cost

# Usage
track_cost("gpt-4", 150, 200, "Quantum computing explanation")
track_cost("claude-3-5-sonnet", 150, 200, "Quantum computing explanation")
```

Pro tip: Set max_tokens conservatively. If a user can request unlimited output, they can bankrupt you. Cap it per request and per user per day.
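One way to sketch both caps is a small in-memory budget guard. The names (`grant_tokens`) and limits here are illustrative; production would back the counter with Redis or a database rather than a process-local dict:

```python
from collections import defaultdict
from datetime import date

MAX_TOKENS_PER_REQUEST = 500   # per-request cap
DAILY_TOKEN_BUDGET = 50_000    # per-user daily cap (illustrative numbers)

_usage = defaultdict(int)      # (user_id, date) -> tokens granted today

def grant_tokens(user_id, requested_tokens):
    """Clamp a request to the per-request cap and enforce the daily budget."""
    granted = min(requested_tokens, MAX_TOKENS_PER_REQUEST)
    key = (user_id, date.today())
    if _usage[key] + granted > DAILY_TOKEN_BUDGET:
        raise RuntimeError(f"Daily token budget exhausted for {user_id}")
    _usage[key] += granted
    return granted  # pass this as max_tokens in the API call

# A 10,000-token request gets clamped to the per-request cap
print(grant_tokens("user-42", 10_000))  # 500
```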
Common LLM Integration Mistakes
Even experienced developers make the same mistakes when integrating LLMs for the first time. Learning to recognize these patterns will save you significant debugging time and embarrassing production incidents.
The most expensive mistake is not setting max_tokens. Both APIs will let the model run to its full context window if you omit this parameter, which can mean 4,096 or even 128,000 tokens for a single response. If your application allows user-controlled prompts without a max_tokens guard, a single malicious or accidentally verbose request can generate a response that costs dollars rather than fractions of a cent. Always set this, always.
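To make the risk concrete, here is the worst-case arithmetic for a single uncapped 4K completion, using the GPT-4 output rate quoted elsewhere in this article (rates change; check current pricing before relying on these numbers):

```python
OUTPUT_RATE_PER_1K = 0.06  # GPT-4 output rate used in this article's examples

uncapped_tokens = 4096     # model allowed to run to a 4K completion
capped_tokens = 150        # explicit max_tokens

uncapped_cost = uncapped_tokens / 1000 * OUTPUT_RATE_PER_1K
capped_cost = capped_tokens / 1000 * OUTPUT_RATE_PER_1K
print(f"uncapped: ${uncapped_cost:.4f} per response")  # $0.2458
print(f"capped:   ${capped_cost:.4f} per response")    # $0.0090
```

Roughly 27x more per response, multiplied across every request your users send.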
The second common mistake is hardcoding model names without a configuration layer. Models get updated, deprecated, and repriced frequently. If you have "gpt-4" scattered across fifty files in your codebase and OpenAI releases a superior "gpt-4-turbo" at lower cost, updating it becomes a find-and-replace operation that is guaranteed to miss something. Store model names in a configuration file or environment variable and reference that single source of truth throughout your code.
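A minimal sketch of that single source of truth, assuming environment variables named `LLM_MODEL` and `LLM_MODEL_CHEAP` (the variable names are illustrative, not a standard):

```python
import os

# Single source of truth: override via environment, fall back to defaults
DEFAULT_MODEL = os.environ.get("LLM_MODEL", "gpt-4")
CHEAP_MODEL = os.environ.get("LLM_MODEL_CHEAP", "gpt-3.5-turbo")

def get_model(tier="default"):
    """Resolve a model name in one place instead of hardcoding it per call site."""
    return {"default": DEFAULT_MODEL, "cheap": CHEAP_MODEL}[tier]

# Call sites ask for a tier, never a literal model string
model = get_model("cheap")
```

When a better or cheaper model ships, you change one environment variable instead of hunting through fifty files.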
Ignoring conversation history is a subtler but common issue. Many developers build their first chatbot by sending only the latest user message to the API, resulting in a model that has no memory of the conversation. The model then gives responses that contradict what it said earlier, cannot answer follow-up questions, and fails to maintain context. Building proper conversation history management, appending each turn to a messages list and sending the full history, is fundamental to conversational applications.
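A minimal sketch of that history management, with a stand-in `call_llm` in place of a real API call (the message format matches the OpenAI chat structure used throughout this article):

```python
def call_llm(messages):
    """Stand-in for a real chat completion call; echoes for illustration."""
    return f"(reply to: {messages[-1]['content']})"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat_turn(user_input):
    history.append({"role": "user", "content": user_input})
    reply = call_llm(history)  # send the FULL history, not just the last message
    history.append({"role": "assistant", "content": reply})
    return reply

chat_turn("My name is Ada.")
chat_turn("What is my name?")  # the model can see the earlier turn in `history`
print(len(history))            # 5: one system + two user + two assistant messages
```

In production you would also trim or summarize old turns once the history approaches the model's context window.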
Not handling None content in API responses is a frequent source of AttributeError crashes in production. Both APIs can return responses where the content is None due to content filtering, context length issues, or other edge cases. Always check that the response content exists before accessing it, and build graceful degradation paths for when you get an empty response.
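A defensive extraction helper might look like this sketch; the `SimpleNamespace` objects below merely simulate the shape of real SDK responses so the pattern is visible without a network call:

```python
from types import SimpleNamespace

def safe_content(response, fallback="(empty response)"):
    """Extract message content defensively; `content` can be None."""
    try:
        content = response.choices[0].message.content
    except (AttributeError, IndexError):
        return fallback
    return content if content is not None else fallback

# Simulated responses (stand-ins for real SDK objects)
ok = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="hi"))])
filtered = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content=None))])
print(safe_content(ok))        # hi
print(safe_content(filtered))  # (empty response)
```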
Finally, skipping prompt versioning is a mistake you will regret. Prompts are code: they affect output quality just as much as any other logic in your system. When you change a prompt and a metric regresses, you need to be able to roll back. Store prompts in version-controlled files or a database with history, not as bare strings inline in your functions. This becomes critical when multiple team members are all tweaking prompts independently.
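As a minimal sketch, a versioned prompt registry; it is shown in-code for brevity, but the same idea applies to prompt files checked into git or rows in a database:

```python
# Every prompt has an explicit name and version; old versions stay available
PROMPTS = {
    ("summarize", "v1"): "Summarize the following text in 3 bullet points:\n{text}",
    ("summarize", "v2"): "Summarize the following text in one paragraph:\n{text}",
}

def get_prompt(name, version):
    return PROMPTS[(name, version)]

# When v2 regresses a metric, rolling back is a one-line change at the call site
prompt = get_prompt("summarize", "v1").format(text="...")
print(prompt.splitlines()[0])  # Summarize the following text in 3 bullet points:
```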
Building a Wrapper for Easy Switching
Once you've worked with both, you'll notice the patterns differ just enough to be annoying. Here's a simple abstraction that lets you swap providers. This pattern becomes especially valuable in multi-provider architectures, where you might use different providers for different tasks or implement fallback logic that switches to a secondary provider if the primary is unavailable.
```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt, system=None, temperature=0.7, max_tokens=500):
        pass

class OpenAIProvider(LLMProvider):
    def __init__(self, api_key, model="gpt-4"):
        from openai import OpenAI
        self.client = OpenAI(api_key=api_key)
        self.model = model

    def complete(self, prompt, system=None, temperature=0.7, max_tokens=500):
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )
        return response.choices[0].message.content

class AnthropicProvider(LLMProvider):
    def __init__(self, api_key, model="claude-3-5-sonnet-20241022"):
        from anthropic import Anthropic
        self.client = Anthropic(api_key=api_key)
        self.model = model

    def complete(self, prompt, system=None, temperature=0.7, max_tokens=500):
        kwargs = {}
        if system:
            kwargs["system"] = system  # only pass system when provided
        response = self.client.messages.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=max_tokens,
            **kwargs
        )
        return response.content[0].text

# Usage
provider = OpenAIProvider(OPENAI_API_KEY)
result = provider.complete("Why is the sky blue?", system="You are a scientist.")

# Switch to Anthropic by changing one line
provider = AnthropicProvider(ANTHROPIC_API_KEY)
result = provider.complete("Why is the sky blue?", system="You are a scientist.")
```

This abstraction hides the API differences. Your application calls .complete(), and you choose the provider at runtime. Extend this pattern by adding cost tracking, retry logic, and logging inside each provider's complete method so those concerns are handled uniformly regardless of which backend you are using.
Practical Tips for Production
1. Validate API responses before accessing them:
```python
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

# WRONG: direct access
# text = response.choices[0].message.content

# RIGHT: validate first
if response.choices and response.choices[0].message.content is not None:
    text = response.choices[0].message.content
else:
    text = None  # Handle gracefully
```

Models occasionally return empty choices (rare, but it happens), and content can be None after content filtering. Check before indexing. A defensive approach to response parsing will save you from hard-to-reproduce crashes in production.
2. Log everything for future debugging:
```python
import logging
import json

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def api_call_with_logging(prompt, model="gpt-4"):
    logger.info(f"API call starting. Model: {model}, Prompt length: {len(prompt)}")
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200
        )
        logger.info(json.dumps({
            "event": "api_success",
            "model": model,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
            "cost_usd": (response.usage.prompt_tokens / 1000 * 0.03 +
                         response.usage.completion_tokens / 1000 * 0.06)
        }))
        return response.choices[0].message.content
    except Exception as e:
        logger.error(f"API call failed: {type(e).__name__}: {e}")
        raise
```

Structured logs let you aggregate costs, track error rates, and debug issues.
3. Use timeouts to prevent hanging:
```python
# OpenAI client with timeout
from openai import OpenAI, APITimeoutError

client = OpenAI(api_key=OPENAI_API_KEY, timeout=30.0)  # 30-second timeout

try:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Long computation..."}]
    )
except APITimeoutError:
    logger.error("API request timed out after 30 seconds")
    # Handle gracefully
```

4. Cache responses to save costs and latency:
```python
from functools import lru_cache

@lru_cache(maxsize=128)
def get_completion(prompt, model="gpt-4"):
    # lru_cache keys on the arguments, so identical prompts skip the API
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200
    )
    return response.choices[0].message.content

# First call: hits API
result1 = get_completion("What is Python?")
# Second call: cached, instant
result2 = get_completion("What is Python?")
```

For production, use Redis or another distributed cache, keyed on a hash of the model plus prompt.
5. Model selection: cost vs. capability:
| Model | Cost | Speed | Reasoning | Best For |
|---|---|---|---|---|
| gpt-3.5-turbo | $ | Fast | Basic | Development, high-volume tasks |
| gpt-4 | $$$$ | Slow | Excellent | Complex tasks, quality-critical |
| claude-3-5-sonnet | $$ | Fast | Good | General-purpose, balanced |
| claude-3-opus | $$$$ | Slow | Excellent | Difficult reasoning, highest quality |
During development, use cheap models. Validate logic. For production, upgrade selectively.
6. Handle partial failures gracefully:
```python
import queue

retry_queue = queue.Queue()  # drained later by a background worker

def process_batch_with_fallback(prompts):
    results = []
    for prompt in prompts:
        try:
            # call_api_with_timeout wraps the client call with a per-request timeout
            result = call_api_with_timeout(prompt, timeout=10)
            results.append(result)
        except APITimeoutError:
            # Fallback to cheaper, faster model
            logger.warning("GPT-4 timeout. Falling back to GPT-3.5...")
            result = call_api_with_timeout(
                prompt, model="gpt-3.5-turbo", timeout=5
            )
            results.append(result)
        except RateLimitError:
            # Rate limited; add to retry queue
            logger.error("Rate limited. Adding to retry queue.")
            retry_queue.put(prompt)
            results.append(None)
    return results
```

Fallback strategies keep your system resilient.
Migration Checklist: OpenAI → Anthropic (or vice versa)
If you're switching providers:

- Update API key configuration
- Change client initialization (`OpenAI` → `Anthropic`)
- Adjust message structure (system prompt location, message roles)
- Update response parsing (`response.choices[0].message.content` → `message.content[0].text`)
- Test streaming (both work, but iteration differs slightly)
- Recalibrate temperature/max_tokens (models respond differently)
- Verify tool calling/structured output format
- Adjust cost estimation (different pricing per token)
- Update error handling (exception names differ)
Summary
Integrating LLMs into your Python applications is one of the highest-leverage skills you can develop right now. The patterns are learnable, the APIs are well-documented, and the capabilities you unlock, from natural language interfaces to autonomous tool-using agents, are genuinely transformative for what you can build.
Both OpenAI and Anthropic have earned their places in the ecosystem. OpenAI's API ecosystem is mature with broad tooling support, a massive community, and models that excel at instruction-following and multimodal tasks. Anthropic's Claude models bring strong reasoning capabilities and a design philosophy that makes complex system prompts and structured outputs feel natural. Neither is strictly better; they complement each other, and the developers who understand both will make better architectural decisions than those who default to one provider out of familiarity.
The production patterns covered here (retry logic with exponential backoff, token counting, cost tracking, async batch processing, the provider abstraction wrapper) are not optional polish. They are the difference between a demo and a system that runs reliably at scale. Start with the basics, get something working, and then layer in resilience, observability, and cost management before you hit production.
Build with both. Measure what matters. Keep your max_tokens honest. Store your prompts in version control. And when a model starts returning something unexpected at 2 AM, you will be glad you set up structured logging.
Ready to go deeper? The next article covers building AI agents with LangChain, where these APIs become components in intelligent systems that plan, reason, and take multi-step actions in the world.