LLM Integration: OpenAI and Anthropic APIs from Python

We are living through one of the most significant technology shifts in decades. Large language models have gone from academic curiosities to production infrastructure in just a few years, and the developers who understand how to integrate them effectively are in enormous demand right now. Whether you think of LLMs as a fancy autocomplete or as a genuine reasoning engine, the commercial reality is clear: these models are already embedded in customer-facing products, internal tools, and automated pipelines at companies of every size. The question is no longer whether your organization will use them, but how well your team can build with them.
The LLM revolution did not happen overnight, and it is worth understanding what changed. Earlier generations of NLP required massive labeled datasets, custom training pipelines, and deep ML expertise just to build a basic classifier. Then transformer-based models scaled up, and suddenly you could prompt your way to capabilities that would have taken months of specialized engineering just a few years ago. The release of GPT-3 in 2020 was a turning point. Developers realized they could describe a task in plain English and get usable output without writing a single training loop. By the time GPT-4 and Claude arrived, the API-first model had won: you pay for inference, you send prompts, you get completions. No GPUs required, no model weights to manage.
What this means for you as a Python developer is that LLM integration is now a core skill, right alongside database queries and REST API consumption. You need to know how to authenticate with these services, how to structure requests, how to handle errors gracefully, and how to think about costs before they spiral. You also need to understand the meaningful differences between providers, because OpenAI and Anthropic are not interchangeable. They have different API designs, different model personalities, different pricing models, and different strengths. Building a production application means making real choices about which provider fits which part of your system.
You've got a killer idea for an app powered by a large language model. Maybe it's a customer support chatbot, a content generator, or a reasoning engine that helps your team make better decisions. The problem? You need to actually integrate an LLM API into your Python application, and you're not sure whether to reach for OpenAI or Anthropic, or how to handle streaming, cost estimation, and error handling when things inevitably break.
This article walks you through both APIs side-by-side, from initial setup through streaming, structured output, tool calling, and production hardening, showing you how to build integrations that handle the real complexities: rate limiting, retry logic, and keeping your API keys safe. By the end, you will have working patterns for both providers and enough context to make informed architectural decisions for your own projects.
Table of Contents
- Why Both APIs Matter
- Getting Started: API Keys and Environment Setup
- Installing the Official Clients
- OpenAI: Chat Completions Basics
- Anthropic: Messages API
- API Design Differences
- Streaming: Real-Time Token Delivery
- Prompt Engineering Basics
- Prompt Engineering: System Prompts and Few-Shot Learning
- Structured Output: JSON and Tool Calling
- Streaming and Token Management
- Vision and Multimodal Inputs
- Async API Calls for Throughput
- Rate Limiting, Retries, and Error Handling
- Token Counting and Cost Estimation
- Common LLM Integration Mistakes
- Building a Wrapper for Easy Switching
- Practical Tips for Production
- Migration Checklist: OpenAI → Anthropic (or vice versa)
- Summary
Why Both APIs Matter
Here's the thing: OpenAI and Anthropic have different strengths. OpenAI's GPT-4 excels at instruction-following and multimodal tasks. Anthropic's Claude emphasizes safety and reasoning with Constitutional AI. For serious work, you might use both: different tasks call for different tools.
Both offer REST APIs and Python clients, both support streaming, and both charge by tokens. But the implementation details? They're different enough that we need to look at concrete code.
Getting Started: API Keys and Environment Setup
Before any integration work, you need credentials. The good news is that both providers have streamlined their signup processes, and you can have keys in hand within minutes. The important thing at this stage is to establish good security habits from day one: secrets that get committed to version control have a way of being discovered and abused, and rotating compromised keys is a painful process you want to avoid entirely.
For OpenAI:
- Create an account at openai.com
- Navigate to API keys
- Generate a new API key
- Store it in your environment
For Anthropic:
- Create an account at anthropic.com
- Navigate to the console
- Generate an API key
- Store it in your environment
Both providers recommend using environment variables instead of hardcoding keys. Here's the pattern:
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
if not OPENAI_API_KEY or not ANTHROPIC_API_KEY:
    raise ValueError("Missing API keys in environment")
This pattern loads your .env file at application startup and then reads the keys from the environment. The validation check at the bottom is important: you want your app to fail fast with a clear error at startup rather than discovering a missing key during a production request three hours later.
Use a .env file (never commit it):
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
And add .env to your .gitignore. Non-negotiable.
Installing the Official Clients
Both providers maintain official Python packages. Install them:
pip install openai anthropic python-dotenv
The openai package gives you the OpenAI client. The anthropic package gives you the Anthropic client. Both handle HTTP under the hood; you get clean, Pythonic interfaces.
Both clients are actively maintained and released frequently. In production, pin your versions in requirements.txt to prevent unexpected breaking changes from upstream updates; both providers do occasionally make changes to their clients that require code updates on your end.
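A pinned requirements.txt might look like this (version numbers are illustrative; check PyPI for the current releases before pinning):

```
openai==1.35.0
anthropic==0.30.0
python-dotenv==1.0.1
```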
OpenAI: Chat Completions Basics
Let's start simple. The Chat Completions endpoint is OpenAI's primary interface for interacting with GPT models, and understanding its structure will make every subsequent pattern easier to follow. The key insight is that OpenAI models are designed around a conversation metaphor: you provide a list of messages with roles, and the model responds as if it is continuing that conversation.
Here's how you make a basic API call to OpenAI:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    temperature=0.7,
    max_tokens=150
)
print(response.choices[0].message.content)
Key parameters:
- model: Which GPT version (gpt-4, gpt-4-turbo, gpt-3.5-turbo)
- messages: List of dicts with role and content keys
- temperature: Controls randomness (0.0 = deterministic, 1.0 = creative)
- max_tokens: Hard cap on output length
The response object has a structured format: response.choices[0].message.content gives you the text. The choices list exists because OpenAI supports generating multiple candidate responses in a single call using the n parameter, though for most applications you will request exactly one choice and index directly into choices[0].
Anthropic: Messages API
Anthropic's approach is similar but structured differently. The design philosophy is slightly different: Anthropic treats the system prompt as a first-class concept that lives outside the conversation turn structure, which makes the role of system instructions more explicit and easier to reason about. Here's the equivalent:
from anthropic import Anthropic
client = Anthropic(api_key=ANTHROPIC_API_KEY)
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=150,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ]
)
print(message.content[0].text)
Notice the differences:
- system is a top-level parameter, not part of messages
- The Messages API uses client.messages.create() instead of chat.completions.create()
- Response structure: message.content[0].text instead of response.choices[0].message.content
Why? Anthropic's Messages API is simpler and more explicit. The system prompt is conceptually separate from conversation history, so it lives at the top level. The content field on the response is a list because Anthropic's API supports multiple content blocks in a response: text, tool-use requests, and other types can all appear together. That is why you access the first item with [0] and then read its .text attribute.
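Because content is a list of blocks, grabbing content[0].text blindly can fail once tool-use blocks enter the picture. A safer habit is to join only the text blocks. This is a sketch: collect_text is a hypothetical helper, and the Block dataclass stands in for the SDK's duck-typed block objects (which also expose .type and .text):

```python
from dataclasses import dataclass

@dataclass
class Block:
    """Stand-in for an SDK content block (duck-typed: .type and .text)."""
    type: str
    text: str = ""

def collect_text(blocks):
    """Join the text of all text-type blocks, skipping tool_use and others."""
    return "".join(b.text for b in blocks if b.type == "text")

# Works the same on a real response: collect_text(message.content)
blocks = [Block("text", "Hello, "), Block("tool_use"), Block("text", "world.")]
print(collect_text(blocks))  # Hello, world.
```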
API Design Differences
Understanding why these two APIs feel different will help you use both of them more effectively and avoid subtle bugs when switching between providers. The differences are not arbitrary; they reflect distinct design philosophies about how humans and models should interact.
OpenAI's Chat Completions API was designed around the metaphor of a chat interface, which is why everything, including the system prompt, lives inside the messages list. This makes it easy to construct few-shot examples inline and gives you a uniform data structure to reason about. The response uses choices (plural) to support the n parameter, which can generate multiple candidate completions in one call. The entire response structure is optimized for flexibility and composability.
Anthropic's Messages API takes a cleaner separation-of-concerns approach. The system parameter is intentionally top-level because system instructions are not part of the conversation; they are meta-instructions that configure the model's behavior for the entire session. Keeping it separate makes the API contract clearer and reduces the risk of accidental system prompt injection from user-controlled content. The content field being a list of blocks rather than a single string is forward-looking design that handles multi-modal inputs and outputs cleanly.
In practice, these differences show up most clearly in three places: how you handle system prompts (top-level vs. first message), how you parse responses (message.content[0].text vs. response.choices[0].message.content), and how you format tool results when doing function calling. The wrapper pattern we cover later in this article abstracts over these differences so your application code does not have to care which provider is underneath.
Streaming: Real-Time Token Delivery
When you're building user-facing apps, streaming matters. Nobody wants to stare at a loading spinner while 2,000 tokens are generated. Both APIs support streaming: tokens arrive as the model produces them. The user experience difference between streaming and non-streaming is dramatic: with streaming, the interface feels responsive and alive; without it, every request feels like a page load.
OpenAI streaming:
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Write a short story about a robot learning to cook."}
    ],
    stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
You get a generator of chunk objects. Each chunk's .delta.content contains the new tokens. The flush=True ensures real-time display. Note the guard on chunk.choices[0].delta.content: the first and last chunks often have None content, so you need to check before printing to avoid writing the string "None" to your output.
Anthropic streaming:
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[
        {"role": "user", "content": "Write a short story about a robot learning to cook."}
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Anthropic wraps the stream in a context manager and gives you .text_stream, which is simpler and more Pythonic. The SDK accumulates the full message as it streams; you can also call stream.get_final_message() to retrieve the complete response after the stream finishes, which is useful when you need both the streamed output for the UI and the full metadata (token counts, stop reason) for logging.
Prompt Engineering Basics
The quality of what you get from an LLM is almost entirely determined by the quality of what you send it. Prompt engineering is the discipline of crafting inputs that reliably produce the outputs you want, and it is genuinely a skill that compounds over time. The good news is that the fundamentals are straightforward, and applying them consistently will dramatically improve your results.
The most important principle is specificity. Vague prompts produce vague outputs. If you want a two-paragraph explanation, say "in two paragraphs." If you want bullet points, say "as a bulleted list." If you want the model to maintain a particular tone or persona, describe it explicitly in your system prompt. The model is not reading your mind; it is pattern-matching against your instructions, so the clearer your instructions, the better the match.
Role prompting is one of the most effective techniques in your toolkit. When you tell the model "You are an expert Python developer with 15 years of experience reviewing production code," you prime it to access the patterns and vocabulary associated with that role. This is not magic: the model has seen enormous amounts of text written by experts, and the role prompt activates those patterns. The difference in output quality between a generic prompt and a well-crafted role prompt can be striking.
Chain-of-thought prompting is another essential technique, especially for reasoning-heavy tasks. Adding "Think step by step before answering" or "Work through this problem carefully, showing your reasoning" causes the model to produce intermediate reasoning steps before arriving at an answer. This dramatically improves accuracy on math problems, logical reasoning, and multi-step planning tasks. The model's extended thinking process effectively becomes part of the output, and you can either display it to the user or strip it out before presenting the final answer.
Few-shot examples work on both APIs the same way: you show the model a few input-output pairs that demonstrate the pattern you want, then give it a new input to complete. This is particularly effective for classification, extraction, and formatting tasks where showing is more efficient than telling. The conversation structure of both APIs makes few-shot prompting natural, you just alternate between user and assistant messages to build up your examples before presenting the real input.
Prompt Engineering: System Prompts and Few-Shot Learning
Both APIs let you craft system prompts: instructions that shape the model's behavior. Here's how few-shot learning works with both. Few-shot examples are especially valuable when you need consistent formatting or when the task is easier to demonstrate than to describe verbally.
OpenAI example (classification task):
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Respond with only 'positive', 'negative', or 'neutral'."
        },
        {"role": "user", "content": "I love this product!"},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "This is terrible."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "It's okay, nothing special."},
        {"role": "assistant", "content": "neutral"},
        {"role": "user", "content": "Best purchase I've ever made."}
    ],
    temperature=0.0
)
print(response.choices[0].message.content)
Anthropic equivalent:
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=10,
    temperature=0.0,
    system="You are a sentiment classifier. Respond with only 'positive', 'negative', or 'neutral'.",
    messages=[
        {"role": "user", "content": "I love this product!"},
        {"role": "assistant", "content": "positive"},
        {"role": "user", "content": "This is terrible."},
        {"role": "assistant", "content": "negative"},
        {"role": "user", "content": "It's okay, nothing special."},
        {"role": "assistant", "content": "neutral"},
        {"role": "user", "content": "Best purchase I've ever made."}
    ]
)
print(message.content[0].text)
The structure mirrors the API design: OpenAI mixes system and examples in messages; Anthropic keeps system separate. Both work. Few-shot examples help the model learn patterns without retraining. Notice also that temperature=0.0 and a tight max_tokens budget enforce consistency for classification tasks: you want deterministic outputs here, not creative variation.
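If you reuse few-shot prompts across tasks, the alternating structure is easy to factor into a small helper so the examples live as plain data. This is a sketch; build_few_shot is a hypothetical name, not part of either SDK:

```python
def build_few_shot(examples, query):
    """Turn (input, output) pairs into alternating user/assistant messages,
    ending with the real query as the final user turn."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

examples = [
    ("I love this product!", "positive"),
    ("This is terrible.", "negative"),
]
msgs = build_few_shot(examples, "Best purchase I've ever made.")
# Pass msgs to either client; supply the system prompt per provider convention.
print(len(msgs))  # 5
```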
Structured Output: JSON and Tool Calling
Sometimes you need the model to return structured data: JSON, database records, whatever. Both APIs support this. Structured outputs are critical when you're building workflows where the LLM feeds data into other systems with no human validation, just automated processing. If a model returns something you cannot parse, your pipeline breaks, so the ability to enforce output format is not a nice-to-have; it is a reliability requirement.
OpenAI with JSON mode:
import json
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "user", "content": "Extract the name, age, and city as JSON from: 'Alice is 28 and lives in Portland.'"}
    ],
    response_format={"type": "json_object"}
)
output = json.loads(response.choices[0].message.content)
print(output)  # {"name": "Alice", "age": 28, "city": "Portland"}
OpenAI's JSON mode guarantees syntactically valid JSON output (as long as the response is not cut off by max_tokens). This is huge for reliability: no more parsing errors from slightly-malformed responses. Note that JSON mode requires the word "JSON" to appear somewhere in your messages; the API rejects the request otherwise, which is why the prompt above asks for JSON explicitly.
Anthropic with structured output via prefill:
import json
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[
        {"role": "user", "content": "Extract the name, age, and city from: 'Alice is 28 and lives in Portland.' Return valid JSON."},
        {"role": "assistant", "content": "{"}
    ]
)
# Claude continues the JSON since you started it
full_json = "{" + message.content[0].text
output = json.loads(full_json)
print(output)
Anthropic's approach: you prefill the assistant's message to guide the format. It's more flexible but requires more finesse. You're essentially saying "start the JSON; now you finish it." Claude respects this constraint and builds on your prefix. This technique works because Claude is trained to continue patterns you establish: an opening brace signals clearly that you want a JSON object.
A better Anthropic pattern using system prompts:
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system="You are a data extractor. Always respond with valid JSON only, no markdown, no explanations.",
    messages=[
        {"role": "user", "content": "Extract name, age, city from: 'Alice is 28 and lives in Portland.'"}
    ]
)
output = json.loads(message.content[0].text)
print(output)
This pattern is cleaner. The system prompt sets expectations, and Claude delivers JSON directly. No prefilling tricks needed. The key phrase "no markdown, no explanations" is important: without it, Claude will sometimes wrap its JSON in a code block, which breaks json.loads().
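If you want belt-and-suspenders parsing on top of the system prompt, a small helper can strip a stray markdown fence before handing the text to json.loads(). This is a sketch; parse_json_reply is a hypothetical helper, not an SDK feature:

```python
import json
import re

def parse_json_reply(text):
    """Parse model output as JSON, tolerating a ```json ... ``` wrapper."""
    cleaned = text.strip()
    # Strip an optional markdown fence like ```json ... ``` or ``` ... ```
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", cleaned, re.DOTALL)
    if match:
        cleaned = match.group(1)
    return json.loads(cleaned)

print(parse_json_reply('```json\n{"name": "Alice", "age": 28}\n```'))
# {'name': 'Alice', 'age': 28}
```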
Tool calling (function calling):
Tool calling is where LLMs become truly agentic. You define functions, describe their purpose, and the model decides whether and when to call them. This enables multi-step reasoning and external tool interaction: the LLM is no longer just generating text; it's orchestrating workflows. The mental model shift here is significant: instead of treating the LLM as a text transformer, you are treating it as a reasoning agent that can decide to take actions in the world.
OpenAI tool calling:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name or coordinates"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_flights",
            "description": "Search for flights between two cities",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string"},
                    "destination": {"type": "string"},
                    "date": {"type": "string", "description": "YYYY-MM-DD format"}
                },
                "required": ["origin", "destination", "date"]
            }
        }
    }
]
messages = [{"role": "user", "content": "I want to visit Boston next week. Is the weather good? And find me flights from NYC."}]
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)
# Keep the assistant's tool-call turn in the history
messages.append(response.choices[0].message)
# Process tool calls
for tool_call in response.choices[0].message.tool_calls:
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)
    print(f"Model wants to call: {name}({args})")
    # Execute the function (your implementation)
    if name == "get_weather":
        result = get_weather_impl(args["location"], args.get("unit", "fahrenheit"))
    elif name == "search_flights":
        result = search_flights_impl(args["origin"], args["destination"], args["date"])
    # Send the result back as a tool message for further reasoning
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": str(result)
    })
Anthropic's equivalent (Messages API with tools):
tools = [
    {
        "name": "get_weather",
        "description": "Get the weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name or coordinates"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
]
messages = [
    {"role": "user", "content": "I want to visit Boston next week. Is the weather good?"}
]
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=messages
)
# Check for tool use in the response
for content_block in response.content:
    if content_block.type == "tool_use":
        tool_name = content_block.name
        tool_input = content_block.input
        print(f"Claude wants to call: {tool_name}({tool_input})")
        # Execute your function
        result = get_weather_impl(tool_input["location"], tool_input.get("unit", "fahrenheit"))
        # Send result back: echo the assistant turn, then a tool_result block
        messages.append({"role": "assistant", "content": response.content})
        messages.append({
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": content_block.id,
                    "content": str(result)
                }
            ]
        })
Key differences:
- OpenAI uses a tool_calls attribute; Anthropic uses content_block.type == "tool_use"
- Anthropic schemas use input_schema; OpenAI uses parameters
- Anthropic requires tool results wrapped in tool_result blocks with the tool's ID
- Both models iterate until they decide they have enough information to answer
This pattern enables complex workflows: the LLM reasons about what it needs, calls tools, gets results, and continues reasoning.
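As a sketch, that iterate-until-done control flow looks like this. The model is stubbed out with a fake call_model function so the loop itself is visible; in real code the stub is your chat.completions.create or messages.create call, and the stop condition comes from the response's stop reason. All names here are hypothetical:

```python
def run_tool_loop(call_model, execute_tool, messages, max_turns=5):
    """Generic agent loop: call the model, run any requested tool,
    feed the result back, and stop when the model gives a final answer."""
    for _ in range(max_turns):
        reply = call_model(messages)  # model decides: answer or tool call
        if reply["type"] == "final":
            return reply["text"]
        # Model asked for a tool: execute it and append the result
        result = execute_tool(reply["tool"], reply["args"])
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    raise RuntimeError("Agent did not finish within max_turns")

# Fake model: asks for weather once, then answers.
def fake_model(messages):
    if any("Tool result" in m["content"] for m in messages):
        return {"type": "final", "text": "It's sunny in Boston."}
    return {"type": "tool", "tool": "get_weather", "args": {"location": "Boston"}}

def fake_tool(name, args):
    return "sunny, 72F"

answer = run_tool_loop(fake_model, fake_tool, [{"role": "user", "content": "Weather in Boston?"}])
print(answer)  # It's sunny in Boston.
```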
Streaming and Token Management
Streaming is more than a UX enhancement; it is also a tool for managing the economics of LLM API calls. When you stream, you can implement early stopping if the model starts going in the wrong direction, you can enforce length limits by counting tokens on the fly, and you can provide real-time feedback to users instead of making them wait for a complete response. Understanding streaming deeply makes your integrations both more responsive and more cost-efficient.
On the token management side, both providers meter your usage in input tokens (what you send) and output tokens (what the model generates). Output tokens are typically two to four times more expensive than input tokens across both OpenAI and Anthropic's pricing, which means the most effective cost optimization is usually to tighten your max_tokens setting. You are billed only for tokens actually generated, but a loose cap lets a runaway generation get expensive, and the providers may count your requested max_tokens against rate limits. If you are building a task where the answer should be at most 200 words, set max_tokens=300 rather than max_tokens=4096.
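As a rough sketch, per-call cost is just a multiply-and-add over the usage numbers. The prices below are placeholders, not real rates; look up your model's current pricing before relying on anything like this:

```python
# Placeholder prices in USD per 1M tokens -- check your provider's pricing page.
PRICES = {
    "example-model": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the dollar cost of one call from its token usage."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10,000 input tokens + 500 output tokens at the placeholder rates:
print(round(estimate_cost("example-model", 10_000, 500), 4))  # 0.0375
```

Feed it the usage object from a response (input_tokens/output_tokens for Anthropic, prompt_tokens/completion_tokens for OpenAI) and accumulate per request to get a running spend total.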
Streaming also interacts with token counting in a useful way: most providers return usage information at the end of the stream. With OpenAI, the final chunk contains a usage object if you pass stream_options={"include_usage": True}. With Anthropic, calling stream.get_final_message() after the stream closes returns the full message object including usage.input_tokens and usage.output_tokens. Building cost tracking that works with streaming requires capturing this final metadata rather than trying to count tokens yourself.
For conversational applications, you also need to manage context window size. Every time you continue a conversation, you are sending the full history as input tokens. A conversation that has been going for 30 turns can easily consume 10,000+ input tokens per request. Two strategies help here: first, periodically summarize older turns into a compact summary and drop the raw history; second, set a maximum history length and truncate from the oldest end, keeping recent context and the system prompt intact. Both approaches trade off some coherence for reduced cost and latency.
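The truncation strategy can be a few lines. This sketch (truncate_history is a hypothetical helper) works on an OpenAI-style list where the system prompt is the first message; with Anthropic you would keep the separate system parameter and truncate only messages:

```python
def truncate_history(messages, max_turns=10):
    """Keep the system message (if present) plus the last max_turns messages."""
    if messages and messages[0]["role"] == "system":
        system, rest = messages[:1], messages[1:]
    else:
        system, rest = [], messages
    return system + rest[-max_turns:]

history = [{"role": "system", "content": "Be brief."}]
for i in range(30):
    history.append({"role": "user", "content": f"turn {i}"})
    history.append({"role": "assistant", "content": f"reply {i}"})

trimmed = truncate_history(history, max_turns=10)
print(len(trimmed))  # 11: system prompt + the 10 most recent messages
```

One caveat: truncate on message boundaries so you never drop a user turn while keeping its assistant reply, which can confuse the model.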
Vision and Multimodal Inputs
Both APIs support images. You can pass images to the model for analysis, OCR, diagram understanding, anything visual. Multimodal capability opens up an entirely new category of application: automated screenshot analysis, document processing, chart interpretation, and visual QA systems that would have required specialized computer vision pipelines just a few years ago.
OpenAI with images:
import base64
# Load image from file
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
    model="gpt-4o",  # a vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}"
                    }
                }
            ]
        }
    ],
    max_tokens=500
)
print(response.choices[0].message.content)
You can also pass URLs directly: "url": "https://example.com/image.jpg". URL-based image passing is more efficient than base64 when the images are already hosted: you save the bandwidth of encoding and transmitting the full image in your request body.
Anthropic with images:
import base64
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "What do you see in this image?"
                }
            ]
        }
    ]
)
print(message.content[0].text)
Anthropic also supports URLs: "source": {"type": "url", "url": "https://example.com/image.jpg"}. Note that this example places the image before the text in the content list; while the order may not always matter, placing context before questions is a general best practice that tends to improve response quality.
The underlying pattern is the same: mixed-content messages with text and images. Both models handle diagrams, screenshots, charts, and photos. GPT-4 Vision excels at detailed visual analysis; Claude 3.5 handles it well too.
Async API Calls for Throughput
Processing batches of requests? Use async to make calls in parallel. Both clients support async. This is critical when latency adds up: if each request takes 2 seconds and you have 100 requests, sequential processing is 200 seconds. Async brings it down to maybe 3–5 seconds depending on your rate limits.
import asyncio
from openai import AsyncOpenAI
async_client = AsyncOpenAI(api_key=OPENAI_API_KEY)
async def process_request(prompt):
    response = await async_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response.choices[0].message.content

async def main():
    prompts = ["What is AI?", "Define machine learning.", "Explain deep learning."]
    # Run all requests concurrently
    results = await asyncio.gather(
        *[process_request(p) for p in prompts],
        return_exceptions=True  # Don't fail if one request errors
    )
    for prompt, result in zip(prompts, results):
        if isinstance(result, Exception):
            print(f"Error for '{prompt}': {result}")
        else:
            print(f"{prompt}: {result}")

asyncio.run(main())
The return_exceptions=True flag ensures one failed request doesn't crash the entire batch. This is especially important when processing hundreds of items: you do not want a single rate limit error to discard all the work that succeeded before it.
Anthropic async:
from anthropic import AsyncAnthropic
async_client = AsyncAnthropic(api_key=ANTHROPIC_API_KEY)
async def process_request(prompt):
    message = await async_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

async def main():
    prompts = ["What is AI?", "Define machine learning.", "Explain deep learning."]
    results = await asyncio.gather(
        *[process_request(p) for p in prompts],
        return_exceptions=True
    )
    for result in results:
        print(result)

asyncio.run(main())
Practical tip: Batch processing with semaphores
If you're processing thousands of requests, rate limits become real. Use a semaphore to limit concurrent requests:
async def batch_process_with_limit(prompts, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_request(prompt):
        async with semaphore:
            return await process_request(prompt)

    return await asyncio.gather(
        *[bounded_request(p) for p in prompts],
        return_exceptions=True
    )

# Respects rate limits: max 5 concurrent requests
prompts = [f"Summarize document {i}" for i in range(1000)]
results = asyncio.run(batch_process_with_limit(prompts))
This prevents overwhelming the API and keeps you under rate limits. Tune max_concurrent based on your tier limits: both OpenAI and Anthropic publish rate limit tables by API tier, and you can request limit increases once your usage justifies it.
Rate Limiting, Retries, and Error Handling
APIs fail. Networks hiccup. Models hit rate limits. Production code handles all of this gracefully, or you'll wake up to a paging alarm. The difference between code that handles failures gracefully and code that crashes is the difference between a system that recovers automatically and one that requires manual intervention at 3 AM.
import time
from openai import OpenAI, RateLimitError, APIError, APIConnectionError, APITimeoutError
client = OpenAI(api_key=OPENAI_API_KEY)
def call_with_retry(prompt, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=150
            )
            return response.choices[0].message.content
        except RateLimitError as e:
            # Hit rate limit: back off and retry
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)
                print(f"Rate limited (attempt {attempt + 1}). Waiting {delay}s...")
                time.sleep(delay)
            else:
                print(f"Max retries exceeded. Rate limit error: {e}")
                raise
        except APITimeoutError as e:
            # Request timed out: might be worth retrying.
            # (Catch this before APIConnectionError, its parent class.)
            if attempt < max_retries - 1:
                print(f"Timeout (attempt {attempt + 1}). Retrying...")
                time.sleep(base_delay * (2 ** attempt))
            else:
                print(f"Timeout after {max_retries} retries: {e}")
                raise
        except APIConnectionError as e:
            # Network issue: usually transient, retry
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)
                print(f"Connection error. Retrying in {delay}s...")
                time.sleep(delay)
            else:
                print(f"Connection failed after {max_retries} retries: {e}")
                raise
        except APIError as e:
            # Generic API error: usually not retryable
            print(f"Unrecoverable API error: {e}")
            raise
    # Should never reach here, but just in case
    raise RuntimeError("Unexpected retry loop exit")
result = call_with_retry("What is quantum computing?", max_retries=3)
print(result)This implementation distinguishes between different error types:
- RateLimitError: You've exceeded your rate limit. Back off and retry.
- APIConnectionError: Network issue. Transient. Retry.
- APITimeoutError: Request took too long. Might succeed on retry.
- APIError: Catch-all. Usually not retryable (auth failures, invalid requests, etc.).
Exponential backoff (1s, 2s, 4s, ...) prevents hammering the API while it recovers. Check both providers' HTTP response codes if you want to handle specific error types more precisely: a 429 is always a rate limit, a 500 may be a transient server error worth retrying, and a 400 is almost always a bug in your request that no amount of retrying will fix.
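That status-code logic can be captured in a small helper. This is an illustrative sketch, not an SDK utility; `is_retryable` is a name invented here, and you would feed it the status code attached to the SDK's status errors.

```python
def is_retryable(status_code: int) -> bool:
    """Decide whether an HTTP status code is worth retrying."""
    if status_code == 429:        # rate limited: back off and retry
        return True
    if 500 <= status_code < 600:  # server error: often transient
        return True
    return False                  # other 4xx: fix the request instead

# Usage
print(is_retryable(429))  # True
print(is_retryable(503))  # True
print(is_retryable(400))  # False
```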
Anthropic error handling (similar pattern):
```python
import time
from anthropic import Anthropic, RateLimitError, APIError, APIConnectionError

client = Anthropic(api_key=ANTHROPIC_API_KEY)

def call_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=150,
                messages=[{"role": "user", "content": prompt}]
            )
            return message.content[0].text
        except RateLimitError:
            if attempt < max_retries - 1:
                delay = 2 ** attempt
                time.sleep(delay)
            else:
                raise
        except (APIConnectionError, APIError):
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                raise
```

Production-grade retry with jitter:
For serious deployments, add jitter to prevent thundering herd (all clients retrying simultaneously):
```python
import random
import time
from openai import OpenAI, RateLimitError

client = OpenAI(api_key=OPENAI_API_KEY)

def call_with_jitter_retry(prompt, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=150
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt < max_retries - 1:
                # Exponential backoff + random jitter
                delay = base_delay * (2 ** attempt)
                jitter = random.uniform(0, delay * 0.1)  # 0–10% jitter
                time.sleep(delay + jitter)
            else:
                raise
```

The jitter prevents all retries from happening at the exact same moment. This matters most in high-concurrency scenarios where dozens of requests might hit a rate limit simultaneously and then all retry at the same second, causing a second wave of rate limit errors.
Token Counting and Cost Estimation
You're paying per token. There is no free lunch and no unlimited budget: tokens add up fast. Count them before spending money, and track costs in production. A single customer asking for a 10,000-word response can cost you dollars. Scale that to thousands of users and you've got real money on the line. Cost management is not an afterthought; it belongs in your architecture from day one.
OpenAI token counting:
```python
from openai import OpenAI
import tiktoken

client = OpenAI(api_key=OPENAI_API_KEY)
encoding = tiktoken.encoding_for_model("gpt-4")

# Estimate tokens before calling the API
prompts = [
    "Explain quantum computing in detail.",
    "Write a 500-word essay on climate change.",
    "Summarize the history of Python in 100 words."
]
for prompt in prompts:
    tokens = encoding.encode(prompt)
    print(f"Prompt: '{prompt[:50]}...' → {len(tokens)} tokens")

# Actual API call with cost tracking
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing in detail."}],
    max_tokens=500
)
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens
print(f"Input: {input_tokens}, Output: {output_tokens}, Total: {total_tokens}")

# Cost calculation (rates change; check OpenAI's pricing page)
# GPT-4: $0.03 per 1K input, $0.06 per 1K output
input_cost = (input_tokens / 1000) * 0.03
output_cost = (output_tokens / 1000) * 0.06
total_cost = input_cost + output_cost
print(f"Input cost: ${input_cost:.6f}")
print(f"Output cost: ${output_cost:.6f}")
print(f"Total cost: ${total_cost:.6f}")
```

Anthropic token counting:
Anthropic doesn't ship a local tokenizer the way OpenAI ships tiktoken, but you can estimate before the call and verify after it. For pre-call estimation, a rough rule of thumb is 4 characters per token, which gives you a ballpark figure without an API call. For exact numbers, rely on the usage metadata that comes back with every response.
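A minimal estimator built on that rule of thumb. The name `estimate_tokens` and the 4-characters-per-token constant are an approximation for budgeting, not anything the provider publishes:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 characters-per-token heuristic."""
    return max(1, round(len(text) / chars_per_token))

prompt = "Explain quantum computing in detail."
print(estimate_tokens(prompt))  # 9 (36 characters / 4)
```

Treat the result as a ballpark for cost planning; the usage metadata on the response is the ground truth.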
```python
from anthropic import Anthropic

client = Anthropic(api_key=ANTHROPIC_API_KEY)

# Make the call; exact token counts come back in the usage metadata
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[{"role": "user", "content": "Explain quantum computing in detail."}]
)
input_tokens = message.usage.input_tokens
output_tokens = message.usage.output_tokens
print(f"Input: {input_tokens}, Output: {output_tokens}")

# Anthropic pricing (subject to change, check their docs)
# Claude 3.5 Sonnet: $3 per 1M input, $15 per 1M output
input_cost = (input_tokens / 1_000_000) * 3
output_cost = (output_tokens / 1_000_000) * 15
total_cost = input_cost + output_cost
print(f"Input cost: ${input_cost:.8f}")
print(f"Output cost: ${output_cost:.8f}")
print(f"Total cost: ${total_cost:.8f}")
```

Cost tracking in production:
```python
import json
from datetime import datetime, timezone

def track_cost(model, input_tokens, output_tokens, prompt_summary=""):
    # Pricing per model (rates change; keep this table current)
    pricing = {
        "gpt-4": {"input": 0.03, "output": 0.06},              # per 1K tokens
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},  # per 1K tokens
        "claude-3-5-sonnet": {"input": 3, "output": 15}        # per 1M tokens
    }
    if model not in pricing:
        raise ValueError(f"Unknown model: {model}")
    rates = pricing[model]
    if "claude" in model:
        # Anthropic pricing is per million
        input_cost = (input_tokens / 1_000_000) * rates["input"]
        output_cost = (output_tokens / 1_000_000) * rates["output"]
    else:
        # OpenAI pricing is per thousand
        input_cost = (input_tokens / 1000) * rates["input"]
        output_cost = (output_tokens / 1000) * rates["output"]
    total_cost = input_cost + output_cost
    # Log for billing/analytics
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        "cost_usd": round(total_cost, 6),
        "prompt_summary": prompt_summary
    }
    print(json.dumps(log_entry))
    return total_cost

# Usage
track_cost("gpt-4", 150, 200, "Quantum computing explanation")
track_cost("claude-3-5-sonnet", 150, 200, "Quantum computing explanation")
```

Pro tip: Set max_tokens conservatively. If a user can request unlimited output, they can bankrupt you. Cap it per request and per user per day.
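One way to sketch both caps is a small in-memory budget guard. The names (`grant_tokens`) and limits here are illustrative; production would back the counter with Redis or a database rather than a process-local dict:

```python
from collections import defaultdict
from datetime import date

MAX_TOKENS_PER_REQUEST = 500   # per-request cap
DAILY_TOKEN_BUDGET = 50_000    # per-user daily cap (illustrative numbers)

_usage = defaultdict(int)      # (user_id, date) -> tokens granted today

def grant_tokens(user_id, requested_tokens):
    """Clamp a request to the per-request cap and enforce the daily budget."""
    granted = min(requested_tokens, MAX_TOKENS_PER_REQUEST)
    key = (user_id, date.today())
    if _usage[key] + granted > DAILY_TOKEN_BUDGET:
        raise RuntimeError(f"Daily token budget exhausted for {user_id}")
    _usage[key] += granted
    return granted  # pass this as max_tokens in the API call

# A 10,000-token request gets clamped to the per-request cap
print(grant_tokens("user-42", 10_000))  # 500
```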
Common LLM Integration Mistakes
Even experienced developers make the same mistakes when integrating LLMs for the first time. Learning to recognize these patterns will save you significant debugging time and embarrassing production incidents.
The most expensive mistake is not setting max_tokens. Both APIs will let the model run to its full context window if you omit this parameter, which can mean 4,096 or even 128,000 tokens for a single response. If your application allows user-controlled prompts without a max_tokens guard, a single malicious or accidentally verbose request can generate a response that costs dollars rather than fractions of a cent. Always set this, always.
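To make the risk concrete, here is the worst-case arithmetic for a single uncapped 4K completion, using the GPT-4 output rate quoted elsewhere in this article (rates change; check current pricing before relying on these numbers):

```python
OUTPUT_RATE_PER_1K = 0.06  # GPT-4 output rate used in this article's examples

uncapped_tokens = 4096     # model allowed to run to a 4K completion
capped_tokens = 150        # explicit max_tokens

uncapped_cost = uncapped_tokens / 1000 * OUTPUT_RATE_PER_1K
capped_cost = capped_tokens / 1000 * OUTPUT_RATE_PER_1K
print(f"uncapped: ${uncapped_cost:.4f} per response")  # $0.2458
print(f"capped:   ${capped_cost:.4f} per response")    # $0.0090
```

Roughly 27x more per response, multiplied across every request your users send.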
The second common mistake is hardcoding model names without a configuration layer. Models get updated, deprecated, and repriced frequently. If you have "gpt-4" scattered across fifty files in your codebase and OpenAI releases a superior "gpt-4-turbo" at lower cost, updating it becomes a find-and-replace operation that is guaranteed to miss something. Store model names in a configuration file or environment variable and reference that single source of truth throughout your code.
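A minimal sketch of that single source of truth, assuming environment variables named `LLM_MODEL` and `LLM_MODEL_CHEAP` (the variable names are illustrative, not a standard):

```python
import os

# Single source of truth: override via environment, fall back to defaults
DEFAULT_MODEL = os.environ.get("LLM_MODEL", "gpt-4")
CHEAP_MODEL = os.environ.get("LLM_MODEL_CHEAP", "gpt-3.5-turbo")

def get_model(tier="default"):
    """Resolve a model name in one place instead of hardcoding it per call site."""
    return {"default": DEFAULT_MODEL, "cheap": CHEAP_MODEL}[tier]

# Call sites ask for a tier, never a literal model string
model = get_model("cheap")
```

When a better or cheaper model ships, you change one environment variable instead of hunting through fifty files.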
Ignoring conversation history is a subtler but common issue. Many developers build their first chatbot by sending only the latest user message to the API, resulting in a model that has no memory of the conversation. The model then gives responses that contradict what it said earlier, cannot answer follow-up questions, and fails to maintain context. Building proper conversation history management, appending each turn to a messages list and sending the full history, is fundamental to conversational applications.
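A minimal sketch of that history management, with a stand-in `call_llm` in place of a real API call (the message format matches the OpenAI chat structure used throughout this article):

```python
def call_llm(messages):
    """Stand-in for a real chat completion call; echoes for illustration."""
    return f"(reply to: {messages[-1]['content']})"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat_turn(user_input):
    history.append({"role": "user", "content": user_input})
    reply = call_llm(history)  # send the FULL history, not just the last message
    history.append({"role": "assistant", "content": reply})
    return reply

chat_turn("My name is Ada.")
chat_turn("What is my name?")  # the model can see the earlier turn in `history`
print(len(history))            # 5: one system + two user + two assistant messages
```

In production you would also trim or summarize old turns once the history approaches the model's context window.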
Not handling None content in API responses is a frequent source of AttributeError crashes in production. Both APIs can return responses where the content is None due to content filtering, context length issues, or other edge cases. Always check that the response content exists before accessing it, and build graceful degradation paths for when you get an empty response.
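A defensive extraction helper might look like this sketch; the `SimpleNamespace` objects below merely simulate the shape of real SDK responses so the pattern is visible without a network call:

```python
from types import SimpleNamespace

def safe_content(response, fallback="(empty response)"):
    """Extract message content defensively; `content` can be None."""
    try:
        content = response.choices[0].message.content
    except (AttributeError, IndexError):
        return fallback
    return content if content is not None else fallback

# Simulated responses (stand-ins for real SDK objects)
ok = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="hi"))])
filtered = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content=None))])
print(safe_content(ok))        # hi
print(safe_content(filtered))  # (empty response)
```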
Finally, skipping prompt versioning is a mistake you will regret. Prompts are code: they affect output quality just as much as any other logic in your system. When you change a prompt and a metric regresses, you need to be able to roll back. Store prompts in version-controlled files or a database with history, not as bare strings inline in your functions. This becomes critical when multiple team members are all tweaking prompts independently.
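As a minimal sketch, a versioned prompt registry; it is shown in-code for brevity, but the same idea applies to prompt files checked into git or rows in a database:

```python
# Every prompt has an explicit name and version; old versions stay available
PROMPTS = {
    ("summarize", "v1"): "Summarize the following text in 3 bullet points:\n{text}",
    ("summarize", "v2"): "Summarize the following text in one paragraph:\n{text}",
}

def get_prompt(name, version):
    return PROMPTS[(name, version)]

# When v2 regresses a metric, rolling back is a one-line change at the call site
prompt = get_prompt("summarize", "v1").format(text="...")
print(prompt.splitlines()[0])  # Summarize the following text in 3 bullet points:
```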
Building a Wrapper for Easy Switching
Once you've worked with both, you'll notice the patterns differ just enough to be annoying. Here's a simple abstraction that lets you swap providers. This pattern becomes especially valuable in multi-provider architectures, where you might use different providers for different tasks or implement fallback logic that switches to a secondary provider if the primary is unavailable.
```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    def complete(self, prompt, system=None, temperature=0.7, max_tokens=500):
        pass

class OpenAIProvider(LLMProvider):
    def __init__(self, api_key, model="gpt-4"):
        from openai import OpenAI
        self.client = OpenAI(api_key=api_key)
        self.model = model

    def complete(self, prompt, system=None, temperature=0.7, max_tokens=500):
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )
        return response.choices[0].message.content

class AnthropicProvider(LLMProvider):
    def __init__(self, api_key, model="claude-3-5-sonnet-20241022"):
        from anthropic import Anthropic
        self.client = Anthropic(api_key=api_key)
        self.model = model

    def complete(self, prompt, system=None, temperature=0.7, max_tokens=500):
        kwargs = {}
        if system:
            kwargs["system"] = system  # only pass system when provided
        response = self.client.messages.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=max_tokens,
            **kwargs
        )
        return response.content[0].text

# Usage
provider = OpenAIProvider(OPENAI_API_KEY)
result = provider.complete("Why is the sky blue?", system="You are a scientist.")

# Switch to Anthropic by changing one line
provider = AnthropicProvider(ANTHROPIC_API_KEY)
result = provider.complete("Why is the sky blue?", system="You are a scientist.")
```

This abstraction hides the API differences. Your application calls .complete(), and you choose the provider at runtime. Extend this pattern by adding cost tracking, retry logic, and logging inside each provider's complete method so those concerns are handled uniformly regardless of which backend you are using.
Practical Tips for Production
1. Validate API responses before accessing them:
```python
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

# WRONG: direct access
# text = response.choices[0].message.content

# RIGHT: validate first
if response.choices and response.choices[0].message.content is not None:
    text = response.choices[0].message.content
else:
    text = None  # Handle gracefully
```

Models occasionally return empty choices (rare, but it happens), and content can be None after content filtering. Check before indexing. A defensive approach to response parsing will save you from hard-to-reproduce crashes in production.
2. Log everything for future debugging:
```python
import logging
import json

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def api_call_with_logging(prompt, model="gpt-4"):
    logger.info(f"API call starting. Model: {model}, Prompt length: {len(prompt)}")
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200
        )
        logger.info(json.dumps({
            "event": "api_success",
            "model": model,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
            "cost_usd": (response.usage.prompt_tokens / 1000 * 0.03 +
                         response.usage.completion_tokens / 1000 * 0.06)
        }))
        return response.choices[0].message.content
    except Exception as e:
        logger.error(f"API call failed: {type(e).__name__}: {e}")
        raise
```

Structured logs let you aggregate costs, track error rates, and debug issues.
3. Use timeouts to prevent hanging:
```python
# OpenAI client with timeout
from openai import OpenAI, APITimeoutError

client = OpenAI(api_key=OPENAI_API_KEY, timeout=30.0)  # 30-second timeout

try:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Long computation..."}]
    )
except APITimeoutError:
    logger.error("API request timed out after 30 seconds")
    # Handle gracefully
```

4. Cache responses to save costs and latency:
```python
from functools import lru_cache

@lru_cache(maxsize=128)
def get_completion(prompt, model="gpt-4"):
    # lru_cache keys on the arguments, so identical prompts skip the API
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200
    )
    return response.choices[0].message.content

# First call: hits API
result1 = get_completion("What is Python?")
# Second call: cached, instant
result2 = get_completion("What is Python?")
```

For production, use Redis or another distributed cache, keyed on a hash of the model plus prompt.
5. Model selection: cost vs. capability:
| Model | Cost | Speed | Reasoning | Best For |
|---|---|---|---|---|
| gpt-3.5-turbo | $ | Fast | Basic | Development, high-volume tasks |
| gpt-4 | $$$$ | Slow | Excellent | Complex tasks, quality-critical |
| claude-3-5-sonnet | $$ | Fast | Good | General-purpose, balanced |
| claude-3-opus | $$$$ | Slow | Excellent | Difficult reasoning, highest quality |
During development, use cheap models. Validate logic. For production, upgrade selectively.
6. Handle partial failures gracefully:
```python
import queue

retry_queue = queue.Queue()  # drained later by a background worker

def process_batch_with_fallback(prompts):
    results = []
    for prompt in prompts:
        try:
            # call_api_with_timeout wraps the client call with a per-request timeout
            result = call_api_with_timeout(prompt, timeout=10)
            results.append(result)
        except APITimeoutError:
            # Fallback to cheaper, faster model
            logger.warning("GPT-4 timeout. Falling back to GPT-3.5...")
            result = call_api_with_timeout(
                prompt, model="gpt-3.5-turbo", timeout=5
            )
            results.append(result)
        except RateLimitError:
            # Rate limited; add to retry queue
            logger.error("Rate limited. Adding to retry queue.")
            retry_queue.put(prompt)
            results.append(None)
    return results
```

Fallback strategies keep your system resilient.
Migration Checklist: OpenAI → Anthropic (or vice versa)
If you're switching providers:

- Update API key configuration
- Change client initialization (`OpenAI` → `Anthropic`)
- Adjust message structure (system prompt location, message roles)
- Update response parsing (`response.choices[0].message.content` → `message.content[0].text`)
- Test streaming (both work, but iteration differs slightly)
- Recalibrate temperature/max_tokens (models respond differently)
- Verify tool calling/structured output format
- Adjust cost estimation (different pricing per token)
- Update error handling (exception names differ)
Summary
Integrating LLMs into your Python applications is one of the highest-leverage skills you can develop right now. The patterns are learnable, the APIs are well-documented, and the capabilities you unlock, from natural language interfaces to autonomous tool-using agents, are genuinely transformative for what you can build.
Both OpenAI and Anthropic have earned their places in the ecosystem. OpenAI's API ecosystem is mature with broad tooling support, a massive community, and models that excel at instruction-following and multimodal tasks. Anthropic's Claude models bring strong reasoning capabilities and a design philosophy that makes complex system prompts and structured outputs feel natural. Neither is strictly better; they complement each other, and the developers who understand both will make better architectural decisions than those who default to one provider out of familiarity.
The production patterns covered here (retry logic with exponential backoff, token counting, cost tracking, async batch processing, the provider abstraction wrapper) are not optional polish. They are the difference between a demo and a system that runs reliably at scale. Start with the basics, get something working, and then layer in resilience, observability, and cost management before you hit production.
Build with both. Measure what matters. Keep your max_tokens honest. Store your prompts in version control. And when a model starts returning something unexpected at 2 AM, you will be glad you set up structured logging.
Ready to go deeper? The next article covers building AI agents with LangChain, where these APIs become components in intelligent systems that plan, reason, and take multi-step actions in the world.