July 11, 2025
Python CSV Excel YAML

CSV, Excel, and YAML Processing in Python

You work with data every day. Sometimes it arrives as a CSV file from a legacy system. Sometimes it's an Excel spreadsheet from your finance team. Sometimes it's a YAML config file that's been manually edited by three different people. Each format has different gotchas, different strengths, and different reasons to exist.

This article dives deep into how you actually handle these formats in production code: not just the happy path, but the edge cases that bite you at 2 AM. We'll cover the standard library csv module, openpyxl for Excel, and PyYAML with all its security implications. By the end, you'll know when to use each one and how to handle the weird stuff that inevitably happens with real-world data.

Before we dive in, it's worth taking a moment to understand what we're really talking about here. CSV, Excel, and YAML are not competing formats trying to solve the same problem; each occupies a distinct niche in the data ecosystem, and understanding those niches is what separates developers who write fragile file-handling code from developers who write systems that survive contact with the real world. The patterns covered in this article show up constantly once you start building data pipelines, automating reports, or wiring together disparate systems that were never designed to talk to each other. We will look at not just the mechanics of each format, but the reasoning behind the tools, the traps that catch experienced developers off guard, and the production patterns that keep your code robust when the input data is anything but clean. By the time you finish this article, you will have a mental model for file processing that carries forward into databases, APIs, and beyond.

Table of Contents
  1. Why These Three Formats Still Matter
  2. Choosing the Right Format
  3. The CSV Module: Built-In Power (and Pitfalls)
  4. Basic Reading: csv.reader vs csv.DictReader
  5. Writing CSV Files: DictWriter for Sanity
  6. CSV Edge Cases That Will Bite You
  7. Excel Files: Beyond CSV with openpyxl
  8. Basic Reading and Writing
  9. Working with Multiple Sheets
  10. Named Ranges and Cell Addressing
  11. Formatting and Formulas
  12. Reading from Large Excel Files
  13. Excel Automation Patterns
  14. YAML: Configuration and Human-Readable Data
  15. Basic YAML Reading and Writing
  16. Writing YAML
  17. YAML Anchors and Aliases (DRY Configuration)
  18. Round-Trip YAML with ruamel.yaml
  19. YAML Multiline Strings
  20. Choosing Between CSV, Excel, and YAML
  21. Practical Example: Building a Data Pipeline
  22. Advanced CSV Techniques: Performance and Streaming
  23. Memory-Efficient Reading
  24. Custom Field Conversion
  25. Advanced Excel Techniques
  26. Data Validation and Conditional Formatting
  27. Merging Cells and Complex Layouts
  28. Extracting Data from Excel Formulas
  29. Advanced YAML: Structuring Large Configs
  30. Splitting YAML Across Multiple Files
  31. Custom YAML Deserialization
  32. Environment Variable Interpolation in YAML
  33. Common File Processing Mistakes
  34. Error Handling and Resilience
  35. Summary

Why These Three Formats Still Matter

Before we code, let's be honest about why these formats exist when JSON and XML are "standards."

CSV is universal. Your accountant uses Excel, which exports CSV. Your database exports to CSV. Third-party APIs often default to CSV downloads. It's human-readable, simple, and requires zero dependencies.

Excel (.xlsx files) is where business lives. Reports get built in Excel. Stakeholders send you Excel files with manual edits highlighted in yellow. Yes, it's more complex than CSV, but it handles multiple sheets, named ranges, formulas, and formatting that CSV can't touch.

YAML solves the problem of human-editable configuration. It's more readable than JSON, handles multiline strings without escaping, and supports comments. Kubernetes, Docker Compose, and Ansible all use YAML because it's easier for humans to write and reason about than JSON.

The hidden layer here: you need all three because your users, systems, and stakeholders all speak different languages.

Choosing the Right Format

One of the most consequential decisions you make when designing a data pipeline or a configuration system is which file format to use. It sounds mundane, but choosing the wrong format creates friction that compounds over the lifetime of a project: developers who inherit your code struggle to edit it, stakeholders cannot open the files you generate, and data gets silently corrupted in transit between systems that speak different dialects of the same format.

Here is the decision framework that works in practice. Start by asking who will interact with the file. If the primary audience is a non-technical stakeholder who lives in Microsoft Office, Excel is the right answer, not because it is technically superior, but because it removes friction for the person who matters most. If the file will be machine-generated and machine-consumed with no human in the loop, CSV wins on simplicity and universality. If a developer or system administrator needs to read and edit configuration values directly, YAML gives you comments, readability, and the ability to explain settings inline in a way that JSON simply cannot match.

Next, ask what the data looks like. Flat tabular data with rows and columns is the natural home of CSV. Hierarchical or nested structures belong in YAML (for config) or JSON (for APIs). Data that combines tabular content with rich formatting, multiple sheets, or Excel-specific features like pivot tables belongs in Excel. If you are processing data that will eventually feed into a machine learning pipeline, CSV and Parquet are your workhorses; Excel adds overhead and opacity that slows down iteration. For configuration that controls how your pipeline behaves, YAML keeps your settings readable and version-control friendly in a way that a Python dict buried in a config.py file never quite achieves. Understanding these boundaries up front saves you from the painful experience of migrating a production system from one format to another six months after launch.

The CSV Module: Built-In Power (and Pitfalls)

Python's csv module is in the standard library, which means you have zero excuses not to use it properly. But many developers treat it like open() with .split(','), which is how you end up with corrupted data at 2 AM.

Basic Reading: csv.reader vs csv.DictReader

Let's start with the fundamentals. Here's the naive approach you've probably seen:

The temptation to split on commas is understandable; CSV stands for "comma-separated values," after all. But real CSV files are more subtle than the name implies. Fields can contain commas as long as they are quoted, newlines can appear inside quoted fields, and quote characters themselves can appear when doubled. The naive split approach handles none of this. The moment someone puts an address like "123 Main St, Apt 4" in your CSV, your parser falls apart.

python
import csv
 
# DON'T DO THIS
with open('data.csv') as f:
    for line in f:
        fields = line.strip().split(',')
        print(fields)

This breaks immediately when you have commas inside quoted fields. Here's the correct way:

python
import csv
 
# DO THIS - csv.reader
with open('data.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)  # row is a list

The csv.reader handles quoted fields, escaped commas, and line breaks inside cells automatically. But if your CSV has a header row (which most do), you probably want csv.DictReader:

python
import csv
 
with open('data.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)  # row is a dict like {'name': 'Alice', 'age': '30'}
        print(row['name'])

Key difference: csv.reader gives you lists (positional access), csv.DictReader gives you dicts (named access). For most real-world code, DictReader is cleaner because you're not constantly looking up "is the email in column 2 or column 3?" The named-access approach also makes your code self-documenting and far more resilient to column reordering, a problem that bites teams who receive data from external partners who occasionally restructure their exports.

Writing CSV Files: DictWriter for Sanity

Now let's write data back out. Here's where people create corrupted files:

The same principle that makes manual splitting dangerous also makes manual joining dangerous. The moment you write your own CSV serializer using string formatting, you are one weird data value away from producing a file that other tools cannot parse. Think about what happens when a user's name is "O'Brien, Jr.": your hand-rolled formatter produces a row with an extra field that breaks every downstream consumer.

python
import csv
 
data = [
    {'name': 'Alice', 'email': 'alice@example.com'},
    {'name': 'Bob', 'email': 'bob@example.com'},
]
 
# WRONG - this will corrupt if email contains commas
with open('output.csv', 'w') as f:
    for row in data:
        f.write(f"{row['name']},{row['email']}\n")

Use csv.DictWriter:

python
import csv
 
data = [
    {'name': 'Alice', 'email': 'alice@example.com'},
    {'name': 'Bob', 'email': 'bob@example.com'},
]
 
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'email'])
    writer.writeheader()
    writer.writerows(data)

Important: pass newline='' when opening the file. The csv module relies on this to handle line terminators correctly across Windows, Mac, and Linux. Forget it and you'll get extra blank lines on some systems. This is one of those Python gotchas that exists for historical reasons: the csv module handles its own line termination, and if Python's universal newline translation also runs, you end up with doubled terminators on Windows that produce those mysterious empty rows.

If you don't want headers (rare but it happens):

python
import csv
 
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'email'])
    # Skip writeheader(); data is the same list of dicts as above
    writer.writerows(data)

CSV Edge Cases That Will Bite You

Real-world CSV files are rarely the clean, UTF-8 encoded, comma-delimited, properly-quoted files that tutorials assume. They come from a dozen different source systems, legacy mainframes, European accounting software, spreadsheet exports from users who have no idea what encoding means, and each one brings its own surprises. Understanding these edge cases before you encounter them in production is the difference between spending twenty minutes on a data import and spending two days debugging mysterious corruption.

The most common surprise is delimiter variation. The name "CSV" implies commas, but semicolons are the de facto standard in many European countries where the comma is already reserved as a decimal separator. Tab-separated values are common in scientific and bioinformatics contexts. Pipe-delimited files show up frequently in legacy enterprise systems. When you receive a CSV file and the first row looks like one giant field, your first debugging step should be to open the file in a text editor and check what character is actually separating the values.
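If you would rather detect the delimiter programmatically than eyeball the file, the standard library's csv.Sniffer can guess it from a sample of the data. A minimal sketch, in which the sample file written at the top is purely illustrative:

```python
import csv

# Illustrative only: write a semicolon-delimited file so the sketch is
# self-contained; in practice data.csv arrives from some external system
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    f.write('name;city\nAlice;Paris\nBob;Berlin\n')

with open('data.csv', encoding='utf-8', newline='') as f:
    sample = f.read(4096)   # sniff a chunk, not the whole file
    f.seek(0)
    # Restricting the candidates makes the guess much more reliable
    dialect = csv.Sniffer().sniff(sample, delimiters=',;\t|')
    reader = csv.DictReader(f, dialect=dialect)
    rows = list(reader)

print(dialect.delimiter)  # ';'
print(rows[0])            # {'name': 'Alice', 'city': 'Paris'}
```

Sniffing is a heuristic, so treat the result as a guess to validate (for example, by checking that the expected header names appear), not as ground truth.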

Encoding is the second great surprise. CSV has no standard way to declare its own encoding; unlike XML with its <?xml version="1.0" encoding="utf-8"?> declaration or HTML with its meta charset tag, a CSV file is just bytes. Windows systems default to the system code page (often cp1252 in English-speaking regions), Excel has its own preferences, and Linux systems default to UTF-8. The result is that a file that looks fine when generated on one system displays garbage characters on another. The safe approach is always to specify the encoding explicitly and to have a fallback strategy when the encoding is unknown.

Case 1: Different delimiters. Not everything uses commas. Semicolon-delimited files are common in European locales where the decimal separator is already a comma.

python
import csv
 
# Reading a semicolon-delimited file
with open('data.csv', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter=';')
    for row in reader:
        print(row)
 
# Writing with custom delimiter
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age'], delimiter=';')
    writer.writeheader()
    writer.writerows(data)

Case 2: Quoting behavior. By default, the csv module only quotes fields that contain special characters. But sometimes you need all fields quoted:

python
import csv
 
# QUOTE_ALL quotes every field
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(
        f,
        fieldnames=['name', 'email'],
        quoting=csv.QUOTE_ALL
    )
    writer.writeheader()
    writer.writerows(data)
 
# Output:
# "name","email"
# "Alice","alice@example.com"

The quoting parameter accepts:

  • csv.QUOTE_MINIMAL (default): only quote when necessary
  • csv.QUOTE_ALL: quote every field
  • csv.QUOTE_NONNUMERIC: quote all non-numeric fields
  • csv.QUOTE_NONE: don't quote anything (and you must escape delimiters manually)
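QUOTE_NONNUMERIC is worth a quick demonstration because it changes behavior on both sides: the writer quotes only string fields, and the reader converts every unquoted field to float. A small round-trip sketch using an in-memory buffer:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(['Alice', 95.5])   # 'Alice' gets quoted, 95.5 does not

buf.seek(0)
reader = csv.reader(buf, quoting=csv.QUOTE_NONNUMERIC)
row = next(reader)
print(row)  # ['Alice', 95.5] -- the unquoted field comes back as a float
```

This gives you a crude form of type round-tripping for free, but note the conversion is always to float, so integers come back as 95.0-style values.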

Case 3: Encoding problems. CSV files from Excel on Windows often use cp1252 encoding, while Linux systems expect utf-8. Mixing these causes corruption:

python
import csv
 
# Reading with explicit encoding
with open('data.csv', encoding='cp1252') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
 
# If you don't know the encoding, try utf-8 first, then fall back.
# Note: iso-8859-1 maps every possible byte, so it never raises a
# UnicodeDecodeError -- keep it last, as a catch-all.
encodings = ['utf-8', 'cp1252', 'iso-8859-1']
for encoding in encodings:
    try:
        with open('data.csv', encoding=encoding) as f:
            reader = csv.DictReader(f)
            rows = list(reader)
        print(f"Successfully read with {encoding}")
        break
    except UnicodeDecodeError:
        continue
else:
    raise ValueError('Could not decode data.csv with any candidate encoding')

Case 4: Handling missing fields. If your CSV is inconsistent (some rows have fewer or more columns than the header), DictReader does not raise an error: short rows get restval as the default for missing fields, and surplus fields are collected into a list under the restkey key:

python
import csv
 
with open('data.csv') as f:
    reader = csv.DictReader(f, restval='N/A', restkey='extra')
    for row in reader:
        print(row)  # Missing fields get 'N/A'; surplus fields land in row['extra']

Case 5: Skip comments and blank rows. Real CSV files often have comments or blank rows:

python
import csv
 
with open('data.csv') as f:
    # Skip blank lines
    reader = csv.DictReader(
        (row for row in f if row.strip()),
        delimiter=','
    )
    for row in reader:
        if row.get('name', '').startswith('#'):  # Skip comment rows
            continue
        print(row)

The generator expression (row for row in f if row.strip()) is an elegant pattern that deserves emphasis: it filters blank lines before the CSV parser ever sees them, without loading the entire file into memory. The same idea extends to the BOM character at the start of UTF-8 files generated by Excel, which shows up as a mysterious \ufeff prefix on the first field name and breaks header matching in subtle ways.
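For Excel-generated BOMs specifically, the simplest fix is usually the utf-8-sig codec, which strips a leading BOM when present and otherwise behaves exactly like utf-8. A self-contained sketch, where the file written at the top merely simulates an Excel export:

```python
import csv

# Simulate an Excel export: utf-8 with a leading byte-order mark
with open('bom_data.csv', 'w', encoding='utf-8-sig', newline='') as f:
    f.write('name,email\nAlice,alice@example.com\n')

# encoding='utf-8-sig' strips the BOM on read, so the first header is
# 'name' rather than '\ufeffname' and DictReader key lookups just work
with open('bom_data.csv', encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f)
    print(reader.fieldnames)  # ['name', 'email']
```

Because utf-8-sig degrades gracefully to plain utf-8 when no BOM is present, it is a reasonable default for any CSV that might have passed through Excel.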

Excel Files: Beyond CSV with openpyxl

Excel files (.xlsx) are where things get more complex. The openpyxl library gives you full control over Excel's capabilities. Unlike CSV, which is just text with delimiters, an .xlsx file is actually a ZIP archive containing a collection of XML files, which is why you cannot open it in a plain text editor and make sense of it. This architecture gives Excel its power (multiple sheets, formatting, formulas, embedded images) but also means you need a proper library rather than string manipulation.

Basic Reading and Writing

Before you can work with openpyxl, install it with pip install openpyxl. The library follows a workbook/worksheet mental model that mirrors how Excel users think about files: a workbook is the file, a worksheet is a tab within that file, and a cell is the intersection of a row and column.

python
from openpyxl import load_workbook
 
# Load an existing workbook
wb = load_workbook('data.xlsx')
ws = wb.active  # Get the active sheet
 
# Iterate through rows
for row in ws.iter_rows(min_row=2, values_only=True):  # Skip header
    print(row)

The values_only=True parameter gives you just the cell values. Without it, you get cell objects with formatting and formulas.

Here's a more realistic example with named access:

python
from openpyxl import load_workbook
 
wb = load_workbook('data.xlsx')
ws = wb.active
 
# Read header
headers = [cell.value for cell in ws[1]]
 
# Read data rows
data = []
for row in ws.iter_rows(min_row=2, values_only=True):
    if any(row):  # Skip completely empty rows
        data.append(dict(zip(headers, row)))
 
for record in data:
    print(record)

The any(row) check is a subtle but important detail: Excel files often contain rows that appear empty to the user but actually carry formatting or previously-deleted content, so checking that at least one cell has a value prevents you from processing ghost rows that produce empty dicts downstream. One caveat: any() treats 0 and empty strings as falsy, so a row whose only value is a literal 0 would also be skipped; use any(cell is not None for cell in row) if that matters for your data.

Writing to Excel:

python
from openpyxl import Workbook
 
wb = Workbook()
ws = wb.active
ws.title = 'Data'
 
# Write header
headers = ['name', 'email', 'age']
ws.append(headers)
 
# Write data
data = [
    ['Alice', 'alice@example.com', 30],
    ['Bob', 'bob@example.com', 25],
]
for row in data:
    ws.append(row)
 
wb.save('output.xlsx')

Notice how ws.append() handles both the header row and each data row identically: it always adds a new row after the current data. This is the cleanest way to build up a sheet incrementally, and it avoids the row-numbering math that causes off-by-one errors when you track the current row manually.

Working with Multiple Sheets

Real Excel files often have multiple sheets. You need to handle them deliberately:

python
from openpyxl import load_workbook
 
wb = load_workbook('data.xlsx')
 
# List all sheets
print(wb.sheetnames)  # ['Sales', 'Expenses', 'Summary']
 
# Access specific sheet
sales_ws = wb['Sales']
expenses_ws = wb['Expenses']
 
# Iterate all sheets
for sheet_name in wb.sheetnames:
    ws = wb[sheet_name]
    print(f"Sheet: {sheet_name}")
    for row in ws.iter_rows(values_only=True):
        print(row)

Creating a new workbook with multiple sheets:

python
from openpyxl import Workbook
 
wb = Workbook()
wb.remove(wb.active)  # Remove default empty sheet
 
# Create sheets
sales_ws = wb.create_sheet('Sales')
expenses_ws = wb.create_sheet('Expenses')
 
# Write data to each sheet
sales_ws.append(['Product', 'Revenue'])
sales_ws.append(['Widget A', 10000])
sales_ws.append(['Widget B', 15000])
 
expenses_ws.append(['Category', 'Amount'])
expenses_ws.append(['Salaries', 50000])
 
wb.save('report.xlsx')

The wb.remove(wb.active) call at the start is a pattern worth remembering. When you create a new Workbook, openpyxl automatically creates one empty sheet called "Sheet". If you are building a multi-sheet report programmatically, that default sheet is dead weight; removing it keeps your output clean and prevents confusion when stakeholders open the file.

Named Ranges and Cell Addressing

Excel's "named ranges" are useful for complex spreadsheets. Here's how to use them:

python
from openpyxl import load_workbook
from openpyxl.workbook.defined_name import DefinedName
 
wb = load_workbook('data.xlsx')
ws = wb.active
 
# Define a named range (dict-style assignment is the openpyxl 3.1+ API;
# older versions used wb.defined_names.append() instead)
ws['A1'] = 'Total'
ws['A2'] = 100
wb.defined_names['summary_total'] = DefinedName(
    'summary_total', attr_text=f'{ws.title}!$A$2'
)
 
# Access a named range
defn = wb.defined_names['summary_total']
print(defn.attr_text)  # 'Sheet!$A$2'
 
# Reading cell by address
value = ws['B5'].value  # Cell B5
value = ws.cell(row=5, column=2).value  # Same thing, row/col format

Both addressing styles, ws['B5'] and ws.cell(row=5, column=2), access exactly the same cell. The letter-based notation is natural when you are working from a known cell reference, while the row/column integer form is better when you are iterating programmatically and calculating positions. You will often mix both styles in the same codebase, and that is fine.

Formatting and Formulas

openpyxl lets you add formatting and formulas:

python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils import get_column_letter
 
wb = Workbook()
ws = wb.active
 
# Add header with formatting
header_font = Font(bold=True, color='FFFFFF')
header_fill = PatternFill(start_color='366092', end_color='366092', fill_type='solid')
header_alignment = Alignment(horizontal='center')
 
for col, header in enumerate(['Name', 'Sales', 'Commission'], 1):
    cell = ws.cell(row=1, column=col)
    cell.value = header
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = header_alignment
 
# Add data with formula
ws['A2'] = 'Alice'
ws['B2'] = 100000
ws['C2'] = '=B2*0.1'  # Formula: 10% commission
 
ws['A3'] = 'Bob'
ws['B3'] = 75000
ws['C3'] = '=B3*0.1'
 
wb.save('report.xlsx')

Formulas in openpyxl are just strings that start with =. The library does not evaluate them; it writes them to the file exactly as you specify, and Excel evaluates them when the file is opened. This means you can write any valid Excel formula, but you cannot read back the computed result without separately loading the file with data_only=True.
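To make the data_only behavior concrete, here is a small round-trip sketch. Note the second half: a file written by openpyxl and never opened in Excel has no cached formula results, so data_only=True yields None rather than the computed number.

```python
from openpyxl import Workbook, load_workbook

# Write a formula, then load the file back both ways
wb = Workbook()
ws = wb.active
ws['B2'] = 100000
ws['C2'] = '=B2*0.1'
wb.save('formula_demo.xlsx')

# Default load returns the formula string exactly as written
wb_formulas = load_workbook('formula_demo.xlsx')
print(wb_formulas.active['C2'].value)  # '=B2*0.1'

# data_only=True returns the last value Excel cached for the cell;
# Excel has never opened this file, so there is no cached value yet
wb_values = load_workbook('formula_demo.xlsx', data_only=True)
print(wb_values.active['C2'].value)  # None
```

In production pipelines this is a common gotcha: if your data_only load returns None for every formula cell, the file simply has not been saved by Excel since the formulas were written.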

Reading from Large Excel Files

If you're dealing with huge Excel files, use read_only=True to stream data instead of loading everything into memory:

python
from openpyxl import load_workbook
 
# Use read_only mode for huge files
wb = load_workbook('huge_file.xlsx', read_only=True)
ws = wb.active
 
for row in ws.iter_rows(values_only=True):
    print(row)
    # Process each row as it streams; the whole file is never in memory
 
wb.close()  # Read-only workbooks keep the file handle open until closed

The performance difference between normal and read-only mode becomes dramatic above a few thousand rows. In normal mode, openpyxl loads the entire workbook into memory as a tree of Python objects; a 100,000-row file can consume gigabytes of RAM and take minutes to load. In read-only mode, it streams the underlying XML and yields rows one at a time, keeping memory usage essentially constant regardless of file size. The tradeoff is that you cannot modify cells in read-only mode, but for data extraction pipelines you almost never need to.

Excel Automation Patterns

Beyond basic reading and writing, Excel automation in production systems follows a set of recurring patterns that are worth having in your toolkit. The most common scenario is report generation: you receive data from a database or API, and you need to produce an Excel report that stakeholders can work with in their existing workflows. The key insight here is that a well-generated Excel file does more than hold data: it guides the user by highlighting important values, providing dropdown validation to prevent data entry errors, and using named sheets to organize complex information into a navigable structure.

A pattern that scales well is the template-based approach: maintain a hand-crafted Excel template with all your formatting, named ranges, and formulas already set up, then use openpyxl to open the template and fill in the data cells. This separates the design concern (which belongs to whoever owns the report format) from the data concern (which belongs to your pipeline). When stakeholders want to change the report layout, they modify the template file, not your code. When you need to update the underlying data logic, you modify the Python script, not the template.
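A minimal sketch of the template approach, assuming a hypothetical monthly_report_template.xlsx with a 'Data' sheet whose rows start below a header. Purely so the sketch is runnable, the code builds a bare stand-in template when the file is absent; in practice the template is a hand-crafted file maintained by whoever owns the report format.

```python
from pathlib import Path

from openpyxl import Workbook, load_workbook

# Hypothetical template file; build a minimal stand-in if it doesn't exist
template = Path('monthly_report_template.xlsx')
if not template.exists():
    tmp = Workbook()
    tmp.active.title = 'Data'
    tmp.active.append(['Product', 'Revenue'])
    tmp.save(template)

# The actual pattern: open the template, fill in only the data cells,
# and save under a new name so the template itself is never modified
wb = load_workbook(template)
ws = wb['Data']
rows = [('Widget A', 10000), ('Widget B', 15000)]
for i, (product, revenue) in enumerate(rows, start=2):
    ws.cell(row=i, column=1, value=product)
    ws.cell(row=i, column=2, value=revenue)

wb.save('monthly_report_filled.xlsx')
```

Because the script only touches data cells, any formatting, formulas, and named ranges already present in the real template survive untouched in the output.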

Another important pattern is batch generation. If you need to produce dozens or hundreds of personalized Excel reports (one per sales rep, one per client, one per region), the pattern of loading a base workbook, cloning or filling it, and saving to a new filename is exactly what you want. Combine this with Python's multiprocessing module and you can generate hundreds of reports in parallel, each in a separate process to avoid openpyxl's non-thread-safe internals.

Error recovery is a third pattern that matters in production. When generating large reports, individual rows may contain data that causes formatting issues: a date value that is actually a string, a number stored as text, a cell reference that points to a non-existent sheet. Build your Excel-writing code to catch and log these failures per row rather than letting a single bad record abort the entire report generation. The user who receives a report with one missing row is far happier than the user who receives no report at all.
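A sketch of the per-row error-recovery idea, using hypothetical records in which one amount is unparseable; the bad row is logged and skipped while the rest of the report still gets written:

```python
import logging

from openpyxl import Workbook

logging.basicConfig(level=logging.WARNING)

# Hypothetical records; the second one has an unparseable amount
records = [
    {'name': 'Alice', 'amount': '100.0'},
    {'name': 'Bob', 'amount': 'n/a'},
    {'name': 'Carol', 'amount': '250.0'},
]

wb = Workbook()
ws = wb.active
ws.append(['Name', 'Amount'])

written = failed = 0
for i, rec in enumerate(records, start=1):
    try:
        # Convert per row so one bad value cannot abort the whole report
        ws.append([rec['name'], float(rec['amount'])])
        written += 1
    except (KeyError, TypeError, ValueError) as exc:
        logging.warning("Skipping record %d (%r): %s", i, rec, exc)
        failed += 1

wb.save('resilient_report.xlsx')
print(f"{written} rows written, {failed} skipped")  # 2 rows written, 1 skipped
```

The counts give you a cheap quality signal: if the skipped count suddenly spikes, the upstream data source changed shape and deserves a look before anyone trusts the report.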

YAML: Configuration and Human-Readable Data

YAML is designed for humans to write and read. It's less verbose than JSON and supports comments, but it has security implications you must understand. The format's power comes from its indentation-based structure: there are no curly braces or square brackets cluttering the visual hierarchy, just indentation and colons. This makes it exceptionally readable for nested configuration, but it also means that indentation errors produce subtle bugs that are hard to debug without a proper YAML linter.

Basic YAML Reading and Writing

python
import yaml
 
# Read YAML
with open('config.yaml') as f:
    config = yaml.safe_load(f)
 
print(config)
# {'database': {'host': 'localhost', 'port': 5432}, 'debug': True}

Here's the most important rule: always use safe_load(), never load().

The reason: yaml.load() can execute arbitrary Python code embedded in the YAML file. This is a critical security vulnerability. If you load YAML from untrusted sources (user uploads, external APIs), an attacker can inject code that gets executed:

python
import yaml
 
# DANGEROUS - never do this with untrusted YAML
dangerous_yaml = """
!!python/object/apply:os.system
args: ['rm -rf /']
"""
 
config = yaml.load(dangerous_yaml, Loader=yaml.Loader)  # Executes rm -rf /

Always use safe_load():

python
import yaml
 
with open('config.yaml') as f:
    config = yaml.safe_load(f)  # Safe. Won't execute code.

safe_load() only constructs simple Python objects (dicts, lists, strings, numbers, and the like) and raises an error for anything else. This is not just a best practice; it is the difference between a config parser and a remote code execution vulnerability. Even if you trust the source of your YAML today, using safe_load() as a habit means you never accidentally use the wrong function when the data source changes.
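You can verify this behavior directly: feeding the same style of malicious payload to safe_load() raises a constructor error instead of executing anything. A quick sketch:

```python
import yaml

malicious = """
!!python/object/apply:os.system
args: ['echo pwned']
"""

try:
    yaml.safe_load(malicious)
    result = 'loaded'
except yaml.YAMLError as exc:
    # safe_load refuses the python/object tag outright
    result = type(exc).__name__

print(result)  # 'ConstructorError' -- nothing was executed
```

The exception fires at parse time, before any object is constructed, which is exactly the failure mode you want for untrusted input.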

Writing YAML

python
import yaml
 
config = {
    'database': {
        'host': 'localhost',
        'port': 5432,
        'credentials': {
            'user': 'admin',
            'password': 'secret123'
        }
    },
    'debug': True,
    'allowed_hosts': ['localhost', '127.0.0.1']
}
 
with open('config.yaml', 'w') as f:
    yaml.dump(config, f, default_flow_style=False)
 
# Output:
# allowed_hosts:
# - 127.0.0.1
# - localhost
# database:
#   credentials:
#     password: secret123
#     user: admin
#   host: localhost
#   port: 5432
# debug: true

Notice that yaml.dump() sorts keys alphabetically by default. This is generally desirable for consistency, but if the order of keys carries semantic meaning, or if you want to preserve the order in which you defined the config, pass sort_keys=False. The default_flow_style=False parameter tells PyYAML to use block style (one item per line, with indentation) rather than the inline JSON-like flow style; block style is almost always more readable for configuration files.
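A quick sketch of the difference (sort_keys is available in PyYAML 5.1 and later):

```python
import yaml

config = {'zeta': 1, 'alpha': 2}

# Default: keys come out alphabetically, regardless of insertion order
print(yaml.dump(config, default_flow_style=False))
# alpha: 2
# zeta: 1

# sort_keys=False preserves the dict's insertion order
print(yaml.dump(config, default_flow_style=False, sort_keys=False))
# zeta: 1
# alpha: 2
```

Preserving insertion order is usually the right call for configs kept under version control, since it keeps diffs aligned with the order the author chose.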

YAML Anchors and Aliases (DRY Configuration)

YAML's power comes from anchors and aliases, which let you reuse content:

yaml
# config.yaml
default_settings: &default
  timeout: 30
  retries: 3
  logging: true
 
development:
  <<: *default
  debug: true
 
production:
  <<: *default
  debug: false
  timeout: 60

Reading this:

python
import yaml
 
with open('config.yaml') as f:
    config = yaml.safe_load(f)
 
print(config['development'])
# {'timeout': 30, 'retries': 3, 'logging': True, 'debug': True}
 
print(config['production'])
# {'timeout': 60, 'retries': 3, 'logging': True, 'debug': False}

The << key is YAML's merge key: it merges the anchored mapping's values into the current object. Anchors and aliases help you avoid repetition and make configuration maintenance easier. When you see a setting repeated across multiple environments in a YAML file, that is your signal to extract it into an anchor: if you need to change the default retry count from 3 to 5, you want to make that change in exactly one place.

Round-Trip YAML with ruamel.yaml

If you need to preserve comments, formatting, and anchor order when reading and writing YAML (useful for user-edited config files), use ruamel.yaml:

python
from ruamel.yaml import YAML
 
yaml = YAML()
yaml.preserve_quotes = True
yaml.default_flow_style = False
 
# Read YAML
with open('config.yaml') as f:
    config = yaml.load(f)
 
# Modify it
config['new_key'] = 'new_value'
 
# Write it back with comments and formatting preserved
with open('config.yaml', 'w') as f:
    yaml.dump(config, f)

ruamel.yaml is especially useful when:

  • Users manually edit YAML files and you want to preserve their comments
  • You're building tools that need to modify config files without destroying human annotations
  • You care about preserving the exact key order and layout as written (PyYAML re-serializes from plain dicts and sorts keys by default)

The distinction between PyYAML and ruamel.yaml matters most when you are building developer tools: CLI configuration managers, deployment scripts that update config files, or IDE plugins that modify settings. In those contexts, blowing away a user's carefully written comments the first time your tool touches their file will generate justified complaints. The ruamel.yaml approach treats the YAML file as a document to be preserved rather than a data structure to be parsed and re-serialized.

YAML Multiline Strings

YAML handles multiline strings beautifully, unlike JSON:

yaml
# Using |, YAML preserves newlines
description: |
  This is a long description
  that spans multiple lines.
  Each line is preserved.
 
# Using >, YAML folds newlines into spaces
summary: >
  This is a long summary
  that spans multiple lines.
  Newlines become spaces.
 
# Using |-, strip final newline
code: |-
  def hello():
      print("world")

Reading multiline YAML:

python
import yaml
 
yaml_text = """
description: |
  Line 1
  Line 2
  Line 3
"""
 
config = yaml.safe_load(yaml_text)
print(repr(config['description']))
# 'Line 1\nLine 2\nLine 3\n'

The | vs > distinction matters when you are storing content that will be displayed to users. Use | for anything where line breaks are meaningful: code snippets, error messages, addresses, poetry. Use > when you have long prose that you want to wrap for readability in the config file but display as a single paragraph at runtime. The |- and >- variants strip the trailing newline, which is what you usually want when storing code snippets or values that will be compared against strings that do not end in a newline.
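A quick sketch showing how the block-scalar styles behave once loaded:

```python
import yaml

doc = """
keep: |-
  def hello():
      print("world")
fold: >
  wrapped in the file,
  one paragraph at runtime.
"""

data = yaml.safe_load(doc)

# |- keeps internal newlines but strips the trailing one
print(repr(data['keep']))  # 'def hello():\n    print("world")'

# > folds line breaks into spaces; the trailing newline survives
print(repr(data['fold']))  # 'wrapped in the file, one paragraph at runtime.\n'
```

Checking the loaded value with repr() like this is the fastest way to debug a block scalar that is not producing the string you expected.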

Choosing Between CSV, Excel, and YAML

Now that you know how each works, when should you use which?

Use CSV when:

  • Exchanging data with legacy systems
  • Storing simple tabular data
  • You need universal compatibility (Excel, Python, command-line tools)
  • File size matters (CSV is smaller than Excel)
  • You don't need multiple sheets or complex formatting

Use Excel when:

  • Stakeholders are working in Excel (common in business)
  • You need multiple sheets, named ranges, or complex structure
  • Formatting and visual presentation matter
  • Users will edit the file manually and expect Excel features
  • You're dealing with financial or audit-critical data (Excel's built-in functions are auditable)

Use YAML when:

  • You're storing configuration (Kubernetes, Docker, Ansible all use YAML)
  • You want human-readable format with comments
  • You need complex nested structures but don't want JSON verbosity
  • You're building tools that users will manually edit
  • You control the source (security: never parse untrusted YAML with load())

Bonus: JSON

  • When you need language-agnostic APIs
  • When you're storing structured data in databases
  • When you need schema validation (JSON Schema)

Bonus: TOML

  • For simple config files (Python's pyproject.toml uses this)
  • When you want something between INI and YAML in complexity

Practical Example: Building a Data Pipeline

Here's a real-world example that brings it all together. Imagine you need to:

  1. Read sales data from a CSV file
  2. Validate it using a YAML config file
  3. Write cleaned data to Excel for stakeholders

This three-format pipeline is more common than you might think. The CSV comes from an automated export, a database dump, an API response, or a legacy system feed. The YAML carries the business rules that determine what "valid" means, kept in a config file so non-programmers can update thresholds without touching code. The Excel output goes to the people who will act on the data, formatted in a way that makes their job easier. Each format is in its natural home.

python
import csv
import yaml
from openpyxl import Workbook
 
# Step 1: Load validation config
with open('validation.yaml') as f:
    config = yaml.safe_load(f)
 
# validation.yaml contains:
# required_fields: ['date', 'product', 'amount']
# amount_range: [0, 1000000]
 
required_fields = config['required_fields']
min_amount, max_amount = config['amount_range']
 
# Step 2: Read and validate CSV
errors = []
valid_records = []
 
with open('sales.csv', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row_num, row in enumerate(reader, start=2):
        # Check required fields
        if not all(row.get(field) for field in required_fields):
            errors.append(f"Row {row_num}: Missing required field")
            continue
 
        # Validate amount
        try:
            amount = float(row['amount'])
            if not (min_amount <= amount <= max_amount):
                errors.append(f"Row {row_num}: Amount {amount} out of range")
                continue
        except ValueError:
            errors.append(f"Row {row_num}: Invalid amount format")
            continue
 
        valid_records.append(row)
 
# Step 3: Write to Excel
wb = Workbook()
ws = wb.active
ws.title = 'Sales Data'
 
# Header
headers = required_fields + ['status']
ws.append(headers)
 
# Data
for record in valid_records:
    row_data = [record[field] for field in required_fields] + ['Valid']
    ws.append(row_data)
 
wb.save('cleaned_sales.xlsx')
 
# Report
print(f"Processed: {len(valid_records)} valid, {len(errors)} errors")
for error in errors:
    print(f"  {error}")

The hidden layer: this pipeline decouples validation logic (YAML config) from business logic (Python code). A non-programmer can modify validation rules in the YAML file without touching code. This separation of concerns is what makes systems maintainable. When the business decides that the maximum valid amount should be two million instead of one million, they change one number in a YAML file, not a pull request, not a deployment, not a developer ticket.

Advanced CSV Techniques: Performance and Streaming

When you're dealing with millions of rows, memory matters. The csv module's generator-based design is already efficient, but you need to know how to push it further.

Memory-Efficient Reading

Don't load everything into memory. Process rows as they come:

python
import csv
 
# WRONG - loads entire file into memory
with open('huge_file.csv') as f:
    all_data = list(csv.DictReader(f))  # 500GB file? Welcome to swapped memory!
 
# RIGHT - processes one row at a time
with open('huge_file.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Process immediately, don't accumulate
        process_row(row)

If you do need to accumulate (rare but it happens), use generators:

python
import csv
 
def read_csv_batches(filename, batch_size=1000):
    """Read CSV in batches to balance memory vs performance."""
    with open(filename) as f:
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
 
# Use it
for batch in read_csv_batches('huge_file.csv'):
    # Process batch of 1000 rows
    insert_into_database(batch)

The batch approach is a common pattern for database ingestion. Processing rows one at a time with a separate database insert per row is extremely slow due to network round-trip overhead. Processing the entire file at once risks memory exhaustion for large files. The batch approach splits the difference: gather one thousand rows, insert them in a single database transaction, release the memory, and continue. The batch size is a tunable parameter that you adjust based on your memory constraints and database performance characteristics.

Custom Field Conversion

The csv module gives you strings. Converting to proper types requires manual work. Use a conversion function:

python
import csv
from datetime import datetime
 
CONVERTERS = {
    'amount': float,
    'quantity': int,
    'date': lambda x: datetime.strptime(x, '%Y-%m-%d'),
    'is_active': lambda x: x.lower() in ['true', 'yes', '1'],
}
 
def convert_row(row, converters):
    """Convert row values using mapping."""
    for field, converter in converters.items():
        if field in row and row[field]:
            try:
                row[field] = converter(row[field])
            except (ValueError, TypeError) as e:
                row[field] = None  # or raise, depending on your needs
    return row
 
with open('data.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        row = convert_row(row, CONVERTERS)
        print(f"Amount: {row['amount']} (type: {type(row['amount'])})")

The CONVERTERS dictionary approach is more flexible than hardcoding type conversions inline because you can pass different converter maps for different file formats. It also centralizes the type conversion logic so you can test it independently of the file reading logic. If you are building a system that ingests data from multiple sources with similar but slightly different schemas, you can maintain one converter map per source and swap them at runtime.
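For example, two hypothetical sources with the same logical schema but different date conventions could each get their own map; the source names and formats here are invented for illustration:

```python
from datetime import datetime

# Hypothetical per-source converter maps: same logical schema,
# different date conventions per upstream system
CONVERTERS_BY_SOURCE = {
    'legacy_erp': {
        'amount': float,
        'date': lambda x: datetime.strptime(x, '%d/%m/%Y'),
    },
    'web_export': {
        'amount': float,
        'date': lambda x: datetime.strptime(x, '%Y-%m-%d'),
    },
}

def converters_for(source):
    """Pick the converter map for a given upstream source."""
    return CONVERTERS_BY_SOURCE[source]

row = {'amount': '12.50', 'date': '2025-07-11'}
conv = converters_for('web_export')
row = {k: conv[k](v) if k in conv else v for k, v in row.items()}
print(row['date'].year)  # 2025
```

The reading code stays identical across sources; only the map you pass to convert_row changes.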

Advanced Excel Techniques

Data Validation and Conditional Formatting

openpyxl lets you add Excel's built-in features programmatically:

python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation
from openpyxl.formatting.rule import CellIsRule
from openpyxl.styles import PatternFill
 
wb = Workbook()
ws = wb.active
 
# Add headers
headers = ['Name', 'Email', 'Status']
ws.append(headers)
 
# Data validation: Status must be from a list
dv = DataValidation(
    type='list',
    formula1='"Active,Inactive,Pending"',
    allow_blank=True
)
dv.error = 'Please select: Active, Inactive, or Pending'
dv.errorTitle = 'Invalid Entry'
ws.add_data_validation(dv)
dv.add('C2:C1000')  # Apply to column C, rows 2-1000
 
# Conditional formatting: Highlight blank emails
red_fill = PatternFill(start_color='FF0000', end_color='FF0000', fill_type='solid')
red_rule = CellIsRule(operator='equal', formula=['""'], fill=red_fill)
ws.conditional_formatting.add('B2:B1000', red_rule)
 
wb.save('validated.xlsx')

Adding dropdown validation to a column transforms a data-entry spreadsheet from a blank grid into a guided form. Users see the allowed values, get an error message when they type something invalid, and the data your downstream system receives is clean. Conditional formatting, like highlighting blank required fields in red, gives users immediate visual feedback about what still needs to be filled in. These features take your report from "here is a file" to "here is a tool."

Merging Cells and Complex Layouts

For reports with merged headers and custom layouts:

python
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment
 
wb = Workbook()
ws = wb.active
 
# Merge title cells
ws.merge_cells('A1:D1')
title_cell = ws['A1']
title_cell.value = 'Monthly Sales Report'
title_cell.font = Font(size=16, bold=True)
title_cell.alignment = Alignment(horizontal='center', vertical='center')
 
# Skip a row
ws['A3'].value = 'Region'
ws['B3'].value = 'Revenue'
ws['C3'].value = 'Growth %'
ws['D3'].value = 'Status'
 
# Add data
ws['A4'] = 'North'
ws['B4'] = 100000
ws['C4'] = 0.15
ws['D4'] = 'On Track'
 
# Set column widths
ws.column_dimensions['A'].width = 15
ws.column_dimensions['B'].width = 15
ws.column_dimensions['C'].width = 15
ws.column_dimensions['D'].width = 15
 
wb.save('report.xlsx')

Setting column widths programmatically is one of those small touches that separates a professional-looking report from a developer's raw data dump. The default column width in Excel is just wide enough for a few characters; without explicit width settings, your headers will be truncated and users will need to manually resize every column before they can read the data. The ws.column_dimensions['A'].width = 15 pattern lets you set widths to match your content, and you can calculate appropriate widths dynamically based on the maximum content length in each column.
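A minimal sketch of that dynamic sizing, fitting each column to its widest cell plus a little padding (not tuned for fonts or merged cells):

```python
from openpyxl import Workbook
from openpyxl.utils import get_column_letter

wb = Workbook()
ws = wb.active
ws.append(['Region', 'Revenue', 'Growth %'])
ws.append(['North', 100000, 0.15])

# Size each column to its widest cell, with 2 characters of padding
for col_idx, column in enumerate(ws.iter_cols(), start=1):
    longest = max(
        (len(str(cell.value)) for cell in column if cell.value is not None),
        default=8,
    )
    ws.column_dimensions[get_column_letter(col_idx)].width = longest + 2

wb.save('auto_width.xlsx')
```

Character count is only an approximation of display width (Excel measures in units of the default font's digit width), but it gets you close enough for most reports.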

Extracting Data from Excel Formulas

Sometimes you receive Excel files with formulas. You might need just the computed values or the formulas themselves:

python
from openpyxl import load_workbook
 
# Get computed values (what Excel displays) - requires data_only=True
wb = load_workbook('formulas.xlsx', data_only=True)
ws = wb.active
for row in ws.iter_rows(values_only=True):
    print(row)

# Get formulas themselves - the default load keeps formula strings
wb = load_workbook('formulas.xlsx')
ws = wb.active
for row in ws.iter_rows():
    for cell in row:
        if cell.value and isinstance(cell.value, str) and cell.value.startswith('='):
            print(f"{cell.coordinate}: {cell.value}")
 
# Extract both
wb_formulas = load_workbook('formulas.xlsx')
wb_values = load_workbook('formulas.xlsx', data_only=True)
 
for (cell_f, cell_v) in zip(
    wb_formulas.active.iter_rows(),
    wb_values.active.iter_rows()
):
    for cf, cv in zip(cell_f, cell_v):
        if cf.value != cv.value:
            print(f"{cf.coordinate}: formula={cf.value}, value={cv.value}")

The data_only=True flag deserves special attention here. When you load a workbook with data_only=True, openpyxl reads the cached formula results that Excel stored the last time the file was saved. If the file was never opened in Excel after the data changed, those cached values may be stale. This is not a bug in openpyxl; it is a fundamental limitation of the XLSX format. If you need guaranteed-current computed values, you need to open the file in Excel (or a headless Excel process) to trigger recalculation before reading with openpyxl.

Advanced YAML: Structuring Large Configs

Splitting YAML Across Multiple Files

Large configuration systems often split YAML across files and merge them:

python
import yaml
from pathlib import Path
 
def load_yaml_tree(directory):
    """Load all YAML files in a directory tree and merge them."""
    config = {}
 
    for yaml_file in sorted(Path(directory).rglob('*.yaml')):
        with open(yaml_file) as f:
            file_config = yaml.safe_load(f)
        if file_config:
            # Merge at the top level
            config.update(file_config)
 
    return config
 
# Directory structure:
# configs/
#   database.yaml
#   logging.yaml
#   security.yaml
 
config = load_yaml_tree('configs/')

This pattern shows up in frameworks like Helm (the Kubernetes package manager), where configuration is split across base files and override files. The sorted() call on the file list is important: it ensures deterministic merge order, so the same set of config files always produces the same result regardless of filesystem ordering. For more sophisticated merging (deep merge rather than shallow update), you will want a helper function that recursively merges nested dicts rather than clobbering them at the top level.
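A minimal recursive deep-merge helper along those lines might look like this sketch:

```python
def deep_merge(base, override):
    """Recursively merge override into base, returning a new dict.

    Nested dicts are merged key by key; any other value type in
    override replaces the corresponding value in base.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {'database': {'host': 'localhost', 'port': 5432}}
override = {'database': {'host': 'db.prod.internal'}}
print(deep_merge(base, override))
# {'database': {'host': 'db.prod.internal', 'port': 5432}}
```

With config.update() the override file's database block would have wiped out the port; deep_merge keeps keys the override does not mention.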

Custom YAML Deserialization

For type-safe YAML loading, define custom constructors:

python
import yaml
from datetime import datetime
 
class Config:
    def __init__(self, data):
        self.data = data
 
    def __repr__(self):
        return f"Config({self.data})"
 
def config_constructor(loader, node):
    """Custom constructor for !config tags."""
    values = loader.construct_mapping(node)
    return Config(values)
 
yaml.add_constructor('!config', config_constructor, Loader=yaml.SafeLoader)
 
yaml_text = """
app: !config
  name: MyApp
  version: 1.0
  debug: true
"""
 
config = yaml.safe_load(yaml_text)
print(config['app'])  # Config({'name': 'MyApp', 'version': 1.0, 'debug': True})

Custom constructors combined with yaml.SafeLoader give you the best of both worlds: the ability to deserialize YAML into typed Python objects without the security risk of allowing arbitrary Python execution. The !config tag acts like a type annotation in the YAML file, telling the parser to use your constructor instead of the default dict constructor. Ansible uses a similar mechanism for its custom tags, such as !vault for encrypted values.

Environment Variable Interpolation in YAML

A common pattern: use environment variables in YAML configs:

python
import yaml
import os
import re
 
def env_constructor(loader, node):
    """Allow ${VAR} syntax in YAML."""
    value = loader.construct_scalar(node)
    # Replace ${VAR} with environment variable
    return re.sub(
        r'\$\{([^}]+)\}',
        lambda m: os.environ.get(m.group(1), ''),
        value
    )
 
yaml.add_constructor('!env', env_constructor, Loader=yaml.SafeLoader)
 
yaml_text = """
database:
  host: localhost
  port: !env ${DB_PORT}
  user: !env ${DB_USER}
"""
 
os.environ['DB_PORT'] = '5432'
os.environ['DB_USER'] = 'admin'
 
config = yaml.safe_load(yaml_text)
print(config['database']['port'])  # '5432'

Environment variable interpolation in config files is a foundational pattern for twelve-factor app architecture. The config file lives in version control without any secrets; the secrets live in environment variables that are injected at runtime. This lets you use the same config file across development, staging, and production environments; only the environment variables change. The !env tag approach shown here is explicit about which values come from the environment, making it easy to audit what your application's configuration surface looks like.

Common File Processing Mistakes

Even experienced developers make predictable mistakes when processing files in Python. Understanding these failure modes before you encounter them saves significant debugging time and prevents data corruption bugs that can be surprisingly hard to diagnose after the fact.

The most costly mistake is silent data corruption: writing code that appears to work but subtly mangles the data. The classic example is the manual CSV join we showed earlier: f"{name},{email}\n". This code works perfectly for a year, then one day a user enters a comma in their display name and you have a corrupted CSV that downstream systems cannot parse. The corruption only appears in specific records, making it hard to notice and harder to trace back to the root cause. Using the csv module's built-in quoting eliminates this entire class of bugs.
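The difference is easy to demonstrate in isolation: the manual join silently produces an extra column, while csv.writer quotes the field automatically:

```python
import csv
import io

name, email = 'Smith, Jane', 'jane@example.com'

# Manual join: the embedded comma creates a phantom third column
manual = f"{name},{email}\n"
print(manual)  # Smith, Jane,jane@example.com

# csv.writer quotes the field so it round-trips as two columns
buf = io.StringIO()
csv.writer(buf).writerow([name, email])
print(buf.getvalue())  # "Smith, Jane",jane@example.com
```

Reading the quoted line back with csv.reader yields exactly the original two fields; reading the manual line yields three.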

A second common mistake is ignoring encoding entirely. Code that opens files without specifying an encoding works fine in development on a Linux machine with UTF-8 as the system default, then breaks in production on a Windows server with cp1252, then breaks again when processing files uploaded by users on Japanese Windows systems (cp932). Always specify encoding explicitly, even if you are "sure" the files will always be UTF-8. The line open('file.csv', encoding='utf-8') costs you nothing when the assumption is correct and saves you hours when it is not.

A third mistake is loading entire large files into memory. The pattern data = list(csv.DictReader(f)) is convenient for small files but becomes a serious problem at scale. A CSV file with ten million rows and twenty columns can easily consume multiple gigabytes of memory when loaded as a list of dicts. Generator-based processing, where you process one row or one batch at a time, keeps memory usage constant regardless of file size. Build the streaming habit early and you will never need to retrofit it later when file sizes outgrow your server's RAM.

Finally, watch out for the "works on my machine" trap with YAML booleans. YAML 1.1 (which PyYAML implements) treats yes, no, on, off, true, and false as boolean values. This means a config value of country: NO becomes country: False in Python, a bug that only manifests when someone sets a country code to "NO" (Norway's ISO 3166-1 alpha-2 code). Always quote values in YAML that should be strings but might look like booleans, and document this gotcha for anyone who edits your config files.
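The gotcha fits in two lines (this assumes PyYAML's default YAML 1.1 behavior):

```python
import yaml

print(yaml.safe_load('country: NO'))    # {'country': False} - Norway becomes a boolean
print(yaml.safe_load("country: 'NO'"))  # {'country': 'NO'} - quoting keeps the string
```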

Error Handling and Resilience

Real data is messy. Here's how to handle it gracefully:

python
import csv
import logging
from typing import List, Tuple
 
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
 
def safe_read_csv(filename: str) -> Tuple[List[dict], List[str]]:
    """Read CSV and return (valid_rows, error_list)."""
    # Try different encodings; reset results on each attempt so rows
    # read with the wrong encoding don't leak into the next attempt
    for encoding in ['utf-8', 'cp1252', 'iso-8859-1']:
        valid_rows = []
        errors = []
        try:
            with open(filename, encoding=encoding) as f:
                reader = csv.DictReader(f)
                for row_num, row in enumerate(reader, start=2):
                    try:
                        # Validate required fields
                        if not row.get('id'):
                            errors.append(f"Row {row_num}: Missing ID")
                            continue
 
                        valid_rows.append(row)
 
                    except Exception as e:
                        errors.append(f"Row {row_num}: {str(e)}")
                        logger.warning(f"Error on row {row_num}: {e}")
 
            logger.info(f"Successfully read with {encoding}")
            return valid_rows, errors
 
        except UnicodeDecodeError:
            logger.debug(f"Failed to read with {encoding}")
            continue
 
    raise ValueError(f"Could not decode {filename} with any encoding")

The return type Tuple[List[dict], List[str]] communicates clearly that this function has two outputs: the good data and the problems. This pattern, collecting errors rather than raising immediately, is the foundation of production-quality data ingestion. Your callers can decide what to do with the errors: write them to a log file, send them to an error reporting service, include them in a summary email to the data owner, or raise an exception if the error rate exceeds a threshold. Burying that decision inside the file reader function is the wrong abstraction.
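A caller applying the error-rate-threshold idea might look like this sketch; the check_error_rate helper and the ten percent cutoff are illustrative, not part of the original code:

```python
def check_error_rate(valid_rows, errors, max_error_rate=0.10):
    """Decide whether a batch is healthy enough to continue processing."""
    total = len(valid_rows) + len(errors)
    if total and len(errors) / total > max_error_rate:
        raise RuntimeError(
            f"{len(errors)}/{total} rows failed validation - refusing to continue"
        )
    return valid_rows

# With safe_read_csv above: rows = check_error_rate(*safe_read_csv('sales.csv'))
```

The threshold decision lives in the caller, where it belongs; the reader stays reusable across pipelines with different tolerance for bad rows.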

Summary

You now understand when and how to use CSV, Excel, and YAML in Python, and more importantly, you understand the why behind each tool choice. This is the knowledge that separates someone who can look up the API documentation from someone who can design data systems that work reliably in production.

  • CSV: Use Python's built-in csv module with DictReader/DictWriter. Watch out for delimiters, quoting, and encoding. Never split on commas manually. Use streaming patterns for large files and always specify encoding explicitly.
  • Excel: Use openpyxl for multiple sheets, formatting, and named ranges. Use read_only=True for huge files. Add data validation and conditional formatting to make reports that guide users rather than just holding data. Use data_only=True when you need computed values rather than formulas.
  • YAML: Always use safe_load() without exception. Use ruamel.yaml if you need round-trip preservation of comments and formatting. Use anchors and aliases to eliminate repetition in multi-environment configs. Be aware of the boolean gotcha with values like yes, no, and country codes.

Choose CSV for simplicity and compatibility, Excel when stakeholders demand it, and YAML for human-editable configuration. Each format exists because it solves real problems in real systems. The data pipeline pattern we built in this article, CSV input, YAML validation rules, Excel output, is a template you can adapt to dozens of real-world scenarios. Master all three formats, and you will handle 95% of file processing problems you encounter in production code. The remaining 5% are usually custom binary formats, at which point you will be well-equipped to reach for the right specialized library.

Next up: We're diving into SQLite, Python's built-in database. After working with files, you're ready to learn how databases actually store and query data, and why structured storage beats flat files for anything beyond the simplest use cases.
