Data Cleaning with Pandas: Missing Values, Duplicates, and Types

Nobody warned you about this when you started learning data science. You signed up for the exciting parts: building predictive models, discovering hidden patterns, impressing people with stunning visualizations. What they didn't tell you is that before any of that happens, you're going to spend a serious chunk of your career wrangling data that looks like it was assembled by a committee of interns working from different style guides in different time zones.
Here's a scenario you'll recognize instantly once you've been in the field for more than a few weeks. You download a CSV file from some internal system. The column labeled "age" contains values like "25", "thirty", "N/A", and "25.0", all in the same column. The "date_joined" field has formats ranging from "2024-01-15" to "January 15th, 2024" to "1/15/24". Customer names are "ALICE", "alice", "Alice", and " Alice ", four ways to spell the same name. And you have 50,000 rows of this. Welcome to the reality of working with real-world data.
This is not a rare edge case. This is Tuesday. Every dataset you work with in a professional setting carries the accumulated sins of every system, every data entry clerk, and every database migration that touched it before you did. Legacy systems get replaced but data gets exported and re-imported. Forms get updated but old records don't get backfilled. API responses change their schema but no one updates the consumers. The end result is you, staring at a DataFrame that makes your eyes cross.
The good news is that pandas was built specifically for this kind of work. It gives you a systematic, repeatable toolkit for hunting down every flavor of dirty data and fixing it efficiently. The even better news is that once you've internalized these techniques, you'll actually start enjoying this part of the work, because there is genuine satisfaction in watching a chaotic dataset transform into something clean, consistent, and ready for analysis.
That's what this article is about. We're going deep on pandas data cleaning: how to find and handle missing values, eliminate duplicates, convert types correctly, and build pipelines you can reuse across every project. By the time you reach the end, you'll have a complete mental model and a practical toolkit for turning messy data into reliable data.
Let's start with the mess.
Table of Contents
- Why Data Cleaning Takes 80% of Your Time
- Detecting Missing Data
- The Basic Tools: `isnull()` and `notna()`
- Counting Missing Values
- The `info()` Method: Your Data Overview
- Visualizing Missing Data
- Missing Data Strategies
- Handling Missing Values: The Strategic Choices
- Strategy 1: Drop Missing Data with `dropna()`
- Strategy 2: Fill Missing Data with `fillna()`
- Strategy 3: Interpolate (Linear Estimation)
- Detecting and Removing Duplicates
- Finding Duplicates with `duplicated()`
- Removing Duplicates with `drop_duplicates()`
- Type Conversion: Getting It Right
- The `astype()` Method
- Type Conversion Gotchas
- Safer Conversion with `pd.to_numeric()`
- Date and Time with `pd.to_datetime()`
- String Cleaning: The Str Accessor
- Strip Whitespace and Standardize Case
- Replace and Regex
- Extract Patterns with `str.extract()`
- Categorical Types for Memory Efficiency
- Outlier Detection: IQR and Z-Score
- IQR (Interquartile Range) Method
- Z-Score Method
- Common Cleaning Mistakes
- Building a Reusable Data Cleaning Pipeline
- Before and After: A Real Example
- Why Data Cleaning Is an Investment, Not Overhead
- Key Takeaways
- What's Next?
Why Data Cleaning Takes 80% of Your Time
Ask any practicing data scientist how they actually spend their time and they'll tell you the same thing: cleaning data consumes the vast majority of every project. Industry surveys consistently put the number at 60–80% of total project time. That leaves 20–40% for the "actual" work: exploratory analysis, modeling, validation, communication. If that seems backwards to you, you're not alone. But once you understand why it happens, it makes complete sense.
The core problem is that data is created for operational purposes, not analytical ones. A customer relationship system exists to track customer interactions. A point-of-sale system exists to process transactions. Neither of those systems was designed with "will someone eventually want to build a churn prediction model from this?" as a first-class requirement. So fields get optional when they should be required. Validation gets skipped when the business needs to move fast. Systems get migrated and subtle encoding differences sneak in. By the time the data reaches you, it carries all of those compromises.
There's also the fundamental problem of human data entry. People abbreviate differently. They use different date formats habitually. They misread fields and enter data in the wrong column. They copy-paste and accidentally introduce trailing spaces. None of this is malicious; it's just what happens when humans interact with forms at scale over time. A large enterprise database might have been touched by thousands of different people across a decade or more of operation, each with their own habits and each working under their own time pressures.
The reason you can't skip this step is brutal and simple: garbage in, garbage out. If your missing values strategy is wrong, your statistics will be biased toward the segments of your population where data happens to be present. If you don't remove duplicates, your model trains on artificial patterns that inflate the apparent frequency of certain records. If columns have mixed types, your calculations either fail silently (producing wrong answers you don't notice) or crash loudly (which is actually better, because at least you know something went wrong). Cleaning isn't bureaucratic overhead. It's the foundation that everything else stands on.
Detecting Missing Data
Before you fix missing values, you need to see them clearly. Pandas gives you several ways to do this, and each one is suited to a different scale of investigation.
The Basic Tools: isnull() and notna()
The simplest check is asking: which individual cells are missing? This is where you start when you're first getting to know a new dataset and want to understand the shape of the problem.
import pandas as pd
import numpy as np
# Create a messy dataset
df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'age': [25, np.nan, 30, 35, 40],
    'salary': [50000, 60000, 75000, None, 85000],
    'email': ['alice@example.com', None, 'carol@example.com', 'david@example.com', 'eve@example.com']
})
print(df)

Output:

    name   age   salary              email
0  Alice  25.0  50000.0  alice@example.com
1    Bob   NaN  60000.0               None
2   None  30.0  75000.0  carol@example.com
3  David  35.0      NaN  david@example.com
4    Eve  40.0  85000.0    eve@example.com
Notice that pandas uses NaN (Not a Number) for missing numeric values and None for missing object values. Both represent the same concept, "this data point doesn't exist", but they manifest differently depending on the column's dtype. This is important to understand because different cleaning operations handle them slightly differently. Now use isnull() to map exactly where those gaps are:
print(df.isnull())

Output:

    name    age  salary  email
0  False  False   False  False
1  False   True   False   True
2   True  False   False  False
3  False  False    True  False
4  False  False   False  False
You get a boolean DataFrame where True means "missing" and False means "present." This is useful for filtering, but on its own it's a bit hard to scan for patterns in a large dataset. Want the inverse, seeing which cells have valid data? Use notna():
print(df.notna())

Counting Missing Values
For any dataset beyond a handful of rows, you need aggregate counts rather than cell-by-cell inspection. The most useful one-liner in data cleaning is chaining isnull() with sum(), which collapses the boolean matrix into per-column counts.
print(df.isnull().sum())

Output:

name      1
age       1
salary    1
email     1
dtype: int64
Even more useful is expressing those counts as a percentage of the total rows, because whether "10 missing values" is a problem depends entirely on whether your dataset has 50 rows or 5 million.
print((df.isnull().sum() / len(df)) * 100)

Output:

name      20.0
age       20.0
salary    20.0
email     20.0
dtype: float64
Here every column is missing 20% of its values. For a 5-row toy dataset that's one missing value each. In a real dataset, a 20% missing rate for a critical field like email or salary would be a serious problem requiring careful strategy, not just a quick drop.
The info() Method: Your Data Overview
This is my favorite starting point for any new dataset. info() gives you a holistic view of the entire DataFrame in a single output: column names, non-null counts, and dtypes all at once.
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    4 non-null      object
 1   age     4 non-null      float64
 2   salary  4 non-null      float64
 3   email   4 non-null      object
dtypes: float64(2), object(2)
memory usage: 192.0 bytes
Notice that each column shows its Non-Null Count. This instantly tells you which columns have missing data. If a column should have 5 entries but shows 4, something's missing. You're also getting dtype information here, which often reveals its own problems, a column that should be int but is stored as float64 (because NaN can't exist in integer arrays), or columns that should be datetime but are stored as object. Consider info() your mandatory first stop on any data cleaning journey.
Visualizing Missing Data
For larger datasets where aggregate counts alone don't tell the whole story, a heatmap can reveal patterns in where data is missing. Systematic missingness, where the same rows tend to be missing values across multiple columns, tells you something different than random missingness. It might indicate that certain data sources or time periods weren't captured properly.
import matplotlib.pyplot as plt
# Missing data heatmap
missing_heatmap = df.isnull()
plt.figure(figsize=(8, 4))
plt.imshow(missing_heatmap.T, cmap='RdYlGn_r', aspect='auto', interpolation='nearest')
plt.xlabel('Row Index')
plt.ylabel('Column')
plt.title('Missing Data Heatmap')
plt.colorbar(label='Missing (Red) to Present (Green)')
plt.tight_layout()
plt.show()

Red cells = missing. Green = present. On a real dataset with hundreds of columns and thousands of rows, this visualization can immediately surface a cluster of missing values in a particular date range, or show you that certain columns are almost entirely empty, which might mean they should be dropped rather than imputed.
Missing Data Strategies
You've found the missing data. Now you need to decide what to do with it, and this decision matters more than most beginners realize. The wrong strategy doesn't just produce slightly suboptimal results, it can introduce systematic bias that silently corrupts every downstream analysis and model you build from that data.
The three main strategies are dropping, filling, and interpolating. None of them is universally correct. The right choice depends on why the data is missing, how much is missing, and what you'll be doing with the data afterward. Before you pick a strategy, ask yourself: is this data missing randomly (any row was equally likely to be missing this value) or is it missing for a systematic reason (certain types of customers are more likely to have incomplete profiles)? Systematic missingness means dropping those rows will bias your sample.
Drop it when the missing rate is genuinely low (under 5% is a reasonable rule of thumb), the missingness is truly random, and you have enough data that you can afford to lose some rows. Never drop when you're already working with limited data or when the rows with missing values are overrepresented in a particular segment of your population.
Fill it when you can make a reasonable estimate of the missing value, the column mean or median for numerical data, the mode for categorical data, or a domain-specific default. Just be aware that filling with summary statistics reduces variance in your data and can make distributions look more normal than they actually are.
Interpolate when you have time-series or ordered data where the missing values genuinely can be estimated from their neighbors. A temperature sensor that missed three readings can reasonably be filled by interpolating between the readings before and after the gap.
The overarching principle: document every decision you make. Future you, and your teammates, need to understand why that column has no nulls despite the raw data having 15% missingness.
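Before committing to a strategy, it helps to put the numbers in front of you. Here's a small sketch of a helper that pairs each column's missing rate with a candidate strategy; the function name missingness_report and the 5% cutoff are illustrative choices, not pandas conventions, and the suggested strategy is a starting point, not a substitute for domain knowledge:

```python
import pandas as pd
import numpy as np

def missingness_report(df, drop_threshold=5.0):
    """Summarize per-column missingness as a starting point for strategy decisions.
    The 5% drop threshold is a rule of thumb, not a law."""
    pct = df.isnull().sum() / len(df) * 100
    report = pd.DataFrame({
        'missing_pct': pct.round(1),
        'dtype': df.dtypes.astype(str),
        # Candidate strategy only -- always sanity-check against domain knowledge
        'candidate': np.where(pct == 0, 'nothing',
                     np.where(pct < drop_threshold, 'drop rows', 'fill / investigate')),
    })
    return report.sort_values('missing_pct', ascending=False)

df = pd.DataFrame({
    'age': [25, np.nan, 30, 35, 40],
    'email': ['a@x.com', None, 'c@x.com', 'd@x.com', 'e@x.com'],
    'id': [1, 2, 3, 4, 5],
})
print(missingness_report(df))
```

A report like this, printed at the top of every cleaning script, doubles as the documentation trail described above.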
Handling Missing Values: The Strategic Choices
You have three main strategies. Let's explore each one in practice.
Strategy 1: Drop Missing Data with dropna()
The most direct approach: if a row or column has missing values, remove it entirely. This is fast and clean but can be costly if you have significant missingness.
# Drop rows with ANY missing value
df_dropped = df.dropna()
print(df_dropped)

Output:

    name   age   salary              email
0  Alice  25.0  50000.0  alice@example.com
4    Eve  40.0  85000.0    eve@example.com
You lost 3 rows, 60% of your data, because each remaining row had at least one missing value somewhere. This is the core danger of dropna() without parameters: a dataset with many columns where each column has moderate missingness will lose most of its rows, even when no individual row is "mostly missing."
When to use: When missing values are rare and random. Your loss is small, and the remaining data is unbiased.
When NOT to use: When missing data is systematic (e.g., older records always lack email). You'll introduce bias.
Drop columns instead when a particular column has so much missingness that it's not useful:
# Drop columns with ANY missing value
df_dropped_col = df.dropna(axis=1)
print(df_dropped_col)

Output:

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

Every column in this toy dataset has at least one missing value, so every single column gets dropped and you're left with nothing but the index. Clearly, column-dropping isn't the answer here. But on real datasets where a particular column might be 90% empty, dropping that entire column is often the right call.
Precision dropping: Target specific critical columns:
# Drop rows missing values in 'salary' only
df_salary_clean = df.dropna(subset=['salary'])
print(df_salary_clean)

Output:

    name   age   salary              email
0  Alice  25.0  50000.0  alice@example.com
1    Bob   NaN  60000.0               None
2   None  30.0  75000.0  carol@example.com
4    Eve  40.0  85000.0    eve@example.com
This is the most practical form of dropna() for real work. You drop rows only when a specific column you know is critical to your analysis is missing, rather than discarding rows just because some less-important field didn't get populated.
Strategy 2: Fill Missing Data with fillna()
Replace missing values with a constant, a computed statistic, or a propagated value from neighboring rows. This preserves your row count but introduces artificial values, so choose your fill strategy thoughtfully.
Fill with a specific value:
# Fill each column with a contextually appropriate value
df_filled = df.fillna({
    'age': 0,
    'salary': df['salary'].mean(),
    'email': 'unknown@example.com'
})
print(df_filled)

Output:

    name   age   salary                email
0  Alice  25.0  50000.0    alice@example.com
1    Bob   0.0  60000.0  unknown@example.com
2   None  30.0  75000.0    carol@example.com
3  David  35.0  67500.0    david@example.com
4    Eve  40.0  85000.0      eve@example.com

Notice: age defaults to 0 (not ideal, but explicit). Salary fills with the column's mean of 67500, a reasonable statistical estimate. Email fills with a placeholder string. The name column wasn't listed in the fill dictionary, so its missing value survives untouched. Each column gets a contextually appropriate fill value rather than a one-size-fits-all approach.
Forward fill (propagate the last valid value down):
# Forward fill: copy the previous valid value down
df_ffill = df.ffill()   # fillna(method='ffill') is deprecated in modern pandas
print(df_ffill)

Output:

    name   age   salary              email
0  Alice  25.0  50000.0  alice@example.com
1    Bob  25.0  60000.0  alice@example.com
2    Bob  30.0  75000.0  carol@example.com
3  David  35.0  75000.0  david@example.com
4    Eve  40.0  85000.0    eve@example.com
Backward fill (copy next row's value up):
df_bfill = df.bfill()   # likewise, prefer .bfill() over the deprecated fillna(method='bfill')
print(df_bfill)

When to use forward/backward fill: For time-series data where values persist between observations, stock prices that don't change every minute, configuration settings that stay the same between updates, or sensor readings where the device sometimes misses a cycle but the underlying value hasn't changed.
Strategy 3: Interpolate (Linear Estimation)
For numeric data arranged in some meaningful order, estimate missing values based on their neighbors rather than a global statistic.
df_numeric = pd.DataFrame({
    'time': [1, 2, 3, 4, 5],
    'temperature': [20, np.nan, 26, np.nan, 32]
})
df_interpolated = df_numeric.copy()
df_interpolated['temperature'] = df_interpolated['temperature'].interpolate()
print(df_interpolated)

Output:

   time  temperature
0     1         20.0
1     2         23.0
2     3         26.0
3     4         29.0
4     5         32.0
Pandas linearly estimated row 2's temp as 23 (midpoint between 20 and 26). Row 4 as 29 (midpoint between 26 and 32). The result is a smooth, physically plausible progression that correctly captures the trend in the data rather than just inserting the global average. For time-series analysis, this almost always produces better results than mean imputation.
Detecting and Removing Duplicates
Duplicate rows are more common than you'd think, and they cause subtle problems that are hard to debug after the fact. Maybe a database was exported twice and both exports got concatenated. Maybe a join operation created a Cartesian product somewhere. Maybe records were entered twice because the first entry didn't appear to save. Whatever the cause, duplicates artificially inflate the frequency of certain records and teach your models that those patterns are more common than they actually are.
Finding Duplicates with duplicated()
Before removing anything, verify what you're actually dealing with. The duplicated() method returns a boolean Series marking which rows are exact repeats of a previous row.
df_with_dupes = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'David'],
    'age': [25, 30, 25, 35],
    'city': ['NYC', 'LA', 'NYC', 'Boston']
})
print(df_with_dupes.duplicated())

Output:

0    False
1    False
2     True
3    False
dtype: bool
Row 2 is a duplicate of row 0. By default, duplicated() marks the second and subsequent occurrences, not the first. This means you can use the boolean mask directly to see which rows would be removed before you actually remove them: a useful validation step.
Keep the first occurrence:
df_with_dupes.duplicated(keep='first')  # Default behavior

Keep the last occurrence:

df_with_dupes.duplicated(keep='last')

Mark all duplicates (including the first):

df_with_dupes.duplicated(keep=False)

Output:

0     True
1    False
2     True
3    False
dtype: bool
This is useful when you want to inspect all copies of a duplicated record to understand which version, if any, is the canonical one. Maybe the first occurrence has a typo that was corrected in the second. In that case you'd want to keep the later one.
Removing Duplicates with drop_duplicates()
Once you've confirmed you understand what's being dropped, go ahead and remove the duplicates.
df_clean = df_with_dupes.drop_duplicates()
print(df_clean)

Output:

    name  age    city
0  Alice   25     NYC
1    Bob   30      LA
3  David   35  Boston
Duplicates on specific columns:
# Remove duplicates based only on 'name'
df_name_unique = df_with_dupes.drop_duplicates(subset=['name'])
print(df_name_unique)

Output:

    name  age    city
0  Alice   25     NYC
1    Bob   30      LA
3  David   35  Boston
This keeps the first Alice and drops any subsequent Alice rows, even if other columns differ. The subset parameter is particularly powerful in business scenarios where you want to ensure each customer or product appears only once, even if the records differ slightly in some non-identifying fields.
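One common real-world pattern, sketched here with hypothetical customer records: when duplicate records differ and you want the most recent one, sort so the record you want survives comes last, then deduplicate with keep='last':

```python
import pandas as pd

# Hypothetical customer records: same name, conflicting details
df = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob'],
    'city': ['NYC', 'Boston', 'LA'],
    'updated': pd.to_datetime(['2024-01-01', '2024-06-01', '2024-03-01']),
})
# Keep the most recently updated record per customer
latest = (
    df.sort_values('updated')
      .drop_duplicates(subset=['name'], keep='last')
      .sort_index()
)
print(latest['city'].tolist())  # ['Boston', 'LA']
```

Sorting first makes the keep parameter deterministic instead of an accident of row order.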
Type Conversion: Getting It Right
Data types are where a huge number of silent bugs originate. A column stored as string instead of numeric will let you read the values just fine, but the moment you try to calculate the mean or sort numerically, you'll get nonsense results, or a crash. A date stored as string can't be filtered by date range or have time deltas computed from it. Getting types right is not optional.
Pandas gives you three main tools: astype() for straightforward conversions, pd.to_numeric() for messy numeric data, and pd.to_datetime() for dates in any format. Knowing when to use each one saves you from a whole class of cryptic errors.
The astype() Method
When your data is already clean and you just need to change the declared type, astype() is the right tool.
df_types = pd.DataFrame({
    'price': ['100', '200', '300'],
    'quantity': ['5', '10', '15'],
    'is_active': ['True', 'False', 'True']
})
print(df_types.dtypes)

Output:

price        object
quantity     object
is_active    object
dtype: object
All strings. This happens constantly when data is read from CSV files, where pandas has to infer types and defaults to object (string) when it's uncertain. Convert them to their proper types:
df_types['price'] = df_types['price'].astype('int')
df_types['quantity'] = df_types['quantity'].astype('int')
# astype('bool') would turn EVERY non-empty string into True, including 'False',
# so map the string values explicitly instead
df_types['is_active'] = df_types['is_active'].map({'True': True, 'False': False}).astype('bool')
print(df_types.dtypes)

Output:

price        int64
quantity     int64
is_active     bool
dtype: object
Be careful with astype() on messy data:
messy = pd.Series(['100', '200', 'n/a', '400'])
try:
    messy.astype('int')
except ValueError as e:
    print(f"Error: {e}")

Output:

Error: invalid literal for int() with base 10: 'n/a'
It crashes. This is astype()'s fundamental limitation: it assumes the data is already clean. The moment there's a single non-convertible value in the column, the entire operation fails. Use pd.to_numeric() instead when you're working with real-world data.
Type Conversion Gotchas
Type conversion is one of those areas where confident beginners get burned repeatedly, because the failures aren't always obvious. Let's walk through the common traps so you can recognize them before they catch you.
The first gotcha is the integer-to-float promotion that happens silently when you have NaN values. You can't have NaN in an integer column in traditional pandas (though newer versions support nullable integers with pd.Int64Dtype()). So when pandas reads a column that should be integers but has some missing values, it quietly promotes everything to float64. You end up with 25.0 instead of 25, which breaks comparisons like df['age'] == 25, because 25.0 != 25 in some contexts. Always check for this after loading data.
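A quick demonstration of the promotion, and of the nullable Int64 dtype (note the capital I, which distinguishes it from plain int64) as the escape hatch:

```python
import pandas as pd
import numpy as np

s = pd.Series([25, np.nan, 30])
print(s.dtype)       # float64 -- silently promoted because of the NaN

# The nullable integer dtype keeps whole numbers whole and represents
# the gap as pd.NA instead of forcing a float conversion
s_int = s.astype('Int64')
print(s_int.dtype)   # Int64
```

The trade-off is that nullable dtypes are newer and a few third-party libraries still expect the classic NumPy-backed dtypes, so check your downstream tooling before adopting them wholesale.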
The second gotcha is mixed-type object columns that look numeric. A column containing ['100', '200', '300.0', 'n/a', '400'] will be stored as object, and calling .sum() or .mean() on it will either raise an error or silently compute string concatenation. If you see a numeric column being stored as object, pd.to_numeric() with errors='coerce' is your friend.
The third gotcha is boolean conversion from string. When you have ['True', 'False', 'True'] as strings and call .astype('bool'), every non-empty string becomes True, including the string 'False'. You need to map explicitly: df['col'].map({'True': True, 'False': False}).
The fourth gotcha is date formats. pd.to_datetime() can handle most formats automatically, but when you have mixed formats in the same column (some rows as "2024-01-15" and others as "01/15/2024"), the inference can fail or produce wrong results. In those cases, specify the format explicitly with the format parameter or use dayfirst=True if your dates are DD/MM/YYYY rather than the US MM/DD/YYYY convention.
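When formats are genuinely mixed, one defensible approach, sketched below with made-up dates, is to parse each known format explicitly with errors='coerce' and then combine the results:

```python
import pandas as pd

# Two formats mixed in one column: ISO and US-style
dates = pd.Series(['2024-01-15', '01/16/2024', '2024-01-17'])

# Parse each format separately; rows that don't match become NaT
iso = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
us = pd.to_datetime(dates, format='%m/%d/%Y', errors='coerce')

# Combine: take the ISO parse where it worked, fall back to the US parse
parsed = iso.fillna(us)
print(parsed.dt.day.tolist())  # [15, 16, 17]
```

Being explicit about each format means a stray DD/MM value can never be silently misread as MM/DD, which is the most dangerous failure mode of date inference.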
Safer Conversion with pd.to_numeric()
When your numeric data is messy, which it usually is, reach for pd.to_numeric() with the errors='coerce' parameter. This converts everything it can and silently turns everything it can't into NaN, giving you a clean numeric column you can then handle with your chosen missing value strategy.
messy = pd.Series(['100', '200', 'n/a', '400'])
# Convert and coerce non-numeric values to NaN
messy_clean = pd.to_numeric(messy, errors='coerce')
print(messy_clean)

Output:

0    100.0
1    200.0
2      NaN
3    400.0
dtype: float64
The errors='coerce' flag converts invalid values to NaN instead of crashing. You can then handle those NaNs with the strategies from earlier: drop the rows, fill with the column mean, or investigate why those values couldn't be converted (which often reveals a data quality issue worth understanding).
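That investigation step is a one-liner: the coerced NaNs act as a mask that points you straight back at the offending raw values. A small sketch:

```python
import pandas as pd

messy = pd.Series(['100', '200', 'n/a', '400', 'unknown'])
converted = pd.to_numeric(messy, errors='coerce')

# The rows that became NaN are exactly the raw values worth investigating
bad = messy[converted.isna()]
print(bad.tolist())  # ['n/a', 'unknown']
```

Seeing the distinct failing values (via bad.unique(), for example) often reveals a pattern, like one upstream system writing 'n/a' while another writes 'unknown'.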
Date and Time with pd.to_datetime()
Dates are notoriously messy in real datasets. Different regions use different conventions. Legacy systems store dates as integer timestamps. Exports from different tools use different ISO formats. pd.to_datetime() handles this with impressive flexibility.
dates = pd.Series(['2024-01-15', '2024-02-20', '2024-03-10'])
df_dates = pd.DataFrame({
    'date': dates,
    'value': [100, 150, 200]
})
print(df_dates.dtypes)

Output:

date     object
value     int64
dtype: object
Convert to datetime:
df_dates['date'] = pd.to_datetime(df_dates['date'])
print(df_dates.dtypes)

Output:

date     datetime64[ns]
value             int64
dtype: object
Now you can do time-based operations that would have been impossible with string dates:
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day_of_week'] = df_dates['date'].dt.day_name()
print(df_dates)

Output:

        date  value  year  month day_of_week
0 2024-01-15    100  2024      1      Monday
1 2024-02-20    150  2024      2     Tuesday
2 2024-03-10    200  2024      3      Sunday
The .dt accessor unlocks the entire datetime ecosystem: filtering by month, calculating days between dates, extracting quarter, checking if a date is a business day. None of this is possible when your dates are strings.
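A brief sketch of two of those operations, date-range filtering and day deltas, reusing the same three dates:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10']),
    'value': [100, 150, 200],
})

# Filter to a date range -- pandas compares the string against real timestamps
before_march = df[df['date'] < '2024-03-01']
print(len(before_march))  # 2

# Days elapsed since the first record (timedelta arithmetic)
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
print(df['days_since_start'].tolist())  # [0, 36, 55]
```

Both operations would require fragile string parsing without the datetime dtype; here they're one-liners.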
String Cleaning: The Str Accessor
Text columns are notoriously dirty: inconsistent capitalization, extra whitespace, typos. The str accessor lets you apply vectorized string operations across entire columns at once, which is dramatically faster than iterating row by row with a for loop.
Strip Whitespace and Standardize Case
Trailing and leading whitespace is probably the most common text problem in real datasets, because it's invisible in most display contexts. "Alice" and "Alice " look identical in a spreadsheet but are completely different strings, so groupby operations and joins will silently split what should be a single group into two.
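A quick illustration with made-up sales records shows how badly this bites groupby:

```python
import pandas as pd

sales = pd.DataFrame({
    'customer': ['Alice', 'Alice ', ' Alice', 'Bob'],
    'amount': [10, 20, 30, 40],
})

# Whitespace variants silently split one customer into three groups
print(sales.groupby('customer')['amount'].sum().to_dict())

# After stripping, the groups collapse correctly
sales['customer'] = sales['customer'].str.strip()
print(sales.groupby('customer')['amount'].sum().to_dict())  # {'Alice': 60, 'Bob': 40}
```

Nothing errors, nothing warns; the first aggregation is simply wrong, which is why whitespace cleanup belongs near the top of every text-cleaning pass.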
df_messy_text = pd.DataFrame({
    'product': [' LAPTOP ', 'mouse', ' HeAdPhoNeS', 'keyboard ']
})
print(df_messy_text)

Output:

       product
0      LAPTOP 
1        mouse
2   HeAdPhoNeS
3    keyboard 
Clean it with a chained strip-and-lowercase operation:
df_messy_text['product'] = df_messy_text['product'].str.strip().str.lower()
print(df_messy_text)

Output:

      product
0      laptop
1       mouse
2  headphones
3    keyboard
Replace and Regex
When your data contains formatting that needs to be stripped out, phone numbers with inconsistent separators, currency symbols on numeric fields, HTML entities in text fields, regex replacement through the str accessor is the efficient solution.
df_phone = pd.DataFrame({
    'phone': ['123-456-7890', '(123) 456-7890', '123.456.7890']
})
# Remove all non-digit characters
df_phone['phone_clean'] = df_phone['phone'].str.replace(r'\D', '', regex=True)
print(df_phone)

Output:

            phone phone_clean
0    123-456-7890  1234567890
1  (123) 456-7890  1234567890
2    123.456.7890  1234567890
Three different phone formats become one consistent format. You can now use the cleaned column as a join key or identifier without worrying about format mismatches causing failed matches.
Extract Patterns with str.extract()
Sometimes you need to split a column apart, extract the username from an email address, pull a product code from a longer description, separate first and last names. The str.extract() method with a regex capture group makes this clean and explicit.
df_email = pd.DataFrame({
    'contact': ['alice@example.com', 'bob@test.org', 'carol@company.co.uk']
})
# Extract username and domain with regex capture groups
df_email[['username', 'domain']] = df_email['contact'].str.extract(r'(\w+)@(.+)')
print(df_email)

Output:

               contact username         domain
0    alice@example.com    alice    example.com
1         bob@test.org      bob       test.org
2  carol@company.co.uk    carol  company.co.uk
Categorical Types for Memory Efficiency
When a column has a limited set of repeated values (e.g., "NYC", "LA", "Boston"), storing it as a string wastes memory. Every unique string value gets stored separately in memory for every row, even though you're repeating the same handful of values millions of times. Categorical types solve this by storing the values once and using integer codes to reference them.
df_large = pd.DataFrame({
    'city': ['NYC', 'LA', 'Boston'] * 1000000
})
print(df_large.memory_usage(index=False))

Output:

city    24000000
dtype: int64

The city column uses 24 MB just for the per-row object references (pass deep=True to also count the string objects themselves, which pushes the figure far higher). Convert to categorical and watch what happens:

df_large['city'] = df_large['city'].astype('category')
print(df_large.memory_usage(index=False))

Output:

city    3000024
dtype: int64

Now it's about 3 MB: the three unique city names are stored once, and each row holds only a one-byte integer code. An 8x reduction in memory just from changing the dtype, with no loss of information. When you're working with datasets that have hundreds of millions of rows, this kind of optimization is the difference between a query that runs and one that crashes your kernel.
Outlier Detection: IQR and Z-Score
Outliers are data points that sit far from the rest of your distribution. They might be legitimate extreme values (a genuinely exceptional customer, an unusual market event), or they might be data entry errors (someone accidentally typed 500000 instead of 50000). Either way, you need to detect them consciously and decide what to do rather than letting them silently distort your analysis.
IQR (Interquartile Range) Method
The IQR method defines outliers as anything falling more than 1.5 times the interquartile range below the first quartile or above the third quartile. This is the method underlying standard box plots and is robust to the very outliers it's trying to detect.
df_sales = pd.DataFrame({
    'daily_revenue': [1000, 1100, 1050, 900, 5000, 1200, 950, 1100, 8000, 1150]
})
Q1 = df_sales['daily_revenue'].quantile(0.25)
Q3 = df_sales['daily_revenue'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
outliers = df_sales[(df_sales['daily_revenue'] < lower_bound) | (df_sales['daily_revenue'] > upper_bound)]
print(outliers)

Output:

Q1: 1012.5, Q3: 1187.5, IQR: 175.0
Bounds: [750.0, 1450.0]
   daily_revenue
4           5000
8           8000
Rows 4 and 8 (revenues of 5000 and 8000) are outliers. These could be legitimate high-revenue days or data entry errors. That judgment call is yours to make with domain knowledge, pandas just flags them.
Z-Score Method
The z-score method measures how many standard deviations a value is from the mean. Values beyond 3 standard deviations are conventionally considered outliers, though the threshold depends on your use case and the size of your dataset.
from scipy import stats

df_sales['z_score'] = np.abs(stats.zscore(df_sales['daily_revenue']))
print(df_sales.round(3))

Output:

   daily_revenue  z_score
0           1000    0.502
1           1100    0.458
2           1050    0.480
3            900    0.546
4           5000    1.252
5           1200    0.414
6            950    0.524
7           1100    0.458
8           8000    2.568
9           1150    0.436

Notice that no value reaches the conventional z > 3 cutoff here, not even the obvious 8000 outlier. The extreme values inflate the mean and standard deviation they're measured against, which drags every z-score down. On small or heavily contaminated samples you often need a lower threshold:

outliers_zscore = df_sales[df_sales['z_score'] > 2]
print(outliers_zscore.round(3))

Output:

   daily_revenue  z_score
8           8000    2.568

The IQR method and z-score method will sometimes disagree, as they do here: IQR flagged both 5000 and 8000, while z-score flags only 8000 even at the relaxed threshold. The IQR method is more robust for skewed or contaminated distributions, while z-score assumes an approximately normal distribution. For most initial outlier detection work, running both and seeing where they agree is a good approach.
Common Cleaning Mistakes
Even experienced practitioners fall into a handful of recurring traps when cleaning data. Knowing these in advance saves you from discovering them the hard way, usually at 11pm the night before a deadline.
Cleaning without a backup. Always work on a copy of your DataFrame, never the original. df.copy() is one of the cheapest insurance policies in pandas. When your cleaning pipeline produces unexpected results, you want to be able to compare to the original without re-loading the file.
Cleaning order matters. Many beginners clean in the wrong order and create new problems. For instance, if you forward-fill missing values before removing duplicates, you'll end up propagating duplicate values into what should be clean rows. The right order is typically: detect and understand the data, remove duplicates, handle missing values, convert types, clean strings, detect outliers. Following this sequence avoids one cleaning step undoing another.
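That sequence can be sketched as a single chain on a small made-up DataFrame (the detect step, inspecting with info() and isnull(), happens before you write any of this):

```python
import pandas as pd

df = pd.DataFrame({
    'name': [' Bob ', ' Bob ', 'Eve', 'Mallory'],
    'score': ['10', '10', None, '12'],
})

cleaned = (
    df.drop_duplicates()                                   # duplicates first, so fills
                                                           # are not propagated into dupes
      .fillna({'score': '0'})                              # then missing values
      .assign(score=lambda d: pd.to_numeric(d['score']))   # then types
      .assign(name=lambda d: d['name'].str.strip())        # then strings
)
print(cleaned)
```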
Imputing without thinking about leakage. If you're building a predictive model, imputing missing values with the mean of the entire dataset includes information from your test set in your training data, a form of data leakage. Always compute your imputation statistics on training data only, then apply the same statistics to your test data separately.
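A minimal sketch of the leakage-safe pattern, with a made-up income column and a simple positional split standing in for a real train/test split:

```python
import pandas as pd

df = pd.DataFrame({'income': [40_000, 52_000, None, 61_000, None, 48_000]})
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

# Compute the statistic on training data only...
train_median = train['income'].median()

# ...then apply that same number to both splits
train['income'] = train['income'].fillna(train_median)
test['income'] = test['income'].fillna(train_median)
print(train_median)  # 52000.0
```

The test rows never influence the imputation value; they only receive it.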
Assuming one strategy fits all columns. Different columns have different missingness profiles and different consequences for getting it wrong. A missing age can probably be filled with the median. A missing sales date probably means that row should be dropped. A missing product description might need a placeholder like "unknown." Applying a single strategy across all columns is a shortcut that usually produces worse data quality than thinking through each column individually.
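Sketched with a made-up three-column frame, applying a different strategy to each column:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, None, 33, 39],
    'sale_date': ['2024-01-15', '2024-01-20', None, '2024-02-10'],
    'description': ['widget', 'gadget', 'gizmo', None],
})

df = df.dropna(subset=['sale_date']).copy()              # missing date: drop the row
df['age'] = df['age'].fillna(df['age'].median())         # missing age: fill with median
df['description'] = df['description'].fillna('unknown')  # missing text: placeholder
print(df)
```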
Not validating after cleaning. Run your summary statistics and null counts again after cleaning and compare them to what you expected. Check that the row count makes sense given how many you dropped. Check that min and max values are within expected ranges after outlier removal. Cleaning introduces its own bugs, and the only way to catch them is to inspect the output.
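Those checks are easy to encode as assertions so a broken pipeline fails loudly instead of silently. A minimal sketch, with a hypothetical validate helper and an age column assumed to exist:

```python
import pandas as pd

def validate(df: pd.DataFrame, expected_rows: int) -> None:
    """Cheap post-cleaning sanity checks."""
    assert len(df) == expected_rows, f"row count {len(df)} != {expected_rows}"
    assert df.isnull().sum().sum() == 0, "nulls survived cleaning"
    assert df['age'].between(0, 120).all(), "age out of expected range"

clean = pd.DataFrame({'age': [25.0, 30.0, 41.0]})
validate(clean, expected_rows=3)
print("all checks passed")
```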
Building a Reusable Data Cleaning Pipeline
Real work demands repeatability. You don't want to re-write the same cleaning logic for every project, and you want to be able to apply the same transformations to new batches of data as they arrive. A class-based pipeline solves both problems by encoding your cleaning steps as chainable methods with an automatic audit trail.
class DataCleaner:
    def __init__(self, df):
        self.df = df.copy()
        self.report = {}

    def remove_duplicates(self, subset=None, keep='first'):
        initial_rows = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset, keep=keep)
        self.report['duplicates_removed'] = initial_rows - len(self.df)
        return self

    def handle_missing(self, strategy='drop', fill_value=None):
        if strategy == 'drop':
            self.df = self.df.dropna()
        elif strategy == 'fill':
            self.df = self.df.fillna(fill_value)
        elif strategy == 'interpolate':
            self.df = self.df.interpolate()
        self.report['missing_strategy'] = strategy
        return self

    def convert_types(self, type_map):
        for column, dtype in type_map.items():
            if dtype == 'numeric':
                self.df[column] = pd.to_numeric(self.df[column], errors='coerce')
            elif dtype == 'datetime':
                self.df[column] = pd.to_datetime(self.df[column], errors='coerce')
            else:
                self.df[column] = self.df[column].astype(dtype)
        self.report['types_converted'] = list(type_map.keys())
        return self

    def clean_text(self, columns):
        for col in columns:
            self.df[col] = self.df[col].str.strip().str.lower()
        self.report['text_columns_cleaned'] = columns
        return self

    def get_result(self):
        return self.df, self.report
The key design decision here is that every method returns self, which enables method chaining. The report dictionary accumulates a log of everything that was done, giving you an automatic audit trail. Usage is clean and self-documenting:
messy_df = pd.DataFrame({
'name': [' ALICE ', 'bob', ' ALICE ', 'david'],
'age': ['25', '30', '25', 'unknown'],
'salary': ['50000', '60000', '50000', '70000']
})
cleaner = DataCleaner(messy_df)
clean_df, report = (cleaner
.remove_duplicates(subset=['name', 'age'])
.handle_missing(strategy='drop')
.convert_types({'age': 'numeric', 'salary': 'numeric'})
.clean_text(['name'])
.get_result())
print(clean_df)
print("\nReport:")
print(report)
Output:
name age salary
0 alice 25.0 50000
1 bob 30.0 60000
3 david NaN 70000
Report:
{'duplicates_removed': 1, 'missing_strategy': 'drop', 'types_converted': ['age', 'salary'], 'text_columns_cleaned': ['name']}
The chained call reads almost like a specification of what was done. Note one subtlety in the output: 'unknown' wasn't dropped, because handle_missing ran while age was still a string column; it only became NaN when convert_types coerced it afterward. When a new batch of data arrives next month, you run the same pipeline and get the same transformations applied consistently.
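That reuse is worth making explicit. Here is a plain-function variant of the same idea, a sketch with made-up monthly batches and a hypothetical clean_batch helper, kept self-contained rather than depending on the class above:

```python
import pandas as pd

def clean_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Apply one standard cleaning recipe to any batch with the same schema."""
    out = df.copy()
    out = out.drop_duplicates(subset=['name'])
    out['name'] = out['name'].str.strip().str.lower()
    out['amount'] = pd.to_numeric(out['amount'], errors='coerce')
    return out

january = pd.DataFrame({'name': [' Ann ', 'Bo'], 'amount': ['10', '20']})
february = pd.DataFrame({'name': ['Cy', 'Cy'], 'amount': ['30', 'x']})

jan_clean, feb_clean = clean_batch(january), clean_batch(february)
print(feb_clean)
```

Every batch goes through identical steps, so month-over-month comparisons are comparing like with like.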
Before and After: A Real Example
Let's clean a realistic messy dataset from start to finish, showing the full before state so you can see exactly how much work pandas is doing for you.
# Simulated messy CSV import
messy_customers = pd.DataFrame({
'customer_id': [1001, 1002, 1003, 1001, 1005, None, 1007],
'name': ['Alice', 'Bob', ' CAROL ', 'Alice', 'eve', 'frank', ' GRACE'],
'email': ['alice@ex.com', None, 'carol@ex.com', 'alice@ex.com', 'eve@ex.com', 'frank@ex.com', 'grace@ex.com'],
'signup_date': ['2024-01-15', '2024-02-20', '2024-01-15', '2024-01-15', 'invalid-date', '2024-03-01', '2024-02-10'],
'lifetime_value': ['5000', '12000', '8500', '5000', '3200', None, '9999']
})
print("BEFORE:")
print(messy_customers)
print(f"\nShape: {messy_customers.shape}")
print(f"\nDtypes:\n{messy_customers.dtypes}")
print(f"\nMissing:\n{messy_customers.isnull().sum()}")
Output:
BEFORE:
customer_id name email signup_date lifetime_value
0 1001.0 Alice alice@ex.com 2024-01-15 5000
1 1002.0 Bob None 2024-02-20 12000
2 1003.0 CAROL carol@ex.com 2024-01-15 8500
3 1001.0 Alice alice@ex.com 2024-01-15 5000
4 1005.0 eve eve@ex.com invalid-date 3200
5 NaN frank frank@ex.com 2024-03-01 None
6 1007.0 GRACE grace@ex.com 2024-02-10 9999
Shape: (7, 5)
Dtypes:
customer_id float64
name object
email object
signup_date object
lifetime_value object
dtype: object
Missing:
customer_id 1
name 0
email 1
signup_date 0
lifetime_value 1
dtype: int64
Now clean it with the pipeline:
# Clean
clean_customers = (DataCleaner(messy_customers)
    .remove_duplicates(subset=['customer_id', 'email'])
    .handle_missing(strategy='drop')
    .convert_types({
        'customer_id': 'int',
        'signup_date': 'datetime',
        'lifetime_value': 'numeric'
    })
    .clean_text(['name'])
    .get_result())
clean_df, report = clean_customers
print("\nAFTER:")
print(clean_df)
print(f"\nShape: {clean_df.shape}")
print(f"\nDtypes:\n{clean_df.dtypes}")
print(f"\nReport:\n{report}")
Output:
AFTER:
customer_id name email signup_date lifetime_value
0 1001 alice alice@ex.com 2024-01-15 5000
2 1003 carol carol@ex.com 2024-01-15 8500
4 1005 eve eve@ex.com NaT 3200
6 1007 grace grace@ex.com 2024-02-10 9999
Shape: (4, 5)
Dtypes:
customer_id int64
name object
email object
signup_date datetime64[ns]
lifetime_value int64
dtype: object
Report:
{'duplicates_removed': 1, 'missing_strategy': 'drop', 'types_converted': ['customer_id', 'signup_date', 'lifetime_value'], 'text_columns_cleaned': ['name']}
We went from 7 rows with mixed types and missing values to 4 rows of clean, consistent data: the duplicate row and both rows with nulls are gone, types are fixed, text is standardized. One subtlety worth noticing: eve's 'invalid-date' was coerced to NaT rather than dropped, because type conversion runs after the missing-value step, so a final validation pass is still worthwhile. The pipeline documents every transformation in the report, so anyone who inherits this code can see exactly what was done.
Why Data Cleaning Is an Investment, Not Overhead
Let's close with a mindset shift that will serve you well throughout your data career. Data cleaning is not the boring prerequisite that you endure before the real work starts. It is a core analytical activity that requires domain knowledge, statistical thinking, and engineering skill in roughly equal measure.
Every decision you make during cleaning (which rows to drop, what to fill missing values with, how to handle outliers) encodes assumptions about your data. Those assumptions shape every analysis and model that comes after. When someone asks "why does the model predict X for this customer?", the answer might ultimately trace back to a cleaning decision you made three months ago. That's why cleaning deserves the same rigor and documentation as your modeling work.
The practitioners who are genuinely excellent at this treat data cleaning as a form of data archaeology. You're uncovering what actually happened versus what the data says happened, and reconciling the two. That process teaches you things you can't learn from clean data alone: which fields are actually reliable, which business processes introduce systematic errors, where the edges of your data's validity are. That knowledge is as valuable as any model you'll build from the clean version.
Keep the pipeline class. Add to it as you encounter new patterns. Document your decisions. And the next time someone implies that data cleaning is the unglamorous part of the job, remember that it's the part that determines whether everything else is trustworthy.
Key Takeaways
- Missing data detection is your first step: use isnull(), notna(), and info() to see the landscape before you touch anything.
- Choose your handling strategy wisely: drop if loss is acceptable and missingness is random, fill if you need to preserve row count and can make a reasonable estimate, interpolate for time-series and ordered numeric data.
- Remove duplicates with drop_duplicates() before running any analysis or building any model; duplicates inflate the apparent frequency of certain patterns and will bias your results.
- Type conversion matters: use astype() for simple cases on clean data, pd.to_numeric() with errors='coerce' for messy numeric data, and pd.to_datetime() for dates in any format.
- String cleaning is fast with the str accessor: strip whitespace first, standardize case, then use regex replace for complex formatting issues.
- Categorical types save substantial memory on columns with repeated string values; an 8x reduction is not unusual.
- Outlier detection with IQR and z-score catches data entry errors and legitimate anomalies that would otherwise distort your statistics.
- Build pipelines for repeatability: a class-based cleaner turns manual steps into a reusable, auditable tool you can apply to new data batches consistently.
Data cleaning isn't glamorous. But it's foundational. Get it right, and everything downstream (analysis, modeling, insights) becomes reliable. Get it wrong, and you'll spend weeks wondering why your model is behaving strangely, only to discover the answer was in the preprocessing all along.
What's Next?
You've cleaned your data. Now transform it. The next article covers merging, joining, grouping, and pivoting, reshaping your clean data into the structure you need for analysis and modeling.