November 7, 2025
Python SciPy Statistics Data Science

Statistical Analysis with scipy.stats

Here's a scenario that plays out constantly in data-driven teams: someone runs an experiment, sees a promising number, gets excited, ships the change, and then watches in confusion as nothing actually improves in production. The A/B test said conversion went up 40%. The p-value looked small. So what happened?

What happened, almost certainly, is that nobody actually understood the statistics. The number was real; the conclusion was wrong. This is the gap between having data and understanding data, and it's one of the most expensive gaps in modern software and product development. You can have the fanciest dashboard, the prettiest visualizations, and terabytes of behavioral data, but if you're drawing bad conclusions from good numbers, you're flying blind with a very confident co-pilot. The cost of this mistake isn't abstract: it's engineering hours chasing phantom improvements, product decisions that make things worse, and, eventually, a loss of trust in the entire data-driven decision-making process.

That's why scipy.stats matters, and that's why this article exists. We're going to walk through not just the code, but the actual reasoning behind statistical analysis. You'll learn what descriptive statistics really tell you, how probability distributions model the real world, and, critically, how hypothesis testing lets you separate genuine signal from random noise. Along the way, we'll also address one of the most overlooked skills in applied statistics: knowing which test to choose in the first place. That choice can make the difference between a valid conclusion and a misleading one, and it's a skill that textbooks often skip. By the time we're done, you'll have a toolkit you can actually trust.

The running example throughout this guide is user session duration data. It's concrete, it's relatable, and it maps cleanly to real decisions you'd make in product analytics. By the end, you'll be able to take any dataset, ask meaningful questions about it, and back up your answers with rigorous evidence. That's what this is really about, not tests and p-values as ends in themselves, but as tools for making better decisions with more confidence.

Table of Contents
  1. Why Statistics Matters for ML
  2. Setting Up Your Toolkit
  3. Descriptive Statistics: The Foundation
  4. Mean, Median, Mode, and Beyond
  5. Variance and Standard Deviation: Understanding Spread
  6. Skewness and Kurtosis: Distribution Shape
  7. Distributions Intuition
  8. Probability Distributions: The Patterns Behind Data
  9. The Normal Distribution
  10. Other Critical Distributions
  11. Hypothesis Testing Mental Model
  12. Hypothesis Testing: The Statistical Verdict
  13. The Framework
  14. One-Sample t-Test
  15. Two-Sample t-Test: Comparing Groups
  16. Paired t-Test: Before-After Comparisons
  17. Chi-Square Test: Categorical Data
  18. Correlation: Finding Relationships
  19. ANOVA: Comparing Multiple Groups
  20. Confidence Intervals: The Uncertainty Band
  21. Bootstrap: Resampling Power
  22. Practical A/B Test Walkthrough
  23. Choosing the Right Test
  24. Common Statistical Mistakes
  25. Common Pitfalls to Avoid
  26. Visualization: Make Results Clear
  27. Statistical Significance in Practice
  28. Wrapping Up: Statistical Thinking

Why Statistics Matters for ML

Before we get hands-on, let's address a question that comes up constantly: if I'm going to be doing machine learning, do I really need to understand classical statistics? The answer is a firm, unambiguous yes, and here's why.

Machine learning models don't learn in a vacuum. They learn from data, and data has statistical properties that fundamentally shape what a model can and cannot learn. When your training data is heavily skewed, when classes are imbalanced, when features have wildly different variances, all of these are statistical properties that determine whether your model trains well or catastrophically overfits. Understanding distributions and variance isn't academic background knowledge; it's practical model-building intelligence.

Hypothesis testing shows up directly in ML workflows more than most people realize. When you're doing feature selection, you're essentially testing whether a feature has a statistically significant relationship with the target variable. When you're comparing two models, you're testing whether the performance difference is real or could be explained by random variation in your test set. When you're doing A/B testing to evaluate a deployed model against a baseline, you're doing hypothesis testing. The formalism of "null hypothesis" and "p-value" isn't some separate academic exercise, it's the rigorous way to ask whether your model improvement is genuine.

Statistical thinking also saves you from some of the most common and costly ML mistakes. The classic example is data leakage: accidentally including future information in your training features. You'd notice this immediately if you looked at the statistical relationship between that feature and your target, the correlation would be suspiciously perfect. Outlier detection, cross-validation design, confidence intervals around model metrics, these all draw directly on statistical concepts. Learning scipy.stats now means you'll approach ML problems with a much sharper analytical lens when we get there.

Setting Up Your Toolkit

Let's start with imports. You'll need numpy for data handling, scipy.stats for the statistical magic, and matplotlib or seaborn for visualization. These four libraries form the backbone of nearly every data analysis workflow in Python, and they play together beautifully.

python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
 
# Set random seed for reproducibility
np.random.seed(42)
 
# Generate sample data: user session durations (in minutes)
sessions = np.array([5, 8, 12, 7, 15, 9, 11, 6, 13, 8,
                     10, 7, 14, 9, 8, 12, 11, 6, 9, 10])

Notice we're using realistic data: session durations. This isn't some arbitrary array of numbers, it represents something a real analyst would look at, with the natural variability and shape you'd expect from user behavior. Throughout this article, we'll build on this example, asking progressively more sophisticated questions about what it tells us.

Descriptive Statistics: The Foundation

Descriptive stats summarize your data in digestible numbers. But "summary" undersells it. A good set of descriptive statistics is like a personality profile for your dataset, it tells you where data tends to cluster, how spread out it is, and whether it has any unusual shape or quirks you should know about before diving into more complex analysis.

The most important thing to understand about descriptive statistics is that no single number tells the whole story. That's why we compute several of them together.

Mean, Median, Mode, and Beyond

python
# Basic measures
mean = np.mean(sessions)
median = np.median(sessions)
mode = stats.mode(sessions, keepdims=True)
 
print(f"Mean: {mean:.2f} minutes")
print(f"Median: {median:.2f} minutes")
print(f"Mode: {mode.mode[0]} minutes (appears {mode.count[0]} times)")

Output:

Mean: 9.50 minutes
Median: 9.00 minutes
Mode: 8 minutes (appears 3 times)

Why do these differ? The mean captures overall magnitude; the median resists outliers; the mode shows the most common value. For skewed data, median often tells a truer story. Think about income distributions: the mean household income in a country looks respectable, but the median tells you what a typical household actually earns, and when a handful of billionaires are in your dataset, those two numbers can be dramatically different. For session duration, because our mean and median are close, we can already infer the data is fairly symmetric. That's useful information, and we got it before writing a single test.
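To see how fragile the mean really is, here's a quick sketch with hypothetical numbers: one marathon 480-minute "session" (say, a tab left open overnight) drags the mean far away while the median barely moves.

```python
import numpy as np

# Hypothetical sessions, then the same data with one extreme outlier
normal_day = np.array([5, 8, 12, 7, 15, 9, 11, 6, 13, 8])
with_outlier = np.append(normal_day, 480)  # a tab left open overnight

print(f"Without outlier: mean={np.mean(normal_day):.1f}, median={np.median(normal_day):.1f}")
print(f"With outlier:    mean={np.mean(with_outlier):.1f}, median={np.median(with_outlier):.1f}")
```

The mean jumps from roughly 9 to over 50 minutes; the median moves by half a minute. That robustness is exactly why the median is the better headline number for skewed data.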

Variance and Standard Deviation: Understanding Spread

python
variance = np.var(sessions, ddof=1)  # ddof=1 gives the sample (unbiased) variance
std_dev = np.std(sessions, ddof=1)   # np.var/np.std default to the population formula
 
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Coefficient of Variation: {(std_dev/mean)*100:.1f}%")

Standard deviation tells you how far, on average, values stray from the mean. A CV (coefficient of variation) normalizes this for datasets with different scales. The CV is especially useful when you want to compare variability across different datasets, two datasets might have the same standard deviation but very different means, making the CV the fairer comparison. High variability in user sessions might mean you have two distinct user segments behaving differently, which is worth investigating.
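A quick illustration with made-up numbers: two metrics with identical standard deviations but very different means. Only the CV reveals which one is actually noisy relative to its scale.

```python
import numpy as np

# Two hypothetical metrics with the same absolute spread, different scales
session_minutes = np.array([8, 10, 12, 9, 11])   # mean ~10
page_loads = np.array([98, 100, 102, 99, 101])   # mean ~100

for name, data in [("sessions", session_minutes), ("page loads", page_loads)]:
    cv = np.std(data) / np.mean(data) * 100
    print(f"{name}: std={np.std(data):.2f}, CV={cv:.1f}%")
```

Both datasets have a standard deviation of about 1.41, but the sessions vary by ~14% of their mean while page loads vary by ~1.4%: a tenfold difference in relative variability that the raw standard deviation completely hides.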

Skewness and Kurtosis: Distribution Shape

python
skewness = stats.skew(sessions)
kurtosis = stats.kurtosis(sessions)
 
print(f"Skewness: {skewness:.3f}")
print(f"Kurtosis: {kurtosis:.3f}")

Output:

Skewness: 0.307
Kurtosis: -0.748

Skewness near 0 means symmetric data. Positive skew means a tail toward higher values (right skew). Kurtosis measures tail heaviness, high kurtosis means extreme outliers are more likely than a normal distribution would predict. A negative kurtosis here means our data is slightly platykurtic, flatter and with lighter tails than a normal distribution. In practical terms, that means we have fewer extreme values than expected, which is actually good news for applying parametric tests. These numbers might look like academic trivia, but they're early-warning signals for whether your data will cooperate with the statistical tests you want to run.

Distributions Intuition

One of the most powerful shifts in statistical thinking is moving from "I have data" to "my data was generated by some underlying process." That process follows a distribution, a mathematical description of how likely different values are. When you can identify the distribution your data follows, you gain enormous analytical leverage.

Think about it this way: if you know user sessions follow an approximately normal distribution, you can immediately say things like "95% of sessions will fall within two standard deviations of the mean" without looking at every data point. You can predict what future data will look like. You can spot anomalies, values that are so unlikely under the assumed distribution that something must have changed. This is the intuition that powers everything from quality control in manufacturing to fraud detection in finance.
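That "95% within two standard deviations" claim is something you can verify directly from the CDF. Here's a quick check using a hypothetical normal session-duration model (mean 10, std 2):

```python
from scipy import stats

dist = stats.norm(loc=10, scale=2)  # hypothetical session-duration model

# Probability mass within k standard deviations of the mean (the 68-95-99.7 rule)
for k in (1, 2, 3):
    p = dist.cdf(10 + k * 2) - dist.cdf(10 - k * 2)
    print(f"Within {k} std dev: {p * 100:.1f}%")
```

The three printed values are the famous 68-95-99.7 rule, recovered from first principles rather than memorized.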

Different phenomena naturally follow different distributions. The normal distribution shows up when your outcome is the result of many small, independent factors adding together, think human height, measurement error, or noise in a sensor reading. The Poisson distribution models count data where events happen randomly over time or space, server requests per second, bug reports per sprint, customer arrivals per hour. The binomial distribution handles discrete yes/no outcomes repeated multiple times, conversion or not, click or not, pass or fail. Understanding which distribution fits your problem means understanding something fundamental about the process that generated your data.

The practical upside of this thinking is test selection. Most statistical tests make assumptions about the distribution of your data. A t-test assumes approximate normality. A chi-square test assumes you're working with count data from a categorical variable. Poisson regression makes sense for count outcomes. When you can look at your data and say "this looks Poisson-distributed," you immediately know which analytical tools are appropriate, and which ones will give you misleading results.

Probability Distributions: The Patterns Behind Data

Real-world data follows patterns. scipy.stats lets you work with these distributions directly, treating them as objects you can query, sample from, and use to answer probability questions.

The Normal Distribution

The normal (Gaussian) distribution is ubiquitous. Let's work with it:

python
# Create a normal distribution with mean=10, std=2
dist = stats.norm(loc=10, scale=2)
 
# Probability density at x=12
pdf_value = dist.pdf(12)
print(f"PDF at x=12: {pdf_value:.4f}")
 
# Cumulative probability: P(X ≤ 12)
cdf_value = dist.cdf(12)
print(f"P(X ≤ 12): {cdf_value:.4f}")
 
# Percent point function: What x gives us P(X ≤ x) = 0.95?
percentile_95 = dist.ppf(0.95)
print(f"95th percentile: {percentile_95:.2f}")
 
# Generate random samples
samples = dist.rvs(size=10000)

This workflow, PDF (probability density), CDF (cumulative), PPF (percent point), RVS (random samples), applies to every distribution. Once you understand these four methods, you can work with any of the 100+ distributions in scipy.stats using the exact same interface. The loc parameter shifts the center of the distribution, and scale stretches or compresses it. For a normal distribution, loc is the mean and scale is the standard deviation.
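To see that the interface really is uniform, here's the same set of methods applied to a different distribution: an exponential with a hypothetical mean wait of 5 seconds between requests.

```python
from scipy import stats

# Same four-method interface on an exponential distribution
# (hypothetical time between requests, mean 5 seconds)
expon = stats.expon(scale=5)

print(f"PDF at t=5: {expon.pdf(5):.4f}")
print(f"P(T <= 5): {expon.cdf(5):.4f}")       # ~0.6321, i.e. 1 - 1/e
print(f"Median wait: {expon.ppf(0.5):.2f}s")  # ~3.47s, well below the mean
samples = expon.rvs(size=1000, random_state=42)
print(f"Sample mean: {samples.mean():.2f}s")
```

Notice the median (~3.47s) sits below the mean (5s): the exponential is right-skewed, and the pdf/cdf/ppf trio lets you see that without plotting anything.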

Other Critical Distributions

python
# Student's t-distribution (used in t-tests)
t_dist = stats.t(df=19)  # 20 samples means 19 degrees of freedom
t_ppf = t_dist.ppf(0.975)  # Two-tailed critical value
print(f"t-critical (n=20, α=0.05): {t_ppf:.3f}")
 
# Chi-square distribution (for categorical tests)
chi2_dist = stats.chi2(df=5)
chi2_ppf = chi2_dist.ppf(0.95)
print(f"Chi-square critical (df=5, α=0.05): {chi2_ppf:.3f}")
 
# Binomial distribution: coin flips
binom_dist = stats.binom(n=10, p=0.5)
prob_exactly_6_heads = binom_dist.pmf(6)
print(f"P(exactly 6 heads in 10 flips): {prob_exactly_6_heads:.4f}")
 
# Poisson distribution: events per interval
poisson_dist = stats.poisson(mu=3)  # Average 3 errors per day
prob_4_errors = poisson_dist.pmf(4)
print(f"P(4 errors in a day): {prob_4_errors:.4f}")

Each distribution models different data types. Binomial for discrete yes/no outcomes, Poisson for count data (errors, arrivals), t for small samples, chi-square for categorical tests. Notice the t-distribution needs df (degrees of freedom), which is typically your sample size minus one. With more degrees of freedom, the t-distribution gets progressively closer to a normal distribution, which makes intuitive sense, because with larger samples you have more information and less uncertainty.
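You can watch that convergence happen by printing the two-tailed critical value for increasing degrees of freedom:

```python
from scipy import stats

# t critical values shrink toward the normal's 1.96 as df grows
for df in (5, 20, 100, 1000):
    print(f"df={df:>4}: t-critical = {stats.t.ppf(0.975, df):.3f}")
print(f"normal: z-critical = {stats.norm.ppf(0.975):.3f}")
```

At df=5 the critical value is about 2.57; by df=1000 it's essentially the normal's 1.96. This is why the t-test matters most for small samples and matters less and less as n grows.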

Hypothesis Testing Mental Model

Before writing a single line of hypothesis testing code, it's worth building the right mental model for what you're actually doing. Most people learn hypothesis testing procedurally, compute this, compare to that, reject if smaller, without developing the intuitive feel for what p-values actually mean.

Here's the core idea: you start by assuming the most boring possible explanation is true. That's the null hypothesis. Then you look at your data and ask: "How surprising would this data be if the boring explanation were true?" The p-value quantifies that surprise. A small p-value means "this data would be really unusual if nothing interesting was happening", which is evidence that something interesting actually is happening.

The critical thing to internalize is what a p-value is not. It is not the probability that the null hypothesis is true. It is not the probability that you made an error. It is not a measure of effect size or practical importance. It's purely a measure of compatibility: how compatible is this data with the null hypothesis? A small p-value means low compatibility, your data and the null hypothesis don't fit together well, so you reject the null.

The significance threshold (usually 0.05) is a convention, not a law of nature. Setting α = 0.05 means accepting that 5% of the time, when the null is actually true, you'll reject it anyway just due to random variation. This is your Type I error rate. In medical research where false positives are costly, you'd use 0.01 or even 0.001. In exploratory data analysis where you want to cast a wide net, 0.10 might be appropriate. The key is deciding your threshold before you look at the data; choosing it after you've seen the p-value is just moving the goalposts.
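That 5% guarantee can be checked by simulation. The sketch below uses hypothetical data where the null is true by construction (both groups drawn from the same distribution), runs 2,000 t-tests, and counts how often they falsely reject:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_experiments = 0.05, 2000

# Null is TRUE by construction: both groups come from the same distribution
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(10, 2, size=30)
    b = rng.normal(10, 2, size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"False positive rate: {false_positives / n_experiments:.3f}  (expected ~{alpha})")
```

The realized rate hovers right around 0.05, which is the whole point: α isn't a mystical quantity, it's a false-alarm budget you can verify empirically.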

Hypothesis Testing: The Statistical Verdict

Hypothesis testing answers: "Is this effect real, or just random noise?"

The Framework

Every test follows this logic:

  1. Null hypothesis (H₀): No effect exists
  2. Alternative hypothesis (H₁): An effect exists
  3. Test statistic: A number summarizing your evidence
  4. P-value: Probability of observing this data if H₀ is true
  5. Significance level (α): Threshold for rejecting H₀ (usually 0.05)

If p-value < α, reject H₀. Otherwise, fail to reject it. The smaller the p-value, the stronger your evidence.

One-Sample t-Test

Does your group differ from a hypothesized value?

python
# Claim: Average session duration is 8 minutes
# Our data suggests it's higher. Let's test.
 
null_mean = 8
t_statistic, p_value = stats.ttest_1samp(sessions, null_mean)
 
print(f"t-statistic: {t_statistic:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Mean session duration: {np.mean(sessions):.2f} minutes")
 
if p_value < 0.05:
    print("✓ Sessions are significantly longer than 8 minutes (p < 0.05)")
else:
    print("✗ No significant difference from 8 minutes")

Output:

t-statistic: 2.428
p-value: 0.0253
Mean session duration: 9.50 minutes
✓ Sessions are significantly longer than 8 minutes (p < 0.05)

The small p-value (0.0253) tells us: if the true mean were 8 minutes, we'd see data this extreme only about 2.5% of the time. That's solid evidence against the null. The t-statistic of 2.428 measures how many standard errors our sample mean sits above the hypothesized value of 8; the larger that number, the more our data departs from the null hypothesis. (Note that ttest_1samp is two-sided by default; for the directional claim "longer than 8 minutes," you can pass alternative='greater'.) Once you run this pattern a few times, it becomes second nature: compute, check the p-value, translate the result into plain language.

Two-Sample t-Test: Comparing Groups

Your new UI launches. Do users stay longer?

python
# Old UI session durations
old_ui = np.array([7, 8, 6, 9, 7, 8, 6, 7, 9, 8])
 
# New UI session durations
new_ui = np.array([10, 12, 9, 11, 13, 10, 12, 11, 10, 11])
 
# Assume equal variances
t_stat, p_val = stats.ttest_ind(new_ui, old_ui)
 
print(f"Old UI mean: {np.mean(old_ui):.2f} min")
print(f"New UI mean: {np.mean(new_ui):.2f} min")
print(f"Difference: {np.mean(new_ui) - np.mean(old_ui):.2f} min")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_val:.4f}")
 
if p_val < 0.05:
    print("✓ New UI significantly improves session duration")
else:
    print("✗ No significant improvement detected")

If variances differ, use equal_var=False for Welch's t-test. Always check assumptions! The question of equal versus unequal variances matters more than people realize: if you assume equal variances when they're not, you can end up with an anti-conservative test whose p-values are too small, leading you to see effects that aren't there. When in doubt, Welch's t-test is the safer default: it performs nearly as well as the standard t-test when variances are equal, and much better when they're not.
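Here's a sketch of the difference, using the old-UI data plus a hypothetical noisy variant whose variance is far larger. With equal group sizes the two t-statistics coincide exactly, but Welch's smaller effective degrees of freedom produces a more cautious p-value:

```python
import numpy as np
from scipy import stats

old_ui = np.array([7, 8, 6, 9, 7, 8, 6, 7, 9, 8])
# Hypothetical variant with a much noisier response
noisy_ui = np.array([6, 15, 8, 14, 5, 16, 7, 13, 6, 15])

t_pooled, p_pooled = stats.ttest_ind(noisy_ui, old_ui)                 # assumes equal variances
t_welch, p_welch = stats.ttest_ind(noisy_ui, old_ui, equal_var=False)  # Welch's t-test

print(f"Pooled: t={t_pooled:.3f}, p={p_pooled:.4f}")
print(f"Welch:  t={t_welch:.3f}, p={p_welch:.4f}")
```

Welch's p-value is larger here, reflecting the extra uncertainty the unequal variances introduce. In borderline cases like this one, that difference can flip a "significant" result to "not significant," which is exactly the kind of honesty you want.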

Paired t-Test: Before-After Comparisons

Users try your feature. Same users, measured twice:

python
# Session duration: before feature, after feature
before = np.array([8, 7, 6, 9, 8, 7, 9, 8, 7, 8])
after = np.array([10, 9, 8, 12, 10, 8, 11, 10, 9, 10])  # improvements vary by user
 
t_stat, p_val = stats.ttest_rel(after, before)
 
print(f"Mean before: {np.mean(before):.2f} min")
print(f"Mean after: {np.mean(after):.2f} min")
print(f"Mean difference: {np.mean(after - before):.2f} min")
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_val:.4f}")

Paired tests control for individual variation, more powerful when you have matched observations. This is a genuinely important design choice. If you have the same users measured before and after, using a paired test rather than an independent samples test isn't just optional, it's the correct thing to do. The paired design essentially subtracts away user-to-user variation (the fact that some people always have long sessions and others always have short ones), leaving only the signal from your manipulation.
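A small simulation (hypothetical users, each with their own baseline) makes the power difference concrete: the paired test strips out the baseline variation and finds the roughly 1-minute effect that the independent test typically buries in noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical: each user has a personal baseline; the feature adds ~1 minute on average
baseline = rng.normal(10, 3, size=15)             # user-to-user variation is large
before = baseline + rng.normal(0, 0.5, size=15)
after = baseline + 1 + rng.normal(0, 0.5, size=15)

_, p_independent = stats.ttest_ind(after, before)  # ignores the pairing
_, p_paired = stats.ttest_rel(after, before)       # subtracts each user's baseline

print(f"Independent-samples p: {p_independent:.4f}")
print(f"Paired p:              {p_paired:.6f}")
```

Same data, same effect, wildly different p-values. The pairing is information; the independent test throws it away.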

Chi-Square Test: Categorical Data

Is the distribution of categories random, or patterned?

python
# Feature adoption across three user cohorts
# Expected: 50/30/20 split
observed = np.array([52, 28, 20])  # Actual adoption counts
expected = np.array([50, 30, 20])  # Expected split
 
chi2_stat, p_val = stats.chisquare(observed, expected)
 
print(f"Chi-square statistic: {chi2_stat:.3f}")
print(f"p-value: {p_val:.4f}")
 
if p_val < 0.05:
    print("✓ Adoption distribution differs from expected")
else:
    print("✗ Distribution matches expectation")

Use chi2_contingency() for independence tests (does feature adoption relate to user segment?).

python
# Contingency table: Feature use × User segment
table = np.array([[45, 30],   # Segment A: 45 adopted, 30 didn't
                  [25, 55]])  # Segment B: 25 adopted, 55 didn't
 
chi2, p_val, dof, expected = stats.chi2_contingency(table)
 
print(f"Chi-square: {chi2:.3f}, p-value: {p_val:.4f}")
print("Association between segment and adoption:",
      "YES" if p_val < 0.05 else "NO")

The chi-square test for independence is one of the workhorses of product analytics. Any time you have two categorical variables and you want to know if they're related, user segment and adoption, device type and churn, plan tier and support contact rate, this is your tool. The contingency table format is worth getting comfortable with: rows are one variable, columns are the other, and each cell contains the count of observations with that particular combination of values.

Correlation: Finding Relationships

Do two variables move together?

python
# User age and session duration
age = np.array([25, 32, 45, 38, 22, 50, 28, 35, 40, 29])
duration = np.array([8, 10, 12, 11, 7, 14, 9, 11, 13, 10])
 
# Pearson correlation (assumes linear relationship)
pearson_r, pearson_p = stats.pearsonr(age, duration)
print(f"Pearson r: {pearson_r:.3f}, p-value: {pearson_p:.4f}")
 
# Spearman correlation (rank-based, handles non-linear)
spearman_rho, spearman_p = stats.spearmanr(age, duration)
print(f"Spearman ρ: {spearman_rho:.3f}, p-value: {spearman_p:.4f}")
 
# Kendall correlation (another rank method)
kendall_tau, kendall_p = stats.kendalltau(age, duration)
print(f"Kendall τ: {kendall_tau:.3f}, p-value: {kendall_p:.4f}")

Pearson assumes linearity; Spearman and Kendall are rank-based, robust to outliers, and handle any monotonic relationship. All return the correlation and a p-value. The coefficient itself (r, ρ, or τ) tells you direction and strength: values close to 1 indicate a strong positive relationship, values close to -1 a strong negative one, and values near 0 little relationship at all. But always look at the p-value too: a correlation of 0.8 with a p-value of 0.3 (which can happen with only a handful of points) isn't trustworthy, while a correlation of 0.4 with a p-value of 0.001 is.

ANOVA: Comparing Multiple Groups

Is there a real difference among three or more groups?

python
# Session durations across three UI versions
ui_v1 = np.array([7, 8, 6, 9, 7, 8, 6, 7])
ui_v2 = np.array([9, 10, 8, 11, 9, 10, 9, 10])
ui_v3 = np.array([8, 7, 9, 6, 8, 7, 9, 8])
 
f_stat, p_val = stats.f_oneway(ui_v1, ui_v2, ui_v3)
 
print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_val:.4f}")
print(f"Means: V1={np.mean(ui_v1):.2f}, V2={np.mean(ui_v2):.2f}, V3={np.mean(ui_v3):.2f}")
 
if p_val < 0.05:
    print("✓ At least one UI version differs significantly")
else:
    print("✗ No significant differences among versions")

ANOVA tests the null hypothesis that all group means are equal. If rejected, use post-hoc tests (like Tukey) to find which pairs differ. This is an important nuance that trips people up: ANOVA tells you that at least one group is different, but it doesn't tell you which one. You need a post-hoc test for that. The reason we bother with ANOVA rather than just running all pairwise t-tests is the multiple comparisons problem, if you test three pairs (V1 vs V2, V1 vs V3, V2 vs V3) separately at α = 0.05, your overall false positive rate is higher than 5%. ANOVA provides a single unified test before you drill down.
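ANOVA says "something differs"; a post-hoc procedure says what. Tukey's HSD is the classical choice; as a simple stand-in, here's a sketch using Bonferroni-adjusted pairwise t-tests on the three UI versions:

```python
import numpy as np
from itertools import combinations
from scipy import stats

ui_v1 = np.array([7, 8, 6, 9, 7, 8, 6, 7])
ui_v2 = np.array([9, 10, 8, 11, 9, 10, 9, 10])
ui_v3 = np.array([8, 7, 9, 6, 8, 7, 9, 8])

# Post-hoc sketch: pairwise t-tests with a Bonferroni-adjusted threshold
groups = {"V1": ui_v1, "V2": ui_v2, "V3": ui_v3}
pairs = list(combinations(groups, 2))
alpha_adjusted = 0.05 / len(pairs)  # 3 comparisons -> 0.0167

for a, b in pairs:
    _, p = stats.ttest_ind(groups[a], groups[b])
    verdict = "differs" if p < alpha_adjusted else "no difference"
    print(f"{a} vs {b}: p={p:.4f} -> {verdict} (adjusted α={alpha_adjusted:.4f})")
```

Here V2 separates from both V1 and V3, while V1 and V3 are indistinguishable, which is exactly the drill-down the omnibus F-test can't give you on its own.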

Confidence Intervals: The Uncertainty Band

A point estimate (mean = 9.5) is incomplete without uncertainty. Confidence intervals quantify that uncertainty. This is one of those concepts that sounds dry but is actually deeply practical, a confidence interval is the honest way to report a result.

python
# 95% confidence interval for mean session duration
confidence = 0.95
alpha = 1 - confidence
 
# Method 1: Using t-distribution
n = len(sessions)
mean = np.mean(sessions)
std_err = stats.sem(sessions)  # Standard error of the mean
margin_error = std_err * stats.t.ppf(1 - alpha/2, n - 1)
 
ci_lower = mean - margin_error
ci_upper = mean + margin_error
 
print(f"Mean: {mean:.2f} minutes")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
 
# Interpretation: We're 95% confident the true mean lies in this range.

Output:

Mean: 9.50 minutes
95% CI: [8.21, 10.79]

The wider the interval, the more uncertainty. Larger samples → narrower intervals (less uncertainty). The right way to read this result is: "We're 95% confident the true average session duration is somewhere between 8.21 and 10.79 minutes." Note that this doesn't say "there's a 95% chance the true mean is in this interval", the true mean is fixed; we just don't know it. What it says is that if we ran this experiment many times and computed this interval each time, 95% of those intervals would contain the true mean. It's a subtle but important distinction.
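That repeated-experiments reading can itself be checked by simulation. This sketch uses hypothetical data with a known true mean of 10, builds 2,000 intervals, and counts how many capture the truth:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, n, trials = 10, 20, 2000
covered = 0

for _ in range(trials):
    sample = rng.normal(true_mean, 2, size=n)
    sem = stats.sem(sample)
    margin = sem * stats.t.ppf(0.975, n - 1)
    # Does this interval contain the (normally unknowable) true mean?
    if sample.mean() - margin <= true_mean <= sample.mean() + margin:
        covered += 1

print(f"Intervals containing the true mean: {covered / trials * 100:.1f}%  (expected ~95%)")
```

The coverage comes out at roughly 95%, which is the literal meaning of "95% confidence": a property of the procedure, not of any single interval.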

Bootstrap: Resampling Power

When assumptions don't hold, bootstrap resampling estimates statistics without assuming distributions. It's one of the most elegant ideas in modern statistics, instead of deriving the sampling distribution mathematically, you estimate it empirically by resampling your own data.

python
# Bootstrap 95% confidence interval for the median
n_bootstrap = 10000
bootstrap_medians = []
 
np.random.seed(42)
for _ in range(n_bootstrap):
    # Resample with replacement
    resample = np.random.choice(sessions, size=len(sessions), replace=True)
    bootstrap_medians.append(np.median(resample))
 
bootstrap_medians = np.array(bootstrap_medians)
 
# 2.5th and 97.5th percentiles form the CI
ci_lower = np.percentile(bootstrap_medians, 2.5)
ci_upper = np.percentile(bootstrap_medians, 97.5)
 
print(f"Original median: {np.median(sessions):.2f} minutes")
print(f"Bootstrap 95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")

Bootstrap makes no assumptions about distributions, powerful for real-world data. The key insight is "resampling with replacement": you draw a new sample of the same size as your original, but allow each observation to be picked multiple times. Some observations appear twice, some don't appear at all. You compute your statistic on this resample, then repeat thousands of times to build up a distribution of that statistic. The spread of that distribution is your uncertainty estimate. It works for the mean, the median, the standard deviation, correlation coefficients, any statistic you can compute.
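If you'd rather not hand-roll the loop, recent SciPy versions (1.7+) ship scipy.stats.bootstrap, which handles the resampling and several CI methods for you. Note that the data is passed as a tuple of samples:

```python
import numpy as np
from scipy import stats

sessions = np.array([5, 8, 12, 7, 15, 9, 11, 6, 13, 8,
                     10, 7, 14, 9, 8, 12, 11, 6, 9, 10])

# Built-in bootstrap: data goes in as a tuple of samples
res = stats.bootstrap((sessions,), np.median, confidence_level=0.95,
                      n_resamples=10000, method='percentile', random_state=42)

print(f"Bootstrap 95% CI for median: [{res.confidence_interval.low:.2f}, "
      f"{res.confidence_interval.high:.2f}]")
```

The default method is 'BCa' (bias-corrected and accelerated), which is usually a better choice than the plain percentile method shown here; 'percentile' is used above only so the result matches the manual loop's logic.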

Practical A/B Test Walkthrough

Let's tie it together. Your startup ran an A/B test. Variant B had higher conversion. But was it luck?

python
# A/B Test Results
conversions_A = 90   # Control group
visitors_A = 1000
conversion_rate_A = conversions_A / visitors_A
 
conversions_B = 126  # Variant group
visitors_B = 1000
conversion_rate_B = conversions_B / visitors_B
 
print(f"Control (A): {conversion_rate_A*100:.1f}% conversion")
print(f"Variant (B): {conversion_rate_B*100:.1f}% conversion")
print(f"Lift: {(conversion_rate_B - conversion_rate_A)/conversion_rate_A * 100:.1f}%")
 
# Chi-square test for independence
contingency_table = np.array([
    [conversions_A, visitors_A - conversions_A],
    [conversions_B, visitors_B - conversions_B]
])
 
chi2, p_val, dof, expected = stats.chi2_contingency(contingency_table)
 
print(f"\nChi-square test:")
print(f"χ² = {chi2:.3f}, p-value = {p_val:.4f}")
 
if p_val < 0.05:
    print(f"✓ Variant B is significantly better (p < 0.05)")
    print(f"  We'd expect this difference by chance ~{p_val*100:.1f}% of the time.")
else:
    print(f"✗ No significant difference. Result could be noise.")
 
# Binomial test alternative
# H0: p_B = p_A (same conversion rate)
binom_result = stats.binomtest(conversions_B, visitors_B, conversion_rate_A,
                                alternative='greater')
print(f"\nBinomial test (one-tailed): p-value = {binom_result.pvalue:.4f}")

Output:

Control (A): 9.0% conversion
Variant (B): 12.6% conversion
Lift: 40.0%

Chi-square test:
χ² = 6.358, p-value = 0.0117
✓ Variant B is significantly better (p < 0.05)
  We'd expect this difference by chance ~1.2% of the time.

Binomial test (one-tailed): p-value = 0.0001

Statistical Interpretation: The p-value of 0.0117 means: if both variants truly converted at the same rate, we'd see a difference this large by random chance only about 1.2% of the time. That's strong evidence Variant B is genuinely better.

Business Interpretation: Launching Variant B would likely increase conversions by ~3.6 percentage points (from 9% to 12.6%). At 10,000 monthly visitors, that's ~360 additional conversions monthly.

Choosing the Right Test

One of the most practical skills in applied statistics is test selection, knowing which tool to reach for given the structure of your data and the question you're asking. Getting this wrong is surprisingly common, and it leads to results you can't trust. The good news is that a simple decision tree gets you most of the way there.

Start by identifying your outcome variable. Is it continuous (like session duration), categorical (like conversion: yes/no), or count data (like number of errors)? Continuous outcomes typically call for t-tests or ANOVA. Binary outcomes call for chi-square tests or binomial tests. Count data often needs Poisson-based approaches. This single question eliminates most of the options before you even look at sample size.

Next, consider your study design. Are you comparing one group against a known standard, two independent groups, or the same group measured twice? One group versus a benchmark: one-sample t-test. Two independent groups: two-sample t-test (or Mann-Whitney if non-normal). Same participants before and after: paired t-test. Three or more groups: ANOVA with post-hoc tests. If you need to examine relationships between two continuous variables: correlation and regression. These mappings cover the vast majority of real-world analysis situations you'll encounter.

Finally, check your assumptions. Even if you've selected the conceptually correct test, you need to verify that your data meets its requirements. The Shapiro-Wilk test checks for normality; Levene's test checks for equal variances between groups. If normality fails for small samples (n < 30), reach for the non-parametric equivalents: Mann-Whitney U instead of independent t-test, Wilcoxon signed-rank instead of paired t-test, Kruskal-Wallis instead of one-way ANOVA. These rank-based tests trade a small amount of statistical power for robustness to non-normal distributions, a trade usually worth making when the assumptions are questionable.
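The assumption-checking flow might look like this in code. The data here is hypothetical and deliberately skewed (log-normal), so the non-parametric fallback is the right call:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical skewed session data (log-normal), two groups
group_a = rng.lognormal(mean=2.0, sigma=0.6, size=25)
group_b = rng.lognormal(mean=2.3, sigma=0.6, size=25)

# Step 1: check normality (Shapiro-Wilk) -- a small p-value means "not normal"
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)
print(f"Shapiro p-values: A={p_norm_a:.4f}, B={p_norm_b:.4f}")

# Step 2: check equal variances (Levene)
_, p_levene = stats.levene(group_a, group_b)
print(f"Levene p-value: {p_levene:.4f}")

# Step 3: normality is doubtful, so fall back to the rank-based Mann-Whitney U
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"Mann-Whitney U: {u_stat:.1f}, p={p_mw:.4f}")
```

The pattern is check, then choose: the diagnostic tests decide which comparison test you're allowed to trust.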

Common Statistical Mistakes

Let's talk about failure modes, because knowing what can go wrong is just as important as knowing the right technique. These are the mistakes that show up in real analysis all the time, even from people who should know better.

The first and most dangerous mistake is p-hacking, running many tests, looking at which ones come up significant, and reporting only those. If you test 20 hypotheses at α = 0.05, you expect one false positive just by chance. If you only report the significant ones, you're essentially lying with statistics. The fix is Bonferroni correction (divide α by the number of tests) or the Benjamini-Hochberg procedure for larger numbers of tests. Better yet, pre-register your hypotheses before you look at the data.

The second common mistake is confusing statistical significance with practical significance. With a large enough sample, even a trivially small effect becomes statistically significant. If you have 10 million users, you'll detect a 0.001% improvement in conversion rate as statistically significant, but nobody cares about a 0.001% improvement. Always compute effect sizes alongside p-values, and always ask "is this change meaningful in practice, not just detectable?"

Third is ignoring test assumptions. Many people run t-tests without ever checking whether their data is approximately normal. For large samples (n > 30), the Central Limit Theorem gives you some cover, means will be approximately normal even if the underlying data isn't. But for small samples, normality matters. Outliers matter. Checking assumptions isn't optional; it's part of responsible analysis. Use the Shapiro-Wilk test or a Q-Q plot before committing to a parametric test.

Finally, there's the peeking problem in A/B testing: looking at your data repeatedly as it comes in and stopping when you reach significance. This inflates your false positive rate dramatically. If you peek at your data 10 times during collection, your actual false positive rate at α = 0.05 might be closer to 30%. Either decide your sample size in advance (power analysis) and stick to it, or use sequential testing methods that are designed for repeated looks at accumulating data.
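Deciding the sample size in advance is itself a small calculation. As a sketch, here is the standard normal-approximation formula for a two-sided, two-group comparison; it slightly undercounts relative to an exact t-based power analysis, but it's close for realistic sample sizes:

```python
import numpy as np
from scipy import stats

def sample_size_two_groups(effect_size, alpha=0.05, power=0.8):
    """Per-group n for a two-sample comparison (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_power = stats.norm.ppf(power)
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

# How many users per group to detect a medium effect (d = 0.5)
# at alpha = 0.05 with 80% power?
n = sample_size_two_groups(effect_size=0.5)
print(f"Plan for about {int(np.ceil(n))} users per group")
```

Run this before the experiment, commit to the number, and resist checking significance until the data is in.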

Common Pitfalls to Avoid

Multiple Testing: Run 20 tests, expect ~1 false positive at α=0.05. Use Bonferroni correction: divide α by number of tests.

```python
# Bonferroni correction for 5 tests
alpha_corrected = 0.05 / 5
print(f"Corrected α: {alpha_corrected:.4f}")
```

Publication Bias: Report all tests, not just significant ones. Transparent methodology builds credibility.

Practical vs. Statistical Significance: A tiny effect can be statistically significant with massive sample sizes. Does a 0.5% improvement matter?

Assumption Violations: Check normality before t-tests. Use non-parametric alternatives when assumptions fail.

```python
from scipy import stats  # assumes the sessions array from earlier sections

# Shapiro-Wilk test for normality
stat, p = stats.shapiro(sessions)
print(f"Normality test p-value: {p:.4f}")
if p > 0.05:
    print("Data appears normally distributed")
else:
    print("Consider non-parametric tests (Mann-Whitney, etc.)")
```

When the Shapiro-Wilk test flags non-normality, you have options. The Mann-Whitney U test is the non-parametric alternative to the independent samples t-test. The Wilcoxon signed-rank test replaces the paired t-test. The Kruskal-Wallis test replaces one-way ANOVA. These rank-based tests make fewer assumptions about your data's distribution and are surprisingly powerful even in cases where parametric tests would technically be valid.
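All three replacements have the same call shape in scipy.stats. A sketch with made-up, deliberately skewed session-duration data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed session durations (minutes) for two variants
control = rng.exponential(scale=10.0, size=80)
variant = rng.exponential(scale=12.0, size=80)

# Independent groups: Mann-Whitney U replaces the two-sample t-test
u_stat, u_p = stats.mannwhitneyu(control, variant, alternative='two-sided')

# Paired measurements: Wilcoxon signed-rank replaces the paired t-test
before = rng.exponential(scale=10.0, size=40)
after = before * rng.uniform(0.8, 1.1, size=40)
w_stat, w_p = stats.wilcoxon(before, after)

# Three or more groups: Kruskal-Wallis replaces one-way ANOVA
group_c = rng.exponential(scale=11.0, size=80)
h_stat, h_p = stats.kruskal(control, variant, group_c)

print(f"Mann-Whitney p = {u_p:.4f}, Wilcoxon p = {w_p:.4f}, Kruskal-Wallis p = {h_p:.4f}")
```

Because these tests operate on ranks, they are also resistant to the outliers that plague raw session-duration data.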

Visualization: Make Results Clear

Statistics live in code; insights live in plots.

```python
# Visualize A/B test with confidence intervals
import numpy as np
import matplotlib.pyplot as plt

groups = ['Control', 'Variant']
conversions = [45, 63]
visitors = [500, 500]
rates = [c / v for c, v in zip(conversions, visitors)]

# Standard error for proportions; the 95% CI is symmetric around the
# estimate, so a single error value per group covers both directions
se = [np.sqrt(r * (1 - r) / v) for r, v in zip(rates, visitors)]
errors = [1.96 * s for s in se]

plt.figure(figsize=(8, 5))
plt.bar(groups, [r * 100 for r in rates],
        yerr=[e * 100 for e in errors], capsize=10, alpha=0.7)
plt.ylabel('Conversion Rate (%)')
plt.title('A/B Test Results with 95% Confidence Intervals')
plt.ylim(0, 20)
for i, r in enumerate(rates):
    plt.text(i, r * 100 + 2, f'{r * 100:.1f}%', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
```

A single plot communicates your statistical story better than paragraphs of text. Notice we're plotting confidence intervals here, not just the point estimates. That's the difference between a visualization that looks impressive and one that's actually honest. One caution about the overlap heuristic: when 95% confidence intervals don't overlap, the difference is statistically significant, but the converse doesn't hold. Moderately overlapping intervals can still correspond to a significant difference, so treat overlap as a prompt to run the formal test, not as a verdict. Either way, it's a quick sanity check that anyone can read at a glance.

Statistical Significance in Practice

Understanding p-values is one thing; using them wisely in real projects is another. The most important habit to develop is always pairing statistical significance with effect size. A p-value tells you whether an effect likely exists; an effect size tells you how large that effect is. For the t-tests we ran, Cohen's d is the standard effect size metric: a value of 0.2 is considered small, 0.5 medium, and 0.8 large. For the A/B test chi-square, use Cramér's V. For correlation, the r value itself is the effect size.

When you report results, lead with the effect size, "conversion improved by 3.6 percentage points", and treat the p-value as supporting evidence, not the headline. A statistically significant result with a tiny effect size is often not worth acting on, regardless of what the p-value says. Conversely, a large effect that falls just short of statistical significance in a small sample is worth investigating further with more data, not dismissing outright.

Real-world statistical reasoning means holding both numbers in mind simultaneously and asking: is this finding both real and meaningful? That two-part question protects you from the most common failures in data-driven decision making, from shipping a change that produced a detectable but trivial improvement, and from killing a promising idea because your sample size was too small to detect it.
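Cohen's d isn't built into scipy.stats, but it's a few lines with NumPy. A sketch with made-up session-duration samples:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Hypothetical session durations (minutes) for two variants
rng = np.random.default_rng(7)
control = rng.normal(12.0, 4.0, size=200)
variant = rng.normal(14.0, 4.0, size=200)

d = cohens_d(variant, control)
print(f"Cohen's d: {d:.2f}")  # near 0.5 here: a 'medium' effect
```

Report d alongside the t-test's p-value so readers see both whether the effect is real and how big it is.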

Wrapping Up: Statistical Thinking

scipy.stats isn't just a toolkit, it's a framework for rigorous thinking. Every hypothesis test, every confidence interval, every correlation coefficient is an argument grounded in probability.

The goal isn't to collect p-values like trading cards. The goal is to reason clearly about uncertainty. Data never tells you what happened with certainty; it updates your beliefs about what probably happened. Statistics is the language for expressing and communicating that uncertainty in a principled, reproducible way. A p-value of 0.03 doesn't mean you're right, it means you have reasonable evidence. A p-value of 0.06 doesn't mean nothing happened, it means your evidence didn't clear an admittedly arbitrary threshold. Understanding the framework deeply means you can use the numbers sensibly rather than treating them as verdicts from an oracle.

What separates a competent analyst from an exceptional one is rarely the ability to run more tests, it's the discipline to choose the right test, check the assumptions honestly, report results with appropriate uncertainty, and resist the temptation to over-interpret noisy data. Every technique we covered in this article serves that larger goal. Descriptive statistics tell you what you're working with. Distribution knowledge tells you what model fits your data. Hypothesis testing separates signal from noise. Confidence intervals and bootstrapping keep you honest about what you don't know. These aren't isolated tools; they're a coherent way of thinking.

As you move forward in data science and machine learning, you'll find that statistical thinking permeates everything. Feature importance is a statistical measure. Model evaluation is hypothesis testing. Cross-validation is a method for reducing variance in your performance estimates. Bayesian optimization is explicit probability reasoning. Every step of the way, the concepts from this article (distributions, uncertainty, testing, evidence) will show up in different clothing. The investment you're making now in understanding the fundamentals will pay dividends you won't fully appreciate until you're in the middle of a complex ML debugging session and you realize exactly why your model is behaving the way it is.

In ML workflows specifically, the statistical toolkit we built here slots in at multiple points. During data exploration, descriptive statistics and normality tests tell you whether to apply transformations before feeding data to your model, log-transforming a skewed feature can meaningfully improve linear model performance. During feature selection, correlation tests and chi-square independence tests help you identify which variables actually carry predictive signal, reducing overfitting risk before training even starts. When comparing two model architectures, a paired t-test on cross-validation scores gives you a rigorous answer to "is Model B actually better than Model A, or did it just get luckier on this particular split?" And when you deploy a new model version alongside an old one, everything you learned about A/B testing applies directly: design the experiment properly, choose your sample size in advance, check your significance after collecting enough data, and report effect sizes alongside p-values so stakeholders understand the practical magnitude of the improvement. Statistics is not a prerequisite you complete before doing "real" ML work, it is the foundation the entire ML workflow stands on.
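The model-comparison step can be sketched directly, with made-up cross-validation scores for two hypothetical models evaluated on the same folds:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy scores from the same 10 CV folds for two models
scores_a = np.array([0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.79, 0.82])
scores_b = np.array([0.83, 0.80, 0.85, 0.82, 0.83, 0.80, 0.84, 0.81, 0.80, 0.84])

# Paired t-test: the shared folds link each pair of observations
t_stat, p_value = stats.ttest_rel(scores_b, scores_a)
mean_gain = (scores_b - scores_a).mean()
print(f"Mean gain: {mean_gain:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

Pairing on folds removes fold-to-fold variability from the comparison, which is exactly why the paired test is more sensitive here than treating the two score lists as independent samples.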

Key Takeaways:

  • Descriptive stats summarize; hypothesis tests answer yes/no questions
  • Choose your test based on data type and experimental design
  • P-values quantify evidence; confidence intervals quantify uncertainty
  • Always check assumptions; use non-parametric alternatives when they fail
  • Bootstrap is your friend when theory breaks down
  • Visualization and transparency trump complexity

Next article, we'll apply these concepts in a full exploratory data analysis workflow. You'll synthesize visualization, statistics, and storytelling to turn raw data into decisions. That's where this all comes together, not as isolated techniques, but as a coherent approach to making sense of the world through data.
