ZetCode

Python SciPy Stats

last modified March 8, 2025

This tutorial explores statistical analysis in Python using the scipy.stats module, part of the SciPy library, ideal for advanced data science tasks.

The scipy.stats module offers tools for descriptive statistics, probability distributions, and hypothesis testing, far exceeding the basic capabilities of Python's statistics module.

Install SciPy

$ pip install scipy

Install SciPy using pip to access the stats module and its powerful statistical functions.

SciPy Stats Mean and Median

The mean is the average of a dataset, while the median is the middle value when sorted, robust against outliers.

mean_median.py
#!/usr/bin/python

from scipy import stats
import numpy as np

sales = [1200, 1500, 1300, 1700, 5000]  # Monthly sales in USD
mean = np.mean(sales)
median = np.median(sales)

print(f"Mean Sales: ${mean:.2f}")
print(f"Median Sales: ${median:.2f}")

We import scipy.stats and numpy for efficient array operations. The sales list represents monthly revenue. We use np.mean and np.median for calculations.

The mean ($2,340) is skewed by the outlier ($5,000), while the median ($1,500) better reflects typical sales, showing scipy.stats integration with NumPy.

$ ./mean_median.py
Mean Sales: $2340.00
Median Sales: $1500.00

SciPy Stats Mode

The mode identifies the most frequent value in a dataset, useful for categorical or discrete data analysis.

mode.py
#!/usr/bin/python

from scipy import stats

ratings = [5, 4, 5, 3, 4, 5, 2, 4, 5]  # Customer satisfaction scores
mode_result = stats.mode(ratings)

print(f"Mode: {mode_result.mode[0]}, Count: {mode_result.count[0]}")

The ratings list simulates customer feedback (1-5 scale). stats.mode returns a ModeResult object with mode (most frequent value) and count (its frequency). Here, 5 appears 4 times.

$ ./mode.py
Mode: 5, Count: 4

SciPy Stats Variance and Standard Deviation

The variance measures data spread as the average squared difference from the mean. The standard deviation, its square root, indicates typical deviation in original units.

var_stdev.py
#!/usr/bin/python

from scipy import stats
import numpy as np

temps = [22.5, 23.0, 21.8, 24.1, 22.9]  # Daily temperatures (°C)
variance = np.var(temps, ddof=1)  # Sample variance
stdev = np.std(temps, ddof=1)     # Sample standard deviation

print(f"Variance: {variance:.2f} °C²")
print(f"Standard Deviation: {stdev:.2f} °C")

We use temperature data to compute sample variance and standard deviation (ddof=1 for sample, not population). Variance (0.85 °C²) shows spread in squared units, while standard deviation (0.92 °C) is more interpretable.

$ ./var_stdev.py
Variance: 0.85 °C²
Standard Deviation: 0.92 °C

SciPy Stats Normal Distribution

The normal distribution models continuous data with a bell-shaped curve, defined by mean and standard deviation, common in natural phenomena.

normal_dist.py
#!/usr/bin/python

from scipy import stats

# IQ scores: mean=100, std=15
iq_dist = stats.norm(loc=100, scale=15)
prob_above_120 = 1 - iq_dist.cdf(120)
sample = iq_dist.rvs(size=5)

print(f"P(IQ > 120): {prob_above_120:.3f}")
print(f"Random IQs: {sample}")

We model IQ scores with a normal distribution (loc = mean, scale = std). cdf(120) gives the cumulative probability up to 120, so 1 - cdf is the chance of exceeding 120. rvs generates random samples.

This is useful for simulations or probability assessments in psychology or education.

$ ./normal_dist.py
P(IQ > 120): 0.091
Random IQs: [ 95.2 112.7  88.4 104.1  99.6]  # Values may vary

SciPy Stats T-Test

A t-test compares means to assess if differences are statistically significant, widely used in hypothesis testing.

ttest.py
#!/usr/bin/python

from scipy import stats

group1 = [85, 88, 90, 87, 86]  # Test scores, method A
group2 = [90, 92, 89, 94, 91]  # Test scores, method B
t_stat, p_val = stats.ttest_ind(group1, group2)

print(f"T-Statistic: {t_stat:.2f}")
print(f"P-Value: {p_val:.3f}")

We compare test scores from two teaching methods. ttest_ind performs an independent t-test, returning the t-statistic and p-value. A low p-value (<0.05) suggests a significant difference.

Here, p=0.013 indicates method B likely improves scores, a common analysis in educational research.

$ ./ttest.py
T-Statistic: -2.98
P-Value: 0.013

SciPy Stats Correlation

The correlation coefficient measures the linear relationship between two variables, ranging from -1 (perfect negative) to 1 (perfect positive).

correlation.py
#!/usr/bin/python

from scipy import stats

hours = [2, 3, 4, 5, 6]         # Study hours
scores = [65, 70, 75, 85, 90]  # Exam scores
r, p = stats.pearsonr(hours, scores)

print(f"Pearson Correlation: {r:.2f}")
print(f"P-Value: {p:.3f}")

We test if study hours correlate with exam scores. pearsonr computes Pearson's correlation coefficient and p-value. A high r (0.97) and low p-value (0.006) confirm a strong positive relationship.

$ ./correlation.py
Pearson Correlation: 0.97
P-Value: 0.006

SciPy Stats Kurtosis

The kurtosis measures the "tailedness" of a distribution, indicating whether data has heavy or light tails compared to a normal distribution. Positive kurtosis means heavy tails (more outliers), while negative means light tails.

kurtosis.py
#!/usr/bin/python

from scipy import stats
import numpy as np

# Daily stock returns (%) for a tech company
returns = [0.5, -0.3, 1.2, -2.5, 0.8, 3.1, -1.8, 0.2, 4.0, -3.2]
kurt = stats.kurtosis(returns)

print(f"Kurtosis of Returns: {kurt:.2f}")

We analyze daily stock returns, a common financial dataset. The returns list simulates percentage changes in stock price over 10 days. stats.kurtosis calculates the excess kurtosis (relative to a normal distribution, where kurtosis = 0).

A kurtosis of 0.73 indicates heavier tails than a normal distribution, suggesting more extreme price swings—useful for risk assessment in finance.

$ ./kurtosis.py
Kurtosis of Returns: 0.73

SciPy Stats Skew

The skew measures the asymmetry of a distribution. Positive skew means a longer right tail (e.g., income data), while negative skew indicates a longer left tail (e.g., time to failure).

skew.py
#!/usr/bin/python

from scipy import stats
import numpy as np

# Annual incomes (thousands USD) in a small town
incomes = [25, 30, 35, 40, 45, 50, 60, 80, 120, 200]
skewness = stats.skew(incomes)

print(f"Skewness of Incomes: {skewness:.2f}")

The incomes list represents annual earnings, typical of economic data with a few high earners. stats.skew computes the skewness. A positive value (1.46) confirms a right-skewed distribution, common in income studies.

This helps economists understand wealth distribution and identify inequality trends in the population.

$ ./skew.py
Skewness of Incomes: 1.46

SciPy Stats Find Repeats

The find_repeats function identifies repeated values in an array and their counts, useful for spotting patterns or anomalies in discrete data.

find_repeats.py
#!/usr/bin/python

from scipy import stats
import numpy as np

# Customer purchase counts in a week
purchases = [1, 2, 3, 2, 4, 1, 2, 5, 1, 3]
repeats = stats.find_repeats(purchases)

print(f"Repeated Values: {repeats.values}")
print(f"Counts: {repeats.counts}")

The purchases array tracks how many items each customer bought. stats.find_repeats returns a FindRepeatsResult object with values (repeated numbers) and counts (how often they repeat).

Here, 1 and 2 appear multiple times (3 and 3), indicating frequent small purchases. This is valuable for retail analysis to optimize inventory or promotions.

$ ./find_repeats.py
Repeated Values: [1. 2. 3.]
Counts: [3 3 2]

SciPy Stats Describe

The describe function provides a summary of a dataset, including count, mean, variance, skew, kurtosis, and min/max, offering a quick overview.

describe.py
#!/usr/bin/python

from scipy import stats
import numpy as np

# Patient wait times (minutes) in a clinic
wait_times = [10, 15, 12, 20, 25, 18, 30, 22, 35, 40]
summary = stats.describe(wait_times)

print(f"Number of Observations: {summary.nobs}")
print(f"Mean: {summary.mean:.2f} minutes")
print(f"Variance: {summary.variance:.2f} min²")
print(f"Skewness: {summary.skewness:.2f}")
print(f"Kurtosis: {summary.kurtosis:.2f}")
print(f"Min: {summary.minmax[0]}, Max: {summary.minmax[1]}")

The wait_times list simulates patient wait times in a healthcare setting. stats.describe returns a DescribeResult object with key statistics, accessed via attributes like nobs (count) and mean.

The summary shows a mean of 22.7 minutes, slight positive skew (0.47), and negative kurtosis (-0.73), suggesting a flatter distribution. This helps clinics assess service efficiency and patient experience.

$ ./describe.py
Number of Observations: 10
Mean: 22.70 minutes
Variance: 90.46 min²
Skewness: 0.47
Kurtosis: -0.73
Min: 10, Max: 40

SciPy Stats Linear Regression

Linear regression models the relationship between a dependent variable and one or more independent variables, predicting trends.

regression.py
#!/usr/bin/python

from scipy import stats

ad_spend = [100, 200, 300, 400, 500]  # Ad budget ($)
sales = [1200, 1500, 1800, 2100, 2400] # Sales ($)
slope, intercept, r, p, se = stats.linregress(ad_spend, sales)

print(f"Slope: {slope:.2f}, Intercept: {intercept:.2f}")
print(f"R²: {r**2:.3f}, P-Value: {p:.3f}")

We model sales against ad spending. linregress returns slope, intercept, correlation coefficient (r), p-value, and standard error. The slope (2.4) suggests $1 in ads yields $2.40 in sales.

R² (1.0) indicates a perfect fit, though real data would show more variation. This is useful for marketing analysis.

$ ./regression.py
Slope: 2.40, Intercept: 960.00
R²: 1.000, P-Value: 0.000

Best Practices

Source

SciPy stats documentation

This tutorial showcased scipy.stats for advanced statistical analysis, from distributions to regression, with practical examples.

Author

My name is Jan Bodnar, and I am a passionate programmer with extensive programming experience. I have been writing programming articles since 2007. To date, I have authored over 1,400 articles and 8 e-books. I possess more than ten years of experience in teaching programming.

List all Python tutorials.