Python SciPy Stats
last modified March 8, 2025
This tutorial explores statistical analysis in Python using the
scipy.stats module, part of the SciPy library, ideal for advanced
data science tasks.
The scipy.stats module offers tools for descriptive statistics,
probability distributions, and hypothesis testing, far exceeding the basic
capabilities of Python's statistics module.
Install SciPy
$ pip install scipy
Install SciPy using pip to access the stats module
and its powerful statistical functions.
SciPy Stats Mean and Median
The mean is the arithmetic average of a dataset. The median is the middle value of the sorted data and is robust against outliers.
#!/usr/bin/python
from scipy import stats
import numpy as np
sales = [1200, 1500, 1300, 1700, 5000] # Monthly sales in USD
mean = np.mean(sales)
median = np.median(sales)
print(f"Mean Sales: ${mean:.2f}")
print(f"Median Sales: ${median:.2f}")
We import numpy for efficient array operations; scipy.stats
is imported as well, although this example only needs NumPy. The sales
list represents monthly revenue. We use np.mean and np.median
for the calculations.
The mean ($2,140) is pulled upward by the outlier ($5,000), while the median ($1,500)
better reflects typical sales. The scipy.stats functions accept the
same lists and NumPy arrays, so the two libraries combine naturally.
$ ./mean_median.py
Mean Sales: $2140.00
Median Sales: $1500.00
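To make the effect of the outlier concrete, the following minimal sketch recomputes both statistics with the largest value removed. The variable names (trimmed, mean_all and so on) are illustrative and not part of the original script.

```python
import numpy as np

sales = np.array([1200, 1500, 1300, 1700, 5000])  # same data as above

# Drop the largest value (the $5,000 outlier) and recompute both statistics
trimmed = np.sort(sales)[:-1]

mean_all, median_all = np.mean(sales), np.median(sales)
mean_trim, median_trim = np.mean(trimmed), np.median(trimmed)

print(f"With outlier:    mean={mean_all:.2f}, median={median_all:.2f}")
print(f"Without outlier: mean={mean_trim:.2f}, median={median_trim:.2f}")
```

Removing one value moves the mean by 715 but the median by only 100, which is what "robust against outliers" means in practice.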
SciPy Stats Mode
The mode identifies the most frequent value in a dataset, useful for categorical or discrete data analysis.
#!/usr/bin/python
from scipy import stats
ratings = [5, 4, 5, 3, 4, 5, 2, 4, 5] # Customer satisfaction scores
mode_result = stats.mode(ratings)
# SciPy 1.11+ returns scalars for 1-D input, so no [0] indexing is needed
print(f"Mode: {mode_result.mode}, Count: {mode_result.count}")
The ratings list simulates customer feedback (1-5 scale).
stats.mode returns a ModeResult object with
mode (most frequent value) and count (its frequency).
Here, 5 appears 4 times.
$ ./mode.py
Mode: 5, Count: 4
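Depending on the SciPy version, the mode and count fields of the result are either scalars or length-1 arrays. The sketch below is one possible way to read them that should work in both cases; the use of np.atleast_1d is our own choice, not something the tutorial prescribes.

```python
import numpy as np
from scipy import stats

ratings = [5, 4, 5, 3, 4, 5, 2, 4, 5]  # same data as above
result = stats.mode(ratings)

# np.atleast_1d turns a scalar into a length-1 array and leaves arrays alone,
# so indexing with [0] works whether SciPy returned a scalar or an array.
mode_value = int(np.atleast_1d(result.mode)[0])
mode_count = int(np.atleast_1d(result.count)[0])

print(f"Mode: {mode_value}, Count: {mode_count}")
```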
SciPy Stats Variance and Standard Deviation
The variance measures data spread as the average squared difference from the mean. The standard deviation, its square root, indicates typical deviation in original units.
#!/usr/bin/python
from scipy import stats
import numpy as np
temps = [22.5, 23.0, 21.8, 24.1, 22.9] # Daily temperatures (°C)
variance = np.var(temps, ddof=1) # Sample variance
stdev = np.std(temps, ddof=1) # Sample standard deviation
print(f"Variance: {variance:.2f} °C²")
print(f"Standard Deviation: {stdev:.2f} °C")
We use temperature data to compute the sample variance and standard deviation
(ddof=1 divides by n - 1 for a sample; the default ddof=0 divides by n for a
population). The variance (0.70 °C²) expresses spread in squared units, while
the standard deviation (0.84 °C) is in the original units and easier to
interpret.
$ ./var_stdev.py
Variance: 0.70 °C²
Standard Deviation: 0.84 °C
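The role of ddof is easier to see when the variance is written out by hand. This is a minimal sketch using the same temperature data; the names squared_dev, pop_var and sample_var are illustrative.

```python
import numpy as np

temps = np.array([22.5, 23.0, 21.8, 24.1, 22.9])  # same data as above
n = temps.size
squared_dev = (temps - temps.mean()) ** 2

pop_var = squared_dev.sum() / n            # population variance (ddof=0)
sample_var = squared_dev.sum() / (n - 1)   # sample variance (ddof=1)

# Both hand-written formulas should agree with NumPy
print(np.isclose(pop_var, np.var(temps)))
print(np.isclose(sample_var, np.var(temps, ddof=1)))
print(f"Population: {pop_var:.4f}, Sample: {sample_var:.4f}")
```

Dividing by n - 1 always gives the larger value; the gap shrinks as the sample grows.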
SciPy Stats Normal Distribution
The normal distribution models continuous data with a bell-shaped curve, defined by mean and standard deviation, common in natural phenomena.
#!/usr/bin/python
from scipy import stats
# IQ scores: mean=100, std=15
iq_dist = stats.norm(loc=100, scale=15)
prob_above_120 = 1 - iq_dist.cdf(120)
sample = iq_dist.rvs(size=5)
print(f"P(IQ > 120): {prob_above_120:.3f}")
print(f"Random IQs: {sample}")
We model IQ scores with a normal distribution (loc = mean,
scale = std). cdf(120) gives the cumulative
probability up to 120, so 1 - cdf is the chance of exceeding 120.
rvs generates random samples.
This is useful for simulations or probability assessments in psychology or education.
$ ./normal_dist.py
P(IQ > 120): 0.091
Random IQs: [ 95.2 112.7 88.4 104.1 99.6] # Values may vary
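Frozen distributions expose a few more methods that fit this example. The sketch below uses sf (the survival function, equal to 1 - cdf), ppf (the inverse of cdf) and a seeded rvs call; the seed value 42 and the 90% cutoff are arbitrary choices made for illustration.

```python
from scipy import stats

iq_dist = stats.norm(loc=100, scale=15)

p_above = iq_dist.sf(120)            # survival function: P(IQ > 120)
top_10_cutoff = iq_dist.ppf(0.90)    # score below which 90% of the population falls

# Seeded draw so the "random" sample is the same on every run
sample = iq_dist.rvs(size=5, random_state=42)

print(f"P(IQ > 120): {p_above:.3f}")
print(f"90th percentile: {top_10_cutoff:.1f}")
print(f"Reproducible sample: {sample.round(1)}")
```

Using sf instead of 1 - cdf tends to be more accurate far out in the tail, where cdf is very close to 1.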
SciPy Stats T-Test
A t-test compares means to assess if differences are statistically significant, widely used in hypothesis testing.
#!/usr/bin/python
from scipy import stats
group1 = [85, 88, 90, 87, 86] # Test scores, method A
group2 = [90, 92, 89, 94, 91] # Test scores, method B
t_stat, p_val = stats.ttest_ind(group1, group2)
print(f"T-Statistic: {t_stat:.2f}")
print(f"P-Value: {p_val:.3f}")
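We compare test scores from two teaching methods. ttest_ind
performs an independent two-sample t-test, returning the t-statistic and the
two-sided p-value. A p-value below the chosen threshold (commonly 0.05)
suggests a statistically significant difference in means.
Here, p≈0.011 is below 0.05, and the negative t-statistic shows that group 1 has the lower mean, so method B's scores are significantly higher in this sample. This is a common analysis in educational research.
$ ./ttest.py
T-Statistic: -3.29
P-Value: 0.011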
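To show what ttest_ind computes with its default settings (equal variances assumed), here is a sketch that rebuilds the pooled-variance t-statistic by hand and compares it with SciPy's result. The helper names (pooled_var, std_err, t_manual) are our own.

```python
import numpy as np
from scipy import stats

group1 = np.array([85, 88, 90, 87, 86])  # method A
group2 = np.array([90, 92, 89, 94, 91])  # method B
n1, n2 = group1.size, group2.size

# Pooled sample variance (assumes both groups share one variance)
pooled_var = ((n1 - 1) * group1.var(ddof=1)
              + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_manual = (group1.mean() - group2.mean()) / std_err
df = n1 + n2 - 2
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(group1, group2)
print(f"Manual: t={t_manual:.3f}, p={p_manual:.4f}")
print(f"SciPy:  t={t_scipy:.3f}, p={p_scipy:.4f}")
```

If the equal-variance assumption is doubtful, passing equal_var=False to ttest_ind switches to Welch's t-test.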
SciPy Stats Correlation
The correlation coefficient measures the linear relationship between two variables, ranging from -1 (perfect negative) to 1 (perfect positive).
#!/usr/bin/python
from scipy import stats
hours = [2, 3, 4, 5, 6] # Study hours
scores = [65, 70, 75, 85, 90] # Exam scores
r, p = stats.pearsonr(hours, scores)
print(f"Pearson Correlation: {r:.2f}")
print(f"P-Value: {p:.3f}")
We test whether study hours correlate with exam scores.
pearsonr computes Pearson's correlation coefficient and a two-sided p-value.
A high r (0.99) and a low p-value (0.001) indicate a strong positive linear
relationship, although five data points are too few for firm conclusions.
$ ./correlation.py
Pearson Correlation: 0.99
P-Value: 0.001
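Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The following sketch writes that formula out with NumPy and checks it against pearsonr; dx, dy and r_manual are illustrative names.

```python
import numpy as np
from scipy import stats

hours = np.array([2, 3, 4, 5, 6])        # study hours
scores = np.array([65, 70, 75, 85, 90])  # exam scores

# Deviations from each mean
dx = hours - hours.mean()
dy = scores - scores.mean()

# r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, _ = stats.pearsonr(hours, scores)
print(f"Manual r: {r_manual:.4f}, SciPy r: {r_scipy:.4f}")
```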
SciPy Stats Kurtosis
The kurtosis measures the "tailedness" of a distribution, indicating whether data has heavy or light tails compared to a normal distribution. Positive kurtosis means heavy tails (more outliers), while negative means light tails.
#!/usr/bin/python
from scipy import stats
import numpy as np
# Daily stock returns (%) for a tech company
returns = [0.5, -0.3, 1.2, -2.5, 0.8, 3.1, -1.8, 0.2, 4.0, -3.2]
kurt = stats.kurtosis(returns)
print(f"Kurtosis of Returns: {kurt:.2f}")
We analyze daily stock returns, a common financial dataset. The
returns list simulates percentage changes in stock price over 10
days. stats.kurtosis calculates the excess kurtosis (relative to
a normal distribution, where kurtosis = 0).
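A kurtosis of -0.85 indicates lighter tails than a normal distribution for this sample, that is, fewer extreme price swings than a normal model would predict. With only 10 observations the estimate is noisy, so treat it as illustrative. On real return series the measure is useful for risk assessment in finance.
$ ./kurtosis.py
Kurtosis of Returns: -0.85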
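stats.kurtosis supports two conventions through its fisher parameter. The sketch below, using the same returns, shows that the Pearson definition is the excess (Fisher) value plus 3.

```python
from scipy import stats

returns = [0.5, -0.3, 1.2, -2.5, 0.8, 3.1, -1.8, 0.2, 4.0, -3.2]

excess = stats.kurtosis(returns)                 # Fisher (default): normal -> 0
pearson = stats.kurtosis(returns, fisher=False)  # Pearson: normal -> 3

print(f"Excess kurtosis:  {excess:.2f}")
print(f"Pearson kurtosis: {pearson:.2f}")
```

When comparing numbers across tools, it is worth checking which of the two conventions each one reports.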
SciPy Stats Skew
The skew measures the asymmetry of a distribution. Positive skew means a longer right tail (e.g., income data), while negative skew indicates a longer left tail (e.g., scores on an easy exam, where most values cluster near the maximum).
#!/usr/bin/python
from scipy import stats
import numpy as np
# Annual incomes (thousands USD) in a small town
incomes = [25, 30, 35, 40, 45, 50, 60, 80, 120, 200]
skewness = stats.skew(incomes)
print(f"Skewness of Incomes: {skewness:.2f}")
The incomes list represents annual earnings, typical of economic
data with a few high earners. stats.skew computes the skewness.
A positive value (1.62) confirms a right-skewed distribution, common in income
studies.
This helps economists understand wealth distribution and identify inequality trends in the population.
$ ./skew.py
Skewness of Incomes: 1.62
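By default stats.skew returns the biased (population) estimate. A short sketch of the bias parameter, which applies a small-sample correction, using the same income data:

```python
from scipy import stats

incomes = [25, 30, 35, 40, 45, 50, 60, 80, 120, 200]

biased = stats.skew(incomes)                # default: bias=True
adjusted = stats.skew(incomes, bias=False)  # corrected for sample size

print(f"Biased skewness:   {biased:.2f}")
print(f"Adjusted skewness: {adjusted:.2f}")
```

For a sample of 10 the correction is noticeable; for large samples the two values converge.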
SciPy Stats Find Repeats
The find_repeats function identifies repeated values in an array and their counts, useful for spotting patterns or anomalies in discrete data.
#!/usr/bin/python
from scipy import stats
import numpy as np
# Customer purchase counts in a week
purchases = [1, 2, 3, 2, 4, 1, 2, 5, 1, 3]
repeats = stats.find_repeats(purchases)
print(f"Repeated Values: {repeats.values}")
print(f"Counts: {repeats.counts}")
The purchases array tracks how many items each customer bought.
stats.find_repeats returns a FindRepeatsResult
object with values (repeated numbers) and counts
(how often they repeat).
Here, 1 and 2 each appear three times and 3 appears twice, indicating frequent small purchases. This is valuable for retail analysis to optimize inventory or promotions.
$ ./find_repeats.py
Repeated Values: [1. 2. 3.]
Counts: [3 3 2]
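The same information can be obtained with NumPy alone. To our understanding, recent SciPy releases deprecate find_repeats in favor of numpy.unique, so the following sketch may be the more future-proof variant; check the release notes of your SciPy version. The names repeated, repeated_values and repeated_counts are illustrative.

```python
import numpy as np

purchases = [1, 2, 3, 2, 4, 1, 2, 5, 1, 3]  # same data as above

# All distinct values with how often each occurs
values, counts = np.unique(purchases, return_counts=True)

# Keep only values that occur more than once
repeated = counts > 1
repeated_values = values[repeated]
repeated_counts = counts[repeated]

print(f"Repeated Values: {repeated_values}")
print(f"Counts: {repeated_counts}")
```

Unlike find_repeats, np.unique keeps the integer dtype of the input.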
SciPy Stats Describe
The describe function provides a summary of a dataset, including count, mean, variance, skew, kurtosis, and min/max, offering a quick overview.
#!/usr/bin/python
from scipy import stats
import numpy as np
# Patient wait times (minutes) in a clinic
wait_times = [10, 15, 12, 20, 25, 18, 30, 22, 35, 40]
summary = stats.describe(wait_times)
print(f"Number of Observations: {summary.nobs}")
print(f"Mean: {summary.mean:.2f} minutes")
print(f"Variance: {summary.variance:.2f} min²")
print(f"Skewness: {summary.skewness:.2f}")
print(f"Kurtosis: {summary.kurtosis:.2f}")
print(f"Min: {summary.minmax[0]}, Max: {summary.minmax[1]}")
The wait_times list simulates patient wait times in a healthcare
setting. stats.describe returns a DescribeResult
object with key statistics, accessed via attributes like nobs
(count) and mean.
The summary shows a mean of 22.7 minutes, a sample variance of 97.12 min², slight positive skew (0.44), and negative kurtosis (-0.93), suggesting a flatter distribution than the normal. This helps clinics assess service efficiency and patient experience.
$ ./describe.py
Number of Observations: 10
Mean: 22.70 minutes
Variance: 97.12 min²
Skewness: 0.44
Kurtosis: -0.93
Min: 10, Max: 40
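One detail worth knowing is that describe reports the sample variance (ddof=1) by default. The sketch below checks this against np.var and shows the population variant through the ddof parameter; pop_summary is an illustrative name.

```python
import numpy as np
from scipy import stats

wait_times = [10, 15, 12, 20, 25, 18, 30, 22, 35, 40]

summary = stats.describe(wait_times)              # ddof=1 by default
pop_summary = stats.describe(wait_times, ddof=0)  # population variance

print(f"Sample variance:     {summary.variance:.2f}")
print(f"Population variance: {pop_summary.variance:.2f}")
print(np.isclose(summary.variance, np.var(wait_times, ddof=1)))
```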
SciPy Stats Linear Regression
Simple linear regression models the relationship between a dependent variable and a single independent variable, which makes it possible to predict trends. The linregress function fits this one-predictor case.
#!/usr/bin/python
from scipy import stats
ad_spend = [100, 200, 300, 400, 500] # Ad budget ($)
sales = [1200, 1500, 1800, 2100, 2400] # Sales ($)
slope, intercept, r, p, se = stats.linregress(ad_spend, sales)
print(f"Slope: {slope:.2f}, Intercept: {intercept:.2f}")
print(f"R²: {r**2:.3f}, P-Value: {p:.3f}")
We model sales against ad spending. linregress returns the slope,
intercept, correlation coefficient (r), p-value, and standard
error of the slope. The slope (3.0) suggests that each additional $1 in ads is associated with $3.00 more in sales.
R² (1.0) indicates a perfect fit, because the sample data lie exactly on a line; real data would show more variation. This is useful for marketing analysis.
$ ./regression.py
Slope: 3.00, Intercept: 900.00
R²: 1.000, P-Value: 0.000
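The fitted line can also be used for prediction. In this sketch the result is kept as a single object and read through its named attributes; the budget of $600 is a hypothetical value chosen for illustration.

```python
from scipy import stats

ad_spend = [100, 200, 300, 400, 500]     # Ad budget ($)
sales = [1200, 1500, 1800, 2100, 2400]   # Sales ($)

result = stats.linregress(ad_spend, sales)

new_budget = 600  # hypothetical budget, outside the observed range
predicted = result.intercept + result.slope * new_budget

print(f"Fitted line: sales = {result.intercept:.2f} + {result.slope:.2f} * ad_spend")
print(f"Predicted sales at ${new_budget}: ${predicted:.2f}")
```

Predictions outside the observed range (extrapolation) should be treated with caution, since the linear trend may not continue.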
Best Practices
- Use NumPy Arrays: Convert lists to arrays for efficiency.
- Check Assumptions: Verify normality for t-tests.
- Interpret P-Values: Small p (<0.05) suggests significance.
- Document Models: Note distribution parameters used.
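As one possible way to follow the "Check Assumptions" advice, the sketch below runs the Shapiro-Wilk test (stats.shapiro) on the two groups from the t-test example. The 0.05 threshold and the wording of the verdict are our own choices, and with only five values per group the test has little power.

```python
import numpy as np
from scipy import stats

group1 = np.asarray([85, 88, 90, 87, 86])  # method A
group2 = np.asarray([90, 92, 89, 94, 91])  # method B
alpha = 0.05  # assumed significance threshold

for name, data in (("method A", group1), ("method B", group2)):
    # Null hypothesis: the data come from a normal distribution
    w_stat, p_norm = stats.shapiro(data)
    verdict = ("no evidence against normality" if p_norm > alpha
               else "normality is doubtful")
    print(f"{name}: W={w_stat:.3f}, p={p_norm:.3f} -> {verdict}")
```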
This tutorial showcased scipy.stats for advanced statistical
analysis, from distributions to regression, with practical examples.