Python SciPy Stats
last modified March 8, 2025
This tutorial explores statistical analysis in Python using the scipy.stats module, part of the SciPy library, which is ideal for advanced data science tasks.
The scipy.stats module offers tools for descriptive statistics, probability distributions, and hypothesis testing, far exceeding the basic capabilities of Python's statistics module.
Install SciPy
$ pip install scipy
Install SciPy using pip to access the stats module and its powerful statistical functions.
SciPy Stats Mean and Median
The mean is the average of a dataset, while the median is the middle value when sorted, robust against outliers.
#!/usr/bin/python

from scipy import stats
import numpy as np

sales = [1200, 1500, 1300, 1700, 5000]  # Monthly sales in USD

mean = np.mean(sales)
median = np.median(sales)

print(f"Mean Sales: ${mean:.2f}")
print(f"Median Sales: ${median:.2f}")
We import scipy.stats and numpy for efficient array operations. The sales list represents monthly revenue. We use np.mean and np.median for the calculations.
The mean ($2,140) is pulled upward by the outlier ($5,000), while the median ($1,500) better reflects typical sales, showing how scipy.stats integrates with NumPy.
$ ./mean_median.py
Mean Sales: $2140.00
Median Sales: $1500.00
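When outliers distort the mean, scipy.stats also offers a trimmed mean. This sketch (an addition to the original example) drops the lowest and highest 20% of values before averaging:

```python
#!/usr/bin/python

from scipy import stats

sales = [1200, 1500, 1300, 1700, 5000]  # Same monthly sales in USD

# Trim 20% from each end: with 5 values, this drops the single
# lowest (1200) and highest (5000) observation before averaging
trimmed = stats.trim_mean(sales, proportiontocut=0.2)

print(f"Trimmed Mean Sales: ${trimmed:.2f}")
```

Here the trimmed mean ($1,500.00) coincides with the median, since only the three middle values remain.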
SciPy Stats Mode
The mode identifies the most frequent value in a dataset, useful for categorical or discrete data analysis.
#!/usr/bin/python

from scipy import stats

ratings = [5, 4, 5, 3, 4, 5, 2, 4, 5]  # Customer satisfaction scores

mode_result = stats.mode(ratings)

print(f"Mode: {mode_result.mode}, Count: {mode_result.count}")
The ratings list simulates customer feedback (1-5 scale). stats.mode returns a ModeResult object with mode (the most frequent value) and count (its frequency); since SciPy 1.11 both are scalars, so no indexing is needed. Here, 5 appears 4 times.
$ ./mode.py
Mode: 5, Count: 4
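stats.mode reports only a single value even when several values are tied for most frequent. As an added alternative (not part of the original example), np.unique with return_counts=True reveals every tied mode:

```python
#!/usr/bin/python

import numpy as np

ratings = [5, 4, 5, 3, 4, 5, 2, 4]  # 4 and 5 each appear three times

values, counts = np.unique(ratings, return_counts=True)
modes = values[counts == counts.max()]  # All values tied for the max count

print(f"Tied modes: {modes}")  # both 4 and 5
```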
SciPy Stats Variance and Standard Deviation
The variance measures data spread as the average squared difference from the mean. The standard deviation, its square root, indicates typical deviation in original units.
#!/usr/bin/python

from scipy import stats
import numpy as np

temps = [22.5, 23.0, 21.8, 24.1, 22.9]  # Daily temperatures (°C)

variance = np.var(temps, ddof=1)  # Sample variance
stdev = np.std(temps, ddof=1)     # Sample standard deviation

print(f"Variance: {variance:.2f} °C²")
print(f"Standard Deviation: {stdev:.2f} °C")
We use temperature data to compute the sample variance and standard deviation (ddof=1 for a sample, not a population). The variance (0.70 °C²) shows spread in squared units, while the standard deviation (0.84 °C) is more interpretable.
$ ./var_stdev.py
Variance: 0.70 °C²
Standard Deviation: 0.84 °C
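Closely related to the standard deviation is the standard error of the mean, which scipy.stats computes directly. This short sketch (an addition, reusing the same temperature data) shows stats.sem:

```python
#!/usr/bin/python

from scipy import stats

temps = [22.5, 23.0, 21.8, 24.1, 22.9]  # Daily temperatures (°C)

# Standard error = sample standard deviation / sqrt(n)
sem = stats.sem(temps)

print(f"Standard Error of the Mean: {sem:.2f} °C")
```

The result (about 0.37 °C) estimates how far the sample mean is likely to be from the true mean.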
SciPy Stats Normal Distribution
The normal distribution models continuous data with a bell-shaped curve, defined by mean and standard deviation, common in natural phenomena.
#!/usr/bin/python

from scipy import stats

# IQ scores: mean=100, std=15
iq_dist = stats.norm(loc=100, scale=15)

prob_above_120 = 1 - iq_dist.cdf(120)
sample = iq_dist.rvs(size=5)

print(f"P(IQ > 120): {prob_above_120:.3f}")
print(f"Random IQs: {sample}")
We model IQ scores with a normal distribution (loc = mean, scale = std). cdf(120) gives the cumulative probability up to 120, so 1 - cdf is the chance of exceeding 120. rvs generates random samples. This is useful for simulations or probability assessments in psychology or education.
$ ./normal_dist.py
P(IQ > 120): 0.091
Random IQs: [ 95.2 112.7  88.4 104.1  99.6]  # Values may vary
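The frozen distribution object also supports the inverse CDF (ppf) and the survival function (sf, a numerically stable shorthand for 1 - cdf). This added sketch finds the IQ cutoff for the top 5% of the population:

```python
#!/usr/bin/python

from scipy import stats

iq_dist = stats.norm(loc=100, scale=15)

cutoff = iq_dist.ppf(0.95)        # 95th percentile (inverse CDF)
prob_above_120 = iq_dist.sf(120)  # Same as 1 - cdf(120)

print(f"Top 5% IQ cutoff: {cutoff:.2f}")
print(f"P(IQ > 120): {prob_above_120:.3f}")
```

The cutoff is about 124.67, so roughly one person in twenty scores above it.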
SciPy Stats T-Test
A t-test compares means to assess if differences are statistically significant, widely used in hypothesis testing.
#!/usr/bin/python

from scipy import stats

group1 = [85, 88, 90, 87, 86]  # Test scores, method A
group2 = [90, 92, 89, 94, 91]  # Test scores, method B

t_stat, p_val = stats.ttest_ind(group1, group2)

print(f"T-Statistic: {t_stat:.2f}")
print(f"P-Value: {p_val:.3f}")
We compare test scores from two teaching methods. ttest_ind performs an independent t-test, returning the t-statistic and p-value. A low p-value (<0.05) suggests a significant difference. Here, p=0.011 indicates method B likely improves scores, a common analysis in educational research.
$ ./ttest.py
T-Statistic: -3.29
P-Value: 0.011
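When the same subjects are measured twice (for example, before and after tutoring), a paired t-test is the appropriate variant. This hedged sketch with invented scores uses stats.ttest_rel:

```python
#!/usr/bin/python

from scipy import stats

before = [72, 75, 78, 71, 74]  # Scores before tutoring
after = [75, 79, 80, 74, 78]   # Same students after tutoring

# ttest_rel tests whether the mean per-student difference is zero
t_stat, p_val = stats.ttest_rel(before, after)

print(f"T-Statistic: {t_stat:.2f}")
print(f"P-Value: {p_val:.3f}")
```

Pairing removes between-student variation, so the test is more sensitive than ttest_ind on the same numbers.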
SciPy Stats Correlation
The correlation coefficient measures the linear relationship between two variables, ranging from -1 (perfect negative) to 1 (perfect positive).
#!/usr/bin/python

from scipy import stats

hours = [2, 3, 4, 5, 6]        # Study hours
scores = [65, 70, 75, 85, 90]  # Exam scores

r, p = stats.pearsonr(hours, scores)

print(f"Pearson Correlation: {r:.2f}")
print(f"P-Value: {p:.3f}")
We test if study hours correlate with exam scores. pearsonr computes Pearson's correlation coefficient and p-value. A high r (0.99) and low p-value (0.001) confirm a strong positive relationship.
$ ./correlation.py
Pearson Correlation: 0.99
P-Value: 0.001
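Pearson's r assumes a linear relationship. For data that is monotonic but not necessarily linear, Spearman's rank correlation is a common alternative; this sketch (an addition, with one score pair perturbed) uses stats.spearmanr:

```python
#!/usr/bin/python

from scipy import stats

hours = [2, 3, 4, 5, 6]        # Study hours
scores = [65, 70, 75, 85, 82]  # Exam scores with one rank swap

# spearmanr correlates the ranks of the data, not the raw values
rho, p = stats.spearmanr(hours, scores)

print(f"Spearman Correlation: {rho:.2f}")
print(f"P-Value: {p:.3f}")
```

The single swapped rank lowers rho to 0.90, illustrating that rank correlation reacts to ordering rather than magnitudes.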
SciPy Stats Kurtosis
The kurtosis measures the "tailedness" of a distribution, indicating whether data has heavy or light tails compared to a normal distribution. Positive kurtosis means heavy tails (more outliers), while negative means light tails.
#!/usr/bin/python

from scipy import stats
import numpy as np

# Daily stock returns (%) for a tech company
returns = [0.5, -0.3, 1.2, -2.5, 0.8, 3.1, -1.8, 0.2, 4.0, -3.2]

kurt = stats.kurtosis(returns)

print(f"Kurtosis of Returns: {kurt:.2f}")
We analyze daily stock returns, a common financial dataset. The returns list simulates percentage changes in stock price over 10 days. stats.kurtosis calculates the excess kurtosis (relative to a normal distribution, where kurtosis = 0).
A kurtosis of -0.85 indicates lighter tails than a normal distribution, meaning this sample shows fewer extreme price swings than a normal model would predict, which is useful context for risk assessment in finance.
$ ./kurtosis.py
Kurtosis of Returns: -0.85
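stats.kurtosis reports excess (Fisher) kurtosis by default; passing fisher=False returns Pearson's definition, which is exactly 3 higher. A small added comparison:

```python
#!/usr/bin/python

from scipy import stats

returns = [0.5, -0.3, 1.2, -2.5, 0.8, 3.1, -1.8, 0.2, 4.0, -3.2]

excess = stats.kurtosis(returns)                 # Fisher: normal = 0
pearson = stats.kurtosis(returns, fisher=False)  # Pearson: normal = 3

print(f"Excess (Fisher) Kurtosis: {excess:.2f}")
print(f"Pearson Kurtosis: {pearson:.2f}")
```

Knowing which convention a library uses matters when comparing results across tools.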
SciPy Stats Skew
The skew measures the asymmetry of a distribution. Positive skew means a longer right tail (e.g., income data), while negative skew indicates a longer left tail (e.g., time to failure).
#!/usr/bin/python

from scipy import stats
import numpy as np

# Annual incomes (thousands USD) in a small town
incomes = [25, 30, 35, 40, 45, 50, 60, 80, 120, 200]

skewness = stats.skew(incomes)

print(f"Skewness of Incomes: {skewness:.2f}")
The incomes list represents annual earnings, typical of economic data with a few high earners. stats.skew computes the skewness. A positive value (1.62) confirms a right-skewed distribution, common in income studies.
This helps economists understand wealth distribution and identify inequality trends in the population.
$ ./skew.py
Skewness of Incomes: 1.62
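To check whether the observed skew differs significantly from what a normal distribution would produce, scipy.stats provides skewtest (valid for samples of at least 8 observations). A hedged sketch on the same incomes:

```python
#!/usr/bin/python

from scipy import stats

incomes = [25, 30, 35, 40, 45, 50, 60, 80, 120, 200]

# Null hypothesis: the data come from a normally distributed population
stat, p = stats.skewtest(incomes)

print(f"Skewtest Statistic: {stat:.2f}")
print(f"P-Value: {p:.3f}")
```

A small p-value here would support the visual impression that incomes are genuinely right-skewed rather than skewed by sampling noise.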
SciPy Stats Find Repeats
The find_repeats function identifies repeated values in an array and their counts, useful for spotting patterns or anomalies in discrete data.
#!/usr/bin/python

from scipy import stats
import numpy as np

# Customer purchase counts in a week
purchases = [1, 2, 3, 2, 4, 1, 2, 5, 1, 3]

repeats = stats.find_repeats(purchases)

print(f"Repeated Values: {repeats.values}")
print(f"Counts: {repeats.counts}")
The purchases array tracks how many items each customer bought. stats.find_repeats returns a FindRepeatsResult object with values (the repeated numbers) and counts (how often each repeats).
Here, 1 and 2 appear multiple times (3 and 3), indicating frequent small purchases. This is valuable for retail analysis to optimize inventory or promotions.
$ ./find_repeats.py
Repeated Values: [1. 2. 3.]
Counts: [3 3 2]
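A common NumPy alternative (an addition, not part of the original example) is np.unique with return_counts=True, which also reports values that occur only once and can then be filtered down to the repeats:

```python
#!/usr/bin/python

import numpy as np

purchases = [1, 2, 3, 2, 4, 1, 2, 5, 1, 3]

values, counts = np.unique(purchases, return_counts=True)
repeated = values[counts > 1]  # Keep only values seen more than once

print(f"All Values: {values}")
print(f"All Counts: {counts}")
print(f"Repeated Values: {repeated}")
```

Unlike find_repeats, this keeps the original integer dtype and exposes the singletons (4 and 5) as well.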
SciPy Stats Describe
The describe function provides a summary of a dataset, including count, mean, variance, skew, kurtosis, and min/max, offering a quick overview.
#!/usr/bin/python

from scipy import stats
import numpy as np

# Patient wait times (minutes) in a clinic
wait_times = [10, 15, 12, 20, 25, 18, 30, 22, 35, 40]

summary = stats.describe(wait_times)

print(f"Number of Observations: {summary.nobs}")
print(f"Mean: {summary.mean:.2f} minutes")
print(f"Variance: {summary.variance:.2f} min²")
print(f"Skewness: {summary.skewness:.2f}")
print(f"Kurtosis: {summary.kurtosis:.2f}")
print(f"Min: {summary.minmax[0]}, Max: {summary.minmax[1]}")
The wait_times list simulates patient wait times in a healthcare setting. stats.describe returns a DescribeResult object with key statistics, accessed via attributes like nobs (count) and mean.
The summary shows a mean of 22.7 minutes, slight positive skew (0.44), and negative kurtosis (-0.93), suggesting a flatter distribution than normal. This helps clinics assess service efficiency and patient experience.
$ ./describe.py
Number of Observations: 10
Mean: 22.70 minutes
Variance: 97.12 min²
Skewness: 0.44
Kurtosis: -0.93
Min: 10, Max: 40
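describe accepts a ddof parameter: by default it reports the sample variance (ddof=1), but the population variance can be requested with ddof=0. A brief added comparison on the same data:

```python
#!/usr/bin/python

from scipy import stats

wait_times = [10, 15, 12, 20, 25, 18, 30, 22, 35, 40]

sample = stats.describe(wait_times)              # ddof=1 (default)
population = stats.describe(wait_times, ddof=0)  # population variance

print(f"Sample Variance: {sample.variance:.2f} min²")
print(f"Population Variance: {population.variance:.2f} min²")
```

The population figure is smaller because it divides by n rather than n - 1.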
SciPy Stats Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables, predicting trends.
#!/usr/bin/python

from scipy import stats

ad_spend = [100, 200, 300, 400, 500]      # Ad budget ($)
sales = [1200, 1500, 1800, 2100, 2400]    # Sales ($)

slope, intercept, r, p, se = stats.linregress(ad_spend, sales)

print(f"Slope: {slope:.2f}, Intercept: {intercept:.2f}")
print(f"R²: {r**2:.3f}, P-Value: {p:.3f}")
We model sales against ad spending. linregress returns the slope, intercept, correlation coefficient (r), p-value, and standard error. The slope (3.0) suggests each extra $1 in ads is associated with $3.00 more in sales.
R² (1.0) indicates a perfect fit, though real data would show more variation. This is useful for marketing analysis.
$ ./regression.py
Slope: 3.00, Intercept: 900.00
R²: 1.000, P-Value: 0.000
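The fitted slope and intercept can be used directly for prediction. This added sketch estimates sales for a hypothetical $600 ad budget using the result object's named attributes:

```python
#!/usr/bin/python

from scipy import stats

ad_spend = [100, 200, 300, 400, 500]
sales = [1200, 1500, 1800, 2100, 2400]

result = stats.linregress(ad_spend, sales)

budget = 600  # Hypothetical new ad budget ($)
predicted = result.slope * budget + result.intercept

print(f"Predicted Sales at ${budget}: ${predicted:.2f}")
```

Extrapolating beyond the observed range ($100 to $500) assumes the linear trend continues, which real campaigns may not obey.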
Best Practices
- Use NumPy Arrays: Convert lists to arrays for efficiency.
- Check Assumptions: Verify normality for t-tests.
- Interpret P-Values: Small p (<0.05) suggests significance.
- Document Models: Note distribution parameters used.
This tutorial showcased scipy.stats for advanced statistical analysis, from distributions to regression, with practical examples.