Python Statistics
last modified February 15, 2025
In this article, we show how to use the Python statistics module to perform statistical calculations such as mean, median, variance, and standard deviation.
The statistics module provides functions to calculate mathematical statistics of numeric data. It is part of the Python standard library and is useful for basic statistical operations.
Python statistics mean
The mean is the average of a dataset. It is calculated by summing all the values and dividing by the number of values.
#!/usr/bin/python

import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

mean = statistics.mean(data)
print(f"Mean: {mean}")

$ ./mean.py
Mean: 5.5
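The definition above can be checked by hand. The following is a minimal sketch, not part of the original example, that compares a direct sum-and-divide calculation with statistics.mean and with statistics.fmean, the variant that always returns a float. The variable names are chosen for illustration only.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Mean computed directly from the definition: sum divided by count.
manual_mean = sum(data) / len(data)

# statistics.mean follows the same definition; fmean always returns a float.
mean = statistics.mean(data)
fmean = statistics.fmean(data)

print(f"Manual: {manual_mean}, mean: {mean}, fmean: {fmean}")
```

All three values agree for this dataset.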
Python statistics median
The median is the middle value of a dataset when the values are arranged in order. If the dataset has an even number of values, the median is the average of the two middle values.
#!/usr/bin/python

import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

median = statistics.median(data)
print(f"Median: {median}")

$ ./median.py
Median: 5.5
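When the dataset has an even number of values and the median must be an actual member of the data, the module also offers median_low and median_high. The sketch below is an added illustration that shows the three functions side by side on the same dataset.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

low = statistics.median_low(data)    # smaller of the two middle values
high = statistics.median_high(data)  # larger of the two middle values
mid = statistics.median(data)        # average of the two middle values

print(f"Low: {low}, High: {high}, Median: {mid}")
```

For data with an odd number of values, all three functions return the same middle value.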
Python statistics mode
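The mode is the value that appears most frequently in a dataset. A dataset may have no single mode (if all values are unique) or multiple modes (if several values share the highest frequency). The statistics.mode function always returns one value; since Python 3.8, it returns the first mode encountered when there is a tie.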
#!/usr/bin/python

import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

mode = statistics.mode(data)
print(f"Mode: {mode}")

$ ./mode.py
Mode: 4
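For datasets with several equally frequent values, statistics.multimode (Python 3.8 and later) returns all of them. The following sketch uses small made-up datasets, chosen only for illustration, to contrast it with statistics.mode.

```python
import statistics

tied = [1, 2, 2, 3, 3]   # 2 and 3 both appear twice
unique = [1, 2, 3]       # every value appears once

# mode returns the first mode encountered in the data.
first_mode = statistics.mode(tied)

# multimode returns every value that shares the highest frequency.
all_modes = statistics.multimode(tied)
unique_modes = statistics.multimode(unique)

print(f"mode: {first_mode}")
print(f"multimode (tied): {all_modes}")
print(f"multimode (unique): {unique_modes}")
```

When all values are unique, multimode returns every value, because each one has the same frequency.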
Python statistics variance
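The variance measures how far the values in a dataset spread out from the mean. The statistics.variance function computes the sample variance: the sum of the squared differences from the mean divided by n - 1, where n is the number of values. For the population variance, which divides by n, use statistics.pvariance.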
#!/usr/bin/python

import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

variance = statistics.variance(data)
print(f"Variance: {variance}")

$ ./variance.py
Variance: 9.166666666666666
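The printed value is a sample variance, which divides the sum of squared differences by n - 1 and not by n. The sketch below is an added illustration that computes both the sample and the population variance by hand and compares them with statistics.variance and statistics.pvariance. The variable names are assumptions made for this example.

```python
import math
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(data)
mean = statistics.mean(data)

# Sum of squared differences from the mean.
squared_diffs = sum((x - mean) ** 2 for x in data)

sample_var = squared_diffs / (n - 1)   # divides by n - 1
population_var = squared_diffs / n     # divides by n

print(f"Sample variance: {sample_var}")
print(f"Population variance: {population_var}")

# The hand-computed values match the library functions.
print(math.isclose(sample_var, statistics.variance(data)))
print(math.isclose(population_var, statistics.pvariance(data)))
```

Use the sample variance when the data is a sample drawn from a larger population, and the population variance when the data covers the whole population.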
Python statistics standard deviation
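The standard deviation measures how spread out the values in a dataset are. It is the square root of the variance and is expressed in the same units as the data. The statistics.stdev function returns the sample standard deviation; statistics.pstdev returns the population standard deviation.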
#!/usr/bin/python

import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

stdev = statistics.stdev(data)
print(f"Standard Deviation: {stdev}")

$ ./stdev.py
Standard Deviation: 3.0276503540974917
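The relationship between the standard deviation and the variance can be verified directly. The following sketch, added for illustration, takes the square root of the sample variance and also shows the population counterpart, statistics.pstdev.

```python
import math
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

sample_sd = statistics.stdev(data)        # based on the sample variance
population_sd = statistics.pstdev(data)   # based on the population variance

# Square root of the sample variance, computed separately.
sd_from_variance = math.sqrt(statistics.variance(data))

print(f"Sample: {sample_sd}")
print(f"Population: {population_sd}")
print(math.isclose(sample_sd, sd_from_variance))
```

The population standard deviation is slightly smaller because it divides by n and not by n - 1.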
Python statistics quantiles
Quantiles are cut points that divide a sorted dataset into groups containing equal numbers of values. Common quantiles include quartiles (dividing the data into 4 parts), deciles (10 parts), and percentiles (100 parts). The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the median (50th percentile), and the third quartile (Q3) is the 75th percentile.
The interquartile range (IQR) is the difference between Q3 and Q1 and is used to identify outliers.
#!/usr/bin/python

import statistics

data = [10, 12, 12, 14, 12, 11, 14, 13, 15, 100]

q1, median, q3 = statistics.quantiles(data, n=4, method='exclusive')
print(f"Q1: {q1}, Median: {median}, Q3: {q3}")

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

filtered_data = [e for e in data if e >= lower_bound and e <= upper_bound]
print(f"Filtered Data: {filtered_data}")

q1, median, q3 = statistics.quantiles(data, n=4, method='exclusive')
We calculate the first quartile (Q1), median (Q2), and third quartile (Q3) using the statistics.quantiles function. The n=4 argument specifies quartiles, and the method='exclusive' argument specifies the calculation method.
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
We calculate the interquartile range (IQR) and use it to determine the lower and upper bounds for identifying outliers.
filtered_data = [e for e in data if e >= lower_bound and e <= upper_bound]
We filter the dataset to exclude outliers based on the calculated bounds.
$ ./quantiles.py
Q1: 11.75, Median: 12.5, Q3: 14.25
Filtered Data: [10, 12, 12, 14, 12, 11, 14, 13, 15]
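The method argument changes how the cut points are interpolated. The sketch below is an added illustration, not part of the original example, that runs statistics.quantiles on the same dataset with both supported methods. The 'exclusive' method treats the data as a sample that may not contain the population's extreme values, while 'inclusive' assumes the minimum and maximum are included.

```python
import statistics

data = [10, 12, 12, 14, 12, 11, 14, 13, 15, 100]

# 'exclusive' is the default method.
exclusive = statistics.quantiles(data, n=4, method='exclusive')

# 'inclusive' treats the smallest and largest values as the 0th and 100th percentiles.
inclusive = statistics.quantiles(data, n=4, method='inclusive')

print(f"Exclusive quartiles: {exclusive}")
print(f"Inclusive quartiles: {inclusive}")
```

Both methods agree on the median here but place Q1 and Q3 differently, which in turn changes the outlier bounds derived from the IQR.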
Python statistics correlation
Correlation is a measure that expresses the extent to which two variables are linearly related, meaning they change together at a constant rate. It's a common tool for describing simple relationships without making a statement about cause and effect.
The correlation is quantified by the correlation coefficient, often denoted as r, which measures the strength of the linear relationship between two variables. The correlation coefficient ranges from -1 to +1. A correlation coefficient close to 0 indicates a weak linear relationship. Positive r values indicate a positive correlation, while negative values indicate a negative correlation.
#!/usr/bin/python

import statistics

orbital_period = [88, 225, 365, 687, 4331, 10_756, 30_687, 60_190]  # days
dist_from_sun = [58, 108, 150, 228, 778, 1_400, 2_900, 4_500]  # million km

cor = statistics.correlation(orbital_period, dist_from_sun, method='ranked')
print(f"Correlation Coefficient: {cor}")

orbital_period = [88, 225, 365, 687, 4331, 10_756, 30_687, 60_190]  # days
dist_from_sun = [58, 108, 150, 228, 778, 1_400, 2_900, 4_500]  # million km
We define two datasets: orbital_period (the orbital period of planets in days) and dist_from_sun (the distance of planets from the sun in million kilometers).
cor = statistics.correlation(orbital_period, dist_from_sun, method='ranked')
We calculate the correlation coefficient using the statistics.correlation function. The method='ranked' argument, available since Python 3.12, computes Spearman's rank correlation on the ranks of the data. It measures how well the relationship is described by a monotonic function, which is useful for non-linear relationships or when outliers are present.
print(f"Correlation Coefficient: {cor}")
We print the correlation coefficient. Because ranked data is used, it quantifies the strength and direction of the monotonic relationship between the two variables.
$ ./correlation.py
Correlation Coefficient: 1.0
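The output shows a correlation coefficient of 1.0, indicating a perfect positive monotonic relationship: a planet farther from the sun always has a longer orbital period. Because the ranked method was used, the result does not mean that the relationship is linear.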
Source
Python statistics module documentation
In this article, we have worked with the Python statistics module to perform statistical calculations.