Central Limit Theorem
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import scipy.stats
The Central Limit Theorem (CLT) is a fundamental result in statistics that describes the sampling distribution of the sample mean (or sum) of a random sample from a population, largely regardless of the population's underlying distribution.
The Central Limit Theorem can be stated as follows:
When independent and identically distributed random variables with finite variance are sampled from a population, the distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the original population distribution.
Another slightly less precise but more concise statement of the CLT is:
The mean (or sum) of samples drawn from any distribution tends toward a Gaussian distribution.
A few key points about the CLT:
Independence: The random variables in the sample must be independent, meaning that the outcome of one observation does not affect the outcome of another.
Identical Distribution: Each random variable in the sample must be drawn from the same probability distribution.
Sample Size: As the sample size increases, the sampling distribution of the sample mean gets closer to a normal distribution. Its mean equals the population mean, and its standard deviation (the standard error) equals the population standard deviation divided by the square root of the sample size.
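The three points above can be written compactly. The notation here is introduced for illustration and does not appear in the original text: X_1, ..., X_n are the independent, identically distributed draws, μ and σ are the population mean and standard deviation (σ assumed finite), and n is the sample size. In this notation the standard (Lindeberg-Lévy) form of the theorem reads:

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i ,
\qquad
\frac{\sqrt{n}\,\left(\bar{X}_n - \mu\right)}{\sigma}
\;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

Informally, for large n the sample mean is approximately distributed as a Gaussian with mean μ and standard deviation σ/√n.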
The CLT is a crucial theorem in probability theory and statistics and has wide-ranging applications in data analysis. It allows us to make statistical inferences about a population based on the distribution of sample means, even when we don’t know the exact nature of the population distribution. It provides the theoretical foundation for hypothesis testing, confidence intervals, and many other statistical techniques. It is also often used when dealing with large datasets, common in particle physics and astrophysics, as it allows us to assume that the distribution of sample means is approximately normal, simplifying many statistical analyses.
The following exercise provides a simple numerical demonstration of the CLT:
EXERCISE: Fill in the function below to generate nsample samples of size size from an arbitrary 1D continuous distribution, and then display an sns.histplot (the replacement for the deprecated sns.distplot) of the mean values of each generated sample:
def central_limit_demo(dist, nsample, size, seed=123):
    gen = np.random.RandomState(seed=seed)
    ...
Test your function using:
central_limit_demo(scipy.stats.uniform(scale=1), nsample=100, size=100)
central_limit_demo(scipy.stats.lognorm(s=0.5), nsample=100, size=100)
def central_limit_demo(dist, nsample, size, seed=123, verbose=False):
    # Seeded generator so that repeated calls produce the same samples.
    gen = np.random.RandomState(seed=seed)
    means = []
    full_data = []
    for i in range(nsample):
        # Draw one sample of `size` values and record its mean.
        data = dist.rvs(size=size, random_state=gen)
        means.append(np.mean(data))
        full_data.append(data)
        if verbose:
            print(data)
    if verbose:
        print(means)
    fig, axs = plt.subplots(ncols=2)
    # Left: one histogram per sample. Right: histogram of the nsample means.
    sns.histplot(full_data, ax=axs[0], legend=False)
    sns.histplot(means, kde=True, stat="density", kde_kws=dict(cut=3), ax=axs[1])
This code draws a sample of size values from the distribution and repeats this nsample times. Each individual sample should resemble the distribution from which it is drawn (a uniform distribution in the case below). The left plot shows a histogram of each of these samples.
Then we take the mean of each of these nsample samples of size size, which gives nsample values of the mean. These values are plotted in the right plot.
For the uniform distribution over the range 0-1, we expect the mean of the sample to be 0.5. The larger size is, the closer we expect each of the means to be to 0.5.
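This expectation can be checked numerically without plotting. The sketch below is not part of the original exercise: the helper name spread_of_means and the choice of nsample=5000 are assumptions made for illustration. It relies on the known population standard deviation of Uniform(0, 1), which is sqrt(1/12), and compares the spread of the sample means with sigma/sqrt(size).

```python
import numpy as np


def spread_of_means(size, nsample=5000, seed=123):
    """Return the mean and standard deviation of nsample sample means,
    each computed from `size` draws from Uniform(0, 1).

    Hypothetical helper, written only for this illustration.
    """
    gen = np.random.RandomState(seed=seed)
    # One row per sample, one column per draw within the sample.
    data = gen.uniform(0.0, 1.0, size=(nsample, size))
    means = data.mean(axis=1)
    return means.mean(), means.std(ddof=1)


sigma = np.sqrt(1.0 / 12.0)  # population standard deviation of Uniform(0, 1)
for size in (5, 50, 500):
    center, spread = spread_of_means(size)
    print(f"size={size:4d}  mean of means={center:.4f}  "
          f"std of means={spread:.4f}  sigma/sqrt(size)={sigma / np.sqrt(size):.4f}")
```

The last two columns should agree closely, and both shrink as size grows, which is why the means cluster more tightly around 0.5 for larger samples.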
central_limit_demo(scipy.stats.uniform(scale=1), nsample=500, size=5, verbose=False)
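The Central Limit Theorem tells us that as the sample size (size) becomes large, the distribution of the means approaches a Gaussian, regardless of the shape of the distribution drawn from. Increasing the number of samples (nsample) does not change that shape; it only fills in the histogram of the means more smoothly.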
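One rough way to see this convergence in numbers is to track the skewness of the sample means, which is zero for a Gaussian. The sketch below uses the skewed lognormal distribution from the exercise. The helper name skewness_of_means and the values of nsample and size are assumptions chosen for illustration, and skewness is only one of several possible measures of non-Gaussianity.

```python
import numpy as np
import scipy.stats


def skewness_of_means(dist, nsample, size, seed=123):
    """Sample skewness of nsample means, each computed from `size` draws of dist.

    Hypothetical helper, written only for this illustration.
    """
    gen = np.random.RandomState(seed=seed)
    # One row per sample, one column per draw within the sample.
    data = dist.rvs(size=(nsample, size), random_state=gen)
    return scipy.stats.skew(data.mean(axis=1))


dist = scipy.stats.lognorm(s=0.5)
for size in (1, 10, 100):
    print(f"size={size:3d}  skewness of means={skewness_of_means(dist, 20000, size):.3f}")
```

For independent, identically distributed draws the skewness of the mean falls off as 1/sqrt(size), so the printed values should decrease toward zero as size increases.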
As a consequence, any measurement subject to multiple independent sources of fluctuations is likely to follow a distribution that is well approximated by a Gaussian, regardless of the specific details of the processes at play.
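A small simulation can make this concrete. Everything in the sketch below is hypothetical: the "measurement" is modeled as the sum of 20 independent zero-mean noise terms, 10 drawn from a flat (uniform) distribution and 10 from a skewed (shifted exponential) one. The check compares the fraction of measurements within one standard deviation of the mean with the Gaussian value of about 0.683.

```python
import numpy as np


def simulate_measurements(n_measurements=50000, n_sources=10, seed=123):
    """Sum of 2 * n_sources independent, zero-mean, non-Gaussian noise terms.

    Hypothetical noise model, chosen only for this illustration.
    """
    gen = np.random.RandomState(seed=seed)
    # Flat noise on (-1, 1) and skewed noise (exponential shifted to zero mean).
    flat = gen.uniform(-1.0, 1.0, size=(n_measurements, n_sources))
    skewed = gen.exponential(1.0, size=(n_measurements, n_sources)) - 1.0
    return flat.sum(axis=1) + skewed.sum(axis=1)


total = simulate_measurements()
z = (total - total.mean()) / total.std()  # standardized measurements
frac = np.mean(np.abs(z) < 1.0)
print(f"fraction within one standard deviation: {frac:.3f} (Gaussian: about 0.683)")
```

Neither noise source is Gaussian on its own, yet the summed fluctuation should land close to the Gaussian fraction. With fewer sources the agreement is expected to be worse.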
Here is another CLT simulator (similar to what we have here).
Acknowledgments
Initial version: Anne Sickles
© Copyright 2024