Important Probability Distributions#

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd

We will make use of the stats package within the scipy suite of tools for scientific Python:

import scipy.stats

Overview#

In this section we provide a brief overview of probability distributions, following on from the previous probability theory lecture.

What is a Probability Distribution?#

A probability distribution is a mathematical function that defines the likelihood of different outcomes or values of a random variable. Probability distributions are fundamental in probability theory and statistics and useful for analyzing scientific data and making predictions. We more formally refer to these functions as probability density functions (PDFs) when they are normalized to unity, such that the integral over their domain equals 1.

In the following we present a brief description of probability distributions that are often encountered in practical applications. This choice of PDFs is driven by their specific mathematical properties and/or by their common usage in the modeling of important physical processes; these features are emphasized in the discussion.

Zoology of PDFs in SciPy#

There are many named statistical distributions with useful properties and/or interesting history. For example, in the scipy.stats module we find a large number of 1D continuous random variable distributions:

len(scipy.stats.rv_continuous.__subclasses__())

and a smaller number of 1D discrete distributions:

len(scipy.stats.rv_discrete.__subclasses__())

and multidimensional continuous distributions:

len(scipy.stats._multivariate.multi_rv_generic.__subclasses__())

Some Useful Probability Distributions#

You will likely never need all or even most of these probability distributions in practical application, but it is useful to know about them. Most have special applications that they were created for, but can also be useful as building blocks of an empirical distribution when you just want something that looks like the data.

1D Continuous Distributions#

It is useful to group the large number of 1D continuous distributions according to their general shape. We will use the function below for a quick visual tour of some PDFs in each group:

def pdf_demo(xlo, xhi, **kwargs):
    """Plot PDFs of the named scipy.stats distributions over [xlo, xhi].

    Each keyword maps a distribution name to a ';'-separated list of
    argument strings, e.g. norm='scale=1;scale=2'.
    """
    x = np.linspace(xlo, xhi, 200)
    cmap = sns.color_palette().as_hex()
    for i, name in enumerate(kwargs):
        for j, arglist in enumerate(kwargs[name].split(';')):
            args = eval('dict(' + arglist + ')')
            y = eval('scipy.stats.' + name)(**args).pdf(x)
            plt.plot(x, y, c=cmap[i], ls=('-','--',':')[j], label=name)
    plt.xlim(x[0], x[-1])
    plt.legend(fontsize='large')

First, the centered symmetric peaked distributions, including the ubiquitous Gaussian (here called “norm” for normal) and Lorentzian (here called “cauchy”). Each of these can be re-centered using the loc (location) parameter and broadened using scale:

pdf_demo(-5, +5, laplace='scale=1', norm='scale=1', cauchy='scale=1', logistic='scale=1')

The Gaussian Distribution in 1D and 2D#

The Gaussian distribution is commonly found in many scientific applications (and in a wide range of other cases), for example representing the response function of an observable in an experimental apparatus with finite resolution. It is given by

\[\Large P(x; \mu, \sigma) = \frac{1}{{\sigma \sqrt{2\pi}}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]

where

  • \(\mu\) represents the mean (average) of the distribution.

  • \(\sigma\) represents the standard deviation of the distribution.

  • \(x\) is the random variable.

# Define parameters
mu = 0       # Mean
sigma = 1    # Standard deviation

# Create data points
x = np.linspace(-5, 5, 1000)  # Range of x values

# Calculate the 1D Gaussian function
gaussian = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Plot the Gaussian function
plt.plot(x, gaussian, label='1D Gaussian')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.title('1D Gaussian Distribution')
plt.grid(True)
plt.show()

# Check that the Gaussian is a properly normalized PDF:
# 1000 points span a range of width 10, so the grid spacing is ~10/1000;
# `gaussian` holds the PDF value at each point (a simple Riemann sum).
print(np.sum(gaussian) * 10/1000)
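
A slightly more accurate check uses the trapezoidal rule via scipy.integrate.trapezoid, which handles the endpoints properly:

from scipy.integrate import trapezoid

# trapezoidal-rule estimate of the integral of the PDF over [-5, 5]
print(trapezoid(gaussian, x))  # should be very close to 1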

The Gaussian can be generalized to multiple dimensions.

In 2D, it can be written generally as:

\[\Large P(x_1, x_2; \mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1-\rho^2}} \exp\left(-\frac{1}{2\left( 1-\rho^2 \right)} \left[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 - 2\rho \left( \frac{x_1-\mu_1}{\sigma_1}\right) \left( \frac{x_2 - \mu_2}{\sigma_2} \right) \right] \right)\]

This contains the expected means, widths, and random variables; however, there is now an additional parameter \(\rho\), the correlation coefficient between \(x_1\) and \(x_2\). This will be discussed more next week, but for now, keep in mind that:

  • \(\rho = 1\) means \(x_1\) and \(x_2\) are perfectly correlated,

  • \(\rho = -1\) means \(x_1\) and \(x_2\) are perfectly anti-correlated,

  • \(\rho = 0\) means \(x_1\) and \(x_2\) are independent.

The same pattern holds in higher dimensions, although the notation requires vectors and matrices, and there are additional correlations.
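
For reference, the general \(N\)-dimensional Gaussian can be written compactly in terms of a mean vector \(\vec{\mu}\) and an \(N \times N\) covariance matrix \(\Sigma\), whose off-diagonal elements encode the correlations:

\[\Large P(\vec{x}; \vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})\right)\]

With \(N=2\), \(\Sigma_{11} = \sigma_1^2\), \(\Sigma_{22} = \sigma_2^2\), and \(\Sigma_{12} = \Sigma_{21} = \rho\sigma_1\sigma_2\), this reduces to the 2D formula above.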

# Define parameters
mu_x = 0        # Mean along x-axis
mu_y = 0        # Mean along y-axis
sigma_x = 1     # Standard deviation along x-axis
sigma_y = 1     # Standard deviation along y-axis
rho = 0.5       # correlation coefficient

# Create a grid of x and y values
x = np.linspace(-5, 5, 1000)
y = np.linspace(-5, 5, 1000)
X, Y = np.meshgrid(x, y)

# Calculate the 2D Gaussian function
gaussian = (
    1 / (2 * np.pi * sigma_x * sigma_y * np.sqrt(1-rho*rho))
) * np.exp(
    -1.0/(2*(1-rho*rho)) *((X - mu_x)**2 / (sigma_x**2) + (Y - mu_y)**2 / (sigma_y**2) -
      2*rho*(X - mu_x)*(Y - mu_y)/(sigma_x * sigma_y))
)

# Plot the 2D Gaussian function as a heatmap
plt.imshow(gaussian, cmap='viridis', extent=(-5, 5, -5, 5), origin='lower')
plt.colorbar(label='Probability Density')
plt.xlabel('x')
plt.ylabel('y')
plt.title('2D Gaussian Distribution')
plt.show()

# Check that the 2D Gaussian is a properly normalized PDF:
# 1000 points span a range of width 10 in each of x and y, so each grid
# cell has area (10/1000)**2; `gaussian` holds the PDF value at each point.
print(np.sum(gaussian) * 10/1000 * 10/1000)

Coding up the Gaussian by hand is an interesting exercise, but in general you should use the version available in scipy or some other well validated library.
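
For example, here is a minimal sketch using scipy.stats.norm and scipy.stats.multivariate_normal with the same parameter values as the hand-coded cells above:

# 1D: mu=0, sigma=1, as above
norm_dist = scipy.stats.norm(loc=0, scale=1)
print(norm_dist.pdf(0.0))  # peak density, 1/sqrt(2*pi) ≈ 0.399

# 2D: same sigmas and rho=0.5, expressed as a covariance matrix
cov = [[1.0, 0.5],
       [0.5, 1.0]]  # off-diagonal elements are rho * sigma_x * sigma_y
mvn = scipy.stats.multivariate_normal(mean=[0, 0], cov=cov)
print(mvn.pdf([0, 0]))  # peak density of the correlated 2D Gaussian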

The Breit-Wigner distribution#

The Breit-Wigner distribution is a probability distribution often used in physics to describe resonances or the shape of spectral lines. It is also known by several other names and variations, depending on the context and field of study. The Breit-Wigner is also called Lorentzian by physicists, and in mathematics circles it is often referred to as the Cauchy distribution.

In the context of relativistic kinematics, the Breit-Wigner function provides a good description of a resonant process (for example the invariant mass of decay products from a resonant intermediate state); for a resonance, the parameters \(x_0\) and \(\Gamma\) are referred to as its mass and its natural width, respectively.

\[\Large P(x; \Gamma, x_0) = \frac{1}{\pi}\frac{\Gamma/2}{(x - x_0)^2 + (\Gamma/2)^2}\]

whose parameters are the most probable value \(x_0\) (which specifies the peak of the distribution), and the FWHM \(\Gamma\).

With some representative parameters, it looks like this:

# Define the Breit-Wigner distribution function (a normalized PDF)
def breit_wigner(x, gamma, center):
    # gamma is the full width at half maximum (FWHM); center is x0
    return (1 / np.pi) * (gamma / 2) / ((x - center)**2 + (gamma / 2)**2)

# Parameters for the distribution
gamma = 2.0   # Width parameter (FWHM)
center = 0.0  # Center position (most probable value)

# Generate x values
x = np.linspace(-10, 10, 400)

# Calculate the corresponding y values using the Breit-Wigner function
y = breit_wigner(x, gamma, center)

# Create a plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, label=f'Breit-Wigner (Γ={gamma}, center={center})')
plt.title('Breit-Wigner Distribution')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()

The Breit-Wigner probability distribution has a peculiar feature, a consequence of its long-range tails: the empirical average and empirical RMS are ill-defined (their variances increase with the size of the samples) and cannot be used as estimators of the Breit-Wigner parameters. The truncated mean and interquartile range, which are obtained by removing events in the low and high ends of the sample, are safer estimators of the Breit-Wigner parameters.
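
Here is a small sketch of this pathology: the sample mean of Cauchy draws does not settle down as the sample grows, while a truncated (trimmed) mean remains stable:

gen = np.random.RandomState(seed=123)
for n in (100, 10_000, 1_000_000):
    data = scipy.stats.cauchy(loc=0, scale=1).rvs(size=n, random_state=gen)
    trimmed = scipy.stats.trim_mean(data, 0.25)  # drop 25% in each tail
    print(f'n={n:>9,d}  mean={np.mean(data):+9.2f}  trimmed mean={trimmed:+7.3f}')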


There are also some more specialized asymmetric “bump” distributions that are particularly useful for modeling histograms of reconstructed particle masses (even more specialized is the Cruijff function used here):

pdf_demo(0, 10, crystalball='beta=1,m=3,loc=6;beta=10,m=3,loc=6', argus='chi=0.5,loc=5,scale=2;chi=1.5,loc=5,scale=2')

The Crystal Ball distribution#

The Crystal Ball function is named after the Crystal Ball Collaboration. The Crystal Ball was a hermetic particle detector used initially with the SPEAR particle accelerator at the Stanford Linear Accelerator Center beginning in 1979. It was designed to detect neutral particles and was used to discover the \(\eta_c\) meson.

The Crystal Ball function is a probability density function commonly used to model various lossy processes in high-energy physics. It consists of a Gaussian core portion and a power-law low-end tail, below a certain threshold. The function itself is:

\[\begin{split}\Large f(x; \alpha, n, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}\left\{ \begin{array}{ll} e^{-\frac{(x - \mu)^2}{2\sigma^2}} & \text{if } \frac{x - \mu}{\sigma} > -\alpha \\ A\left(B - \frac{x - \mu}{\sigma}\right)^{-n} & \text{if } \frac{x - \mu}{\sigma} \leq -\alpha \end{array} \right. \end{split}\]

Both the function and its first derivative are continuous; these two requirements fix the constants \(A = \left(\frac{n}{|\alpha|}\right)^n e^{-\alpha^2/2}\) and \(B = \frac{n}{|\alpha|} - |\alpha|\).

Crystal Ball functions were used to model the Higgs boson di-photon signal shape in the \(H\to\gamma\gamma\) decay channel for the 2012 Higgs boson discovery by the ATLAS and CMS Collaborations.


Next, the “one-sided” distributions that are only defined for \(x \ge 0\). Most of these smoothly transition from peaking at the origin to peaking at \(x > 0\) as some parameter is varied. These are useful building blocks for quantities that must be positive, e.g., errors or densities.

pdf_demo(-0.5, 5, gamma='a=1;a=2;a=3', lognorm='s=0.5;s=1.5', chi2='df=2;df=3')

The Gamma distribution#

The Gamma distribution can be written as:

\[\Large f(x; a) = \frac{x^{a-1}e^{-x}}{\Gamma(a)}\]

where \(\Gamma(a)\) is the [gamma *function*](https://en.wikipedia.org/wiki/Gamma_function), not the value of the gamma *distribution* at some value. There are a couple of other ways to write the gamma distribution, but this is the version used in scipy. \(a\) has to be a real positive number.

One reason to discuss the gamma distribution is that other distributions are special cases of it.

The Exponential distribution#

The exponential distribution is a special case of the Gamma distribution with shape parameter \(a=1\) (like the Gamma distribution, it is only supported for \(x \ge 0\)). The exponential distribution is one of the most widely used continuous distributions. It is often used to model the time elapsed between events, such as the decay of unstable nuclei. A common application of this exponential distribution is the description of phenomena occurring independently at a constant rate, such as decay lengths and lifetimes.

It is given by:

\[\begin{split}\Large f(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \geq 0 \\ 0 & \text{for } x < 0 \end{cases} \end{split}\]

where \(\lambda\) is the rate parameter; the mean of the distribution is \(1/\lambda\).

The classic example of the exponential distribution in the physical world is the decay of an unstable particle. In this case, \(\lambda\) is the decay rate, the inverse of the particle's mean lifetime \(\tau = 1/\lambda\).
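
As a sketch, we can generate decay times with scipy.stats.expon; note that scipy parameterizes the exponential by scale \(= 1/\lambda\) (the mean lifetime \(\tau\)) rather than by the rate \(\lambda\):

tau = 2.0  # assumed mean lifetime, so the rate is lambda = 1/tau
decays = scipy.stats.expon(scale=tau).rvs(size=100_000, random_state=42)
print(np.mean(decays))  # should be close to tau = 2.0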

The Chi-square distribution#

The chi-square distribution (\(\chi^2\)) is especially useful for parameter fitting, goodness-of-fit measures, and confidence interval calculations. It is given by:

\[\Large f(x;n) = \frac{1}{{2^{n/2}\Gamma(n/2)}} x^{(n/2)-1} e^{-x/2}\]

where \(x\) is the random variable, \(n\) is the number of degrees of freedom (which is a positive integer), and \(\Gamma\) is the gamma function (again not the gamma distribution).

The chi-square distribution is another special case of the gamma distribution: once we set the scale parameter of the gamma distribution to 2, the gamma distribution with \(a=1\) is identical to the chi-square distribution with \(df = 2\):

pdf_demo(-0.5, 5, gamma='a=1,scale=2', chi2='df=2')

Bounded Distributions#

Another useful group are the “bounded” distributions. These distributions are distinct from the distributions above in that they are defined for a finite range in \(x\).

The simplest one of these distributions is the uniform distribution which is exactly what it sounds like:

\[\begin{split}\Large f(x;a) = \begin{cases} \frac{1}{a} & 0 \leq x \leq a \\ 0 & otherwise \end{cases} \end{split}\]
pdf_demo(-0.1,2.1,uniform='scale=1;scale=2')

Two other interesting bounded distributions are the cosine distribution,

\[\Large f(x) = \frac{1}{2\pi}\left(1+\cos(x)\right)\]

defined on the range \(-\pi \leq x \leq \pi\), and the beta distribution.

pdf_demo(-0.1, 1.1, beta='a=.7,b=.7;a=1.5,b=1.5;a=.9,b=1.5', cosine='loc=0.5,scale=0.159',
         uniform='scale=1')

All of the 1D continuous distributions share the same API (via a common base class), and allow you to perform useful operations including:

  • Evaluate the PDF, log(PDF), CDF or log(CDF).

  • Generate random values from the distribution.

  • Calculate expectation values of moment functions or even arbitrary functions.

The API also allows you to estimate the values of distribution parameters that best describe some data, but we will soon see better approaches to this “regression” problem.
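
As a minimal illustration of this shared API, using a unit Gaussian (these are all standard methods of scipy.stats distributions):

dist = scipy.stats.norm(loc=0, scale=1)
print(dist.pdf(1.0))                      # evaluate the PDF at x=1
print(dist.logpdf(1.0))                   # log(PDF)
print(dist.cdf(1.0))                      # cumulative distribution function
print(dist.rvs(size=3, random_state=42))  # generate random values
print(dist.expect(lambda x: x ** 2))      # expectation value <x^2> = 1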

def rv_continuous_demo(xlo, xhi, dist, seed=123, n=500):
    x = np.linspace(xlo, xhi, 200)
    P = dist.pdf(x)
    plt.plot(x, P, ls='-', label='PDF $P(x)$')

    g = lambda x: (0.7 - 0.3 * (x - xlo) / (xhi - xlo)) * np.percentile(P, 95)
    plt.plot(x, g(x), ':', label='$g(x)$')
    print('<g> =', dist.expect(g))

    gen = np.random.RandomState(seed=seed)
    data = dist.rvs(size=n, random_state=gen)
    plt.hist(data, range=(xlo, xhi), bins=20, histtype='stepfilled',
             alpha=0.25, density=True, label='Random')

    plt.ylim(0, None)
    plt.grid(axis='y')
    plt.legend(loc='upper left', fontsize='large')
    rhs = plt.twinx()
    rhs.set_ylabel('CDF')
    rhs.set_ylim(0., 1.05)
    rhs.grid(False)
    rhs.plot(x, dist.cdf(x), 'k--')
    plt.xlim(xlo, xhi)
rv_continuous_demo(-4, +4, scipy.stats.norm())
rv_continuous_demo(0, 1, scipy.stats.beta(a=1.5,b=0.9))

1D Discrete Distributions#

All the distributions above are defined for some range of the real numbers. However, some distributions are defined only for integers.

There are some essential discrete random variable distributions, where the PDF is replaced with what is called in probability and statistics a probability mass function (PMF):

def pmf_demo(klo, khi, **kwargs):
    """Plot PMFs of the named scipy.stats discrete distributions over [klo, khi].

    Each keyword maps a distribution name to a ';'-separated list of
    argument strings, e.g. binom='n=10,p=0.2;n=10,p=0.5'.
    """
    k = np.arange(klo, khi + 1)
    cmap = sns.color_palette().as_hex()
    histopts = {'bins': khi-klo+1, 'range': (klo-0.5,khi+0.5), 'histtype': 'step', 'lw': 2}
    for i, name in enumerate(kwargs):
        for j, arglist in enumerate(kwargs[name].split(';')):
            args = eval('dict(' + arglist + ')')
            y = eval('scipy.stats.' + name)(**args).pmf(k)
            plt.hist(k, weights=y, color=cmap[i], ls=('-','--',':')[j], label=name, **histopts)
    plt.legend(fontsize='large')

The PMF is also known as the discrete probability density function (dPDF). (Again, the terminology in this area is non-uniform and generally confusing. Be careful.)

The Binomial Distribution#

Consider a scenario with two possible outcomes: “success” or “failure”, with a fixed probability \(p\) of “success” being realized (this is also called a Bernoulli trial). If \(n\) trials are performed, \(0 \le k \le n\) of them may result in “success”; it is assumed that the sequence of trials is irrelevant, and only the number of successes \(k\) is of interest. The integer number \(k\) follows the so-called binomial distribution \(P(k; n, p)\):

\[\Large P(k; n,p) = \binom{n}{k} p^k (1-p)^{n-k}\]

where \(\binom{n}{k}\) is the binomial coefficient, which can be calculated as

\[\Large \binom{n}{k} = \frac{n!}{k!(n-k)!}\]

The binomial distribution is related to the beta distribution previously shown, and describes the statistics of ratios of counting measurements (efficiency, purity, completeness, …):

pmf_demo(0, 10, binom='n=10,p=0.2;n=10,p=0.5;n=10,p=0.8')
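
As a quick cross-check of the formula above against scipy (scipy.special.comb evaluates the binomial coefficient):

from scipy.special import comb

n, p, k = 10, 0.2, 3
manual = comb(n, k) * p**k * (1 - p)**(n - k)
print(manual, scipy.stats.binom(n=n, p=p).pmf(k))  # should agree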

The Poisson Distribution#

In the \(n \to \infty\), \(p \to 0\) limit (with \(\lambda = np\) finite and non-zero) of the Binomial distribution, the random variable \(k\) follows the Poisson distribution \(P(k; \lambda)\),

\[\Large P(k; \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}\]

where \(\lambda\) is the rate parameter: the average number of events occurring in the counting interval.

The Poisson distribution is sometimes called the “law of rare events” in view of the \(p \to 0\) limit. It is a useful model for describing the statistics of event-counting rates in (uncorrelated) counting measurements, which are ubiquitous in astronomy and particle physics.

pmf_demo(0, 10, poisson='mu=2;mu=3;mu=4')

The Poisson distribution is used to model the number of events occurring in a fixed interval. In comparison to the previously described Exponential and Gamma distributions: the Exponential distribution is used to predict the wait time until the very first event, and the Gamma distribution is used to predict the wait time until the \(\alpha\)-th event.
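
A small simulation sketch of this relationship: for unit-rate events, the wait until the 3rd event is the sum of 3 independent exponential waits, which follows a gamma distribution with \(a=3\):

gen = np.random.RandomState(seed=1)
waits = scipy.stats.expon().rvs(size=(100_000, 3), random_state=gen)
third_event = waits.sum(axis=1)  # wait time until the 3rd event
print(np.mean(third_event), scipy.stats.gamma(a=3).mean())  # both ≈ 3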


The Gaussian Approximation of the Poisson & Binomial Distributions#

The Gaussian is also encountered as the limiting distribution for the Binomial and Poisson distributions, in the large \(n\) and large \(\lambda\) limits, respectively:

\[\Large P_\text{binomial} (k; n \to \infty, p) \to P_\text{Gauss} (k; \mu = np, \sigma = \sqrt{np(1-p)})\]
\[\Large P_\text{Poisson} (k; \lambda \to \infty) \to P_\text{Gauss} (k; \mu = \lambda, \sigma = \sqrt{\lambda})\]

Note that, when using a Gaussian as an approximation to a discrete distribution, an appropriate continuity correction needs to be taken into account; note also that the range of the Gaussian extends to negative values, while the Binomial and Poisson distributions are only defined for non-negative counts.
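
A sketch comparing the Poisson PMF at large \(\lambda\) with the corresponding Gaussian PDF evaluated at the integers:

lam = 100
k = np.arange(60, 141)
poisson_pmf = scipy.stats.poisson(mu=lam).pmf(k)
gauss_pdf = scipy.stats.norm(loc=lam, scale=np.sqrt(lam)).pdf(k)
print(np.max(np.abs(poisson_pmf - gauss_pdf)))  # small at large lambda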


Statistical Toolboxes#

For a more powerful statistical toolbox than scipy.stats, try the python statsmodels package. Going beyond python, the R language is designed specifically for “statistical computing” (and is integrated in jupyter), and RooFit is “a toolkit for modeling the expected distribution of events in a physics analysis” integrated into the ROOT framework.


Acknowledgments#

  • Initial version: Mark Neubauer; revisions, Anne Sickles

© Copyright 2024