import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
from random import gauss
from random import random
from numpy.linalg import inv
!pip install pyunfold
from pyunfold import iterative_unfold
from pyunfold.callbacks import Logger
import matplotlib.cm as cm
import scipy as sp
Response, Unfolding & Convolution#
Response#
Your measurement isn’t perfect. For example:
there is a finite resolution to the detector. Often this will take the form of a Gaussian (or Gaussian-like) smearing around the true value.
your measurement could be systematically off in scale (e.g. your ruler is 0.5cm too long)
you can have an inefficiency in your detector, either from a finite aperture or some random loss.
These effects all go into the measurement response.
Every measurement suffers from this to some degree. The issue is that experiments with different responses can lead to different measurements.
Mathematically: $$\vec{M} = \bf{R} \vec{T}$$ If two experiments have different responses \(\bf{R_a}\) and \(\bf{R_b}\) but are measuring the same quantity \(\vec{T}\), then \(\vec{M_a}\) and \(\vec{M_b}\) will be different.
\(\bf{R}\) is generally a matrix and it can contain:
inefficiency of the detector
the inherent resolution of the measurement process
other effects
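To make this concrete, here is a toy numerical sketch (the numbers are made up and are not the samples generated below): a three-bin truth vector pushed through a response that smears events into neighboring bins and keeps only 90% of them.
# made-up 3x3 response: each column describes where events from one truth
# bin end up; the bare matrix columns sum to 1, and the overall factor 0.9
# models a 90% efficiency
R = 0.9 * np.array([[0.8, 0.1, 0.0],
                    [0.2, 0.8, 0.2],
                    [0.0, 0.1, 0.8]])
T = np.array([100., 50., 25.])   # toy truth spectrum
M = R @ T                        # measured spectrum, M = R T
print(M)                         # [76.5 58.5 22.5]: smeared and scaled relative to T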
Determining the response#
The main way in which experimenters determine the response of a measurement is through simulations. Here a simulated signal (similar to what is expected for the true values) is input into a model of the detector, and the expected output is produced. This process is a Monte Carlo simulation.
This process can be incredibly detailed and usually requires more computing power than analyzing the data itself. In many parts of physics/medicine, Geant is used. This simulates the passage of particles and radiation through matter based on decades of measurements for specific materials and energies. Other examples are various ray tracing programs for optics. From this, it is possible to determine the response matrix.
Anything requiring a simulation is of course only as good as the simulation itself and all of these tools are imperfect to some degree. Usually experimenters try to do as many checks as possible of the simulation using data itself.
(This process of generating the response from a given model is actually a convolution which we’ll discuss at the end.)
Incorporating the Response into Interpretation#
If \(\bf{R}\) and \(\vec{M}\) are known, then \(\vec{T}\) is constrained. There is a choice as to whether to report \(\bf{R}\) and \(\vec{M}\) or to try to constrain \(\vec{T}\) itself. The advantage of reporting \(\bf{R}\) and \(\vec{M}\) is simplicity. The disadvantage, clearly, is not reporting the actual quantity of interest \(\vec{T}\).
However, getting to \(\vec{T}\) can be a complicated process, as we’ll see below.
The most obvious way to get \(\vec{T}\) is to invert \(\bf{R}\) to get: $$\bf{R^{-1}}\vec{M} = \bf{R^{-1}}\bf{R} \vec{T}$$ $$\bf{R^{-1}}\vec{M} = \vec{T}$$
We’ll consider that as a first step, and then discuss unfolding, which is a more stable procedure to get \(\vec{T}\) from \(\bf{R}\) and \(\vec{M}\).
First we’ll generate samples, num_samples, according to a distribution. This is what nature made for us.
We’re going to have four samples here:
Two are based on reality:
data_true this is nature. It’s what would be measured in a perfect detector. This is unknown to you.
data_reco this is what we actually measure. It’s the dataset you have.
The next two are based on a model:
MC_true this is the model for your simulation. It is based only on your model and not on your detector.
MC_reco this is what your model would give in your detector if it were true.
The goal is to use MC_true and MC_reco to unfold data_reco into data_unfolded, which is your best understanding of data_true.
All of the above are the values themselves; when we put them in histograms, we’ll append “_hist” to the name.
sns.set_context(context='poster')
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['lines.markeredgewidth'] = 2
hrange = 2
nbins = 51
num_samples = int(1e4)
#true_samples = np.random.normal(loc=0.0, scale=1.0, size=num_samples)
MC_true = np.random.exponential(scale=1, size=num_samples)
data_true = np.random.exponential(scale=1, size=num_samples)
bins = np.linspace(0, hrange, nbins)
num_bins = len(bins) - 1
data_true_hist, _ = np.histogram(data_true, bins=bins)
MC_true_hist, _ = np.histogram(MC_true, bins=bins)
fig, ax = plt.subplots()
ax.step(np.arange(num_bins), MC_true_hist, where='mid', lw=3,
alpha=0.7, label='True distribution')
ax.set(xlabel='X bins', ylabel='Counts')
ax.legend()
plt.show()
Now let’s mess up the truth. Instead of a detailed Monte Carlo in this case we will just add noise via a shift in the mean with a Gaussian smearing. This is what we think our detector actually measures given these truth samples.
noise_width = 0.5
noise_shift = 1.0
random_noise = np.random.normal(loc=noise_shift, scale=noise_width, size=num_samples)
MC_reco = MC_true * random_noise
MC_reco_hist, _ = np.histogram(MC_reco, bins=bins)
fig, ax = plt.subplots()
ax.step(np.arange(num_bins), MC_true_hist, where='mid', lw=3,
alpha=0.7, label='MC True')
ax.step(np.arange(num_bins), MC_reco_hist, where='mid', lw=3,
alpha=0.7, label='MC Reco')
ax.set(xlabel='X bins', ylabel='Counts')
ax.legend()
plt.show()
Now, we create the response matrix (response_hist). This is a 2D histogram that contains the correlation between each point in MC_true and MC_reco.
The software requires an efficiency histogram. In our case, we’ll assume perfect efficiency, so it’s just a histogram of ones.
MC_reco_err_hist = np.sqrt(MC_reco_hist)
efficiencies = np.ones_like(MC_reco_hist, dtype=float)
efficiencies
efficiencies_err = np.full_like(efficiencies, 0.0, dtype=float)
efficiencies_err
response_hist, _, _ = np.histogram2d(MC_reco, MC_true, bins=bins)
response_hist_err = np.sqrt(response_hist)
Here we are going to require each column (truth bin) to sum up to the efficiency value (here we’ve just taken all the efficiencies to be unity so all the columns will add up to one). This is equivalent to saying that every truth value will be measured somewhere.
column_sums = response_hist.sum(axis=0)
normalization_factor = efficiencies / column_sums
response = response_hist * normalization_factor
response_err = response_hist_err * normalization_factor
#checking...
response.sum(axis=0)
inv_response_hist = inv(response)
fig, ax = plt.subplots()
im = ax.imshow(response, origin='lower')
cbar = plt.colorbar(im, label=r'$P(E_i|C_{\mu})$')
ax.set(xlabel='MC_true', ylabel='MC_reco',
title='Normalized response matrix')
plt.show()
So now we have the response matrix and its inverse, so we can try: $$\bf{R^{-1}}\vec{M} = \vec{T}$$
Let’s just try this only on the MC_* sample. Can we recover MC_true by this process?
MC_reco_inv = np.matmul(inv_response_hist,MC_reco_hist)
fig, ax = plt.subplots()
ax.step(np.arange(num_bins), MC_true_hist, where='mid', lw=3,
alpha=0.7, label='MC_true')
ax.step(np.arange(num_bins), MC_reco_hist, where='mid', lw=3,
alpha=0.7, label='MC_reco')
ax.step(np.arange(num_bins), MC_reco_inv, where='mid', lw=3,
alpha=0.7, label='from matrix inversion')
ax.set(xlabel='X bins', ylabel='Counts')
ax.legend()
plt.show()
Well, that didn’t work well. The green curve looks vaguely like the original distribution, but it has far too many fluctuations to be useful in any way. Why? Well, let’s take a look at this inverse matrix. What does it look like when we just let the truth equal the observations? It’s easy to go back and turn the noise down to zero; seeing that the inversion then works is a trivial but important check.
(Also, run the process of generating the “observed” points that go into the response matrix multiple times; you will see that the inverse of the response matrix changes noticeably depending on the particular random number set.)
Finally, the matrix inversion method requires the response to be a square matrix; from the physics point of view, the response can be square, but need not be.
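One quick diagnostic (a side check, not part of the procedure above) is the condition number of the normalized response matrix: a large value means that small statistical fluctuations in the measured histogram are hugely amplified by the inversion, which is exactly what we see here.
# condition number of the response; a large value signals an ill-conditioned
# inversion and explains the wild fluctuations in the inverted result
print(np.linalg.cond(response))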
fig, ax = plt.subplots()
im = ax.imshow(inv_response_hist, origin='lower')
cbar = plt.colorbar(im, label='')
ax.set(xlabel='Cause bins', ylabel='Effect bins',
title='Inverse of response matrix')
plt.show()
Unfolding#
So where do we go from here? The basic idea is unfolding. This improves on the matrix inversion by including some additional information. Usually there is a preference for a smooth final distribution. This usually makes sense from a physical point of view but there are some clear dangers here.
We’re going to discuss Iterative Bayesian Unfolding. This is based on Bayes’ Theorem (go back to the beginning when we discussed the basics of probability). This method is documented here. It is used extensively at the LHC, at neutrino detectors, in DM searches…
Let’s start with Bayes’ theorem: $$P(C_i|E) = \frac{P(E|C_i) P(C_i)}{\sum_i P(E|C_i) P(C_i)}$$ where the \(C_i\) are each independent causes which can produce a single effect \(E\). Here the conditional probability of cause \(C_i\) to produce the effect \(E\) is \(P(E|C_i)\) and the initial probability of causes is \(P(C_i)\).
One observes some number of effects \(E\), \(n(E)\), and one wants to know the distribution of causes, \(n(C_i)\), that produces those effects: $$\hat{n}(C_i) = n(E) P(C_i|E)$$ Most measurements are complicated and produce a number of effects, so for each effect \(j\), Bayes’ theorem applies: $$P(C_i|E_j) = \frac{P(E_j|C_i)P_{a}(C_i)}{n(E)}$$ $$P(C_i|E_j) = \frac{P(E_j|C_i)P_{a}(C_i)}{\sum_l P(E_j|C_l)P_a(C_l)}$$
Given a guess at \(P_{a}(C_i)\) (i.e. the prior) and our understanding from simulation of the pdf of each effect bin in terms of the cause bins, \(P(E_j|C_i)\), we can get a constraint on \(P(C_i|E_j)\). Then we have both terms on the rhs of: $$\hat{n}(C_i) = n(E) P(C_i|E)$$
Then we iterate. We calculate \(\hat{n}(C_i)\) based on the initial guess and then use those values as the new \(P_a(C_i)\). We iterate as many times as necessary until \(\hat{n}(C_i)\) is stable from one iteration to the next (we’ll discuss how to make this qualitative statement quantitative below).
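To make the iteration concrete, here is a minimal NumPy sketch of the update rule written directly from the equations above. This is not PyUnfold’s implementation: it ignores efficiencies (which we set to one below anyway), does no uncertainty propagation, and simply runs a fixed number of iterations; the function name and arguments are ours.
def bayes_unfold_sketch(n_E, response, prior, n_iter=4):
    # n_E      : observed counts per effect bin, n(E_j)
    # response : response[j, i] = P(E_j | C_i), columns normalized as above
    # prior    : initial guess for P_a(C_i)
    P_C = prior / prior.sum()
    for _ in range(n_iter):
        # Bayes' theorem: P(C_i|E_j) = P(E_j|C_i) P_a(C_i) / sum_l P(E_j|C_l) P_a(C_l)
        numer = response * P_C[np.newaxis, :]
        denom = numer.sum(axis=1, keepdims=True)
        P_C_given_E = np.zeros_like(numer)
        np.divide(numer, denom, out=P_C_given_E, where=denom > 0)
        # estimated cause counts: n_hat(C_i) = sum_j n(E_j) P(C_i|E_j)
        n_C = P_C_given_E.T @ n_E
        # the new estimate becomes the prior for the next iteration
        P_C = n_C / n_C.sum()
    return n_C

# e.g. unfolding the MC with a flat starting prior:
# bayes_unfold_sketch(MC_reco_hist, response, np.ones(num_bins))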
First, let’s just try this on the MC itself and verify that we can recover MC_true_hist by unfolding MC_reco_hist.
We have everything we need for that, except a criterion for stopping the iteration.
ts_values = [0.0001]
print(ts_values)
unfolded_results = []
closure = []
closure_err = []
for i in range(1):
tmp_results = iterative_unfold(data=MC_reco_hist,
data_err=MC_reco_err_hist,
response=response,
response_err=response_err,
efficiencies=efficiencies,
ts_stopping=ts_values[i],
#ts='rmd',
efficiencies_err=efficiencies_err,
callbacks=[Logger()])
unfolded_results.append(tmp_results)
tmp_closure = tmp_results['unfolded'] / MC_true_hist
tmp_closure_err = tmp_results['sys_err'] / MC_true_hist
closure.append(tmp_closure)
closure_err.append(tmp_closure_err)
print(len(unfolded_results), len(closure_err))
colors = cm.rainbow(np.linspace(0, 1, len(ts_values)))
print(type(colors))
fig, ax = plt.subplots(1,2,figsize=(15, 10))
ax[0].step(np.arange(num_bins), MC_true_hist, where='mid', lw=3,
alpha=0.7, label='True distribution')
ax[0].step(np.arange(num_bins), MC_reco_hist, where='mid', lw=3,
alpha=0.7, label='Observed distribution')
for i in range(len(ts_values)):
ax[0].errorbar(np.arange(num_bins), unfolded_results[i]['unfolded'],
yerr=unfolded_results[i]['sys_err'],
alpha=0.7,
elinewidth=3,
capsize=4,
color=colors[i],
ls='None', marker='.', ms=10,
label='Bayesian unf. dist.')
ax[0].set(xlabel='X bins', ylabel='Counts')
ax[1].errorbar(np.arange(num_bins),closure[i], yerr=closure_err[i], color=colors[i])
ax[1].set(xlabel='X bins', ylabel='unfolded / truth')
plt.legend(loc='lower center')
plt.show()
Priors#
As with anything Bayesian, the selection of the prior is central.
Choosing an appropriate prior affects the time needed to reach the stopping condition. In PyUnfold, the default prior is a uniform distribution. Depending on the shape of the distribution and the knowledge you have about what the truth should be, this might or might not be an optimal choice.
In the example above, the truth distribution falls like an exponential. In many cases one knows that the distribution should be steeply falling like that, and such a prior might be a better place to start. Here we’ll set the prior to an exponential, but not with the same slope as the real distribution.
alt_prior = np.random.exponential(scale=0.9, size=num_samples)
alt_prior_hist, _ = np.histogram(alt_prior, bins=bins, density=True)
alt_prior_hist /= np.sum(alt_prior_hist)
ts_values = [0.1, 0.01, 0.001, 0.0001]
alt_unfolded_results = []
alt_closure = []
alt_closure_err = []
for i in range(4):
tmp_results = iterative_unfold(data=MC_reco_hist,
data_err=MC_reco_err_hist,
response=response,
response_err=response_err,
efficiencies=efficiencies,
ts_stopping=ts_values[i],
prior=alt_prior_hist,
#ts='rmd',
efficiencies_err=efficiencies_err,
callbacks=[Logger()])
alt_unfolded_results.append(tmp_results)
tmp_closure = tmp_results['unfolded'] / MC_true_hist
tmp_closure_err = tmp_results['sys_err'] / MC_true_hist
alt_closure.append(tmp_closure)
alt_closure_err.append(tmp_closure_err)
colors = cm.rainbow(np.linspace(0, 1, len(ts_values)))
print(type(colors))
fig, ax = plt.subplots(1,2,figsize=(15, 10))
ax[0].step(np.arange(num_bins), MC_true_hist, where='mid', lw=3,
alpha=0.7, label='True distribution')
ax[0].step(np.arange(num_bins), MC_reco_hist, where='mid', lw=3,
alpha=0.7, label='Observed distribution')
for i in range(len(ts_values)):
labelwords = 'unfolded, stop: ' + str(ts_values[i])
ax[0].errorbar(np.arange(num_bins), alt_unfolded_results[i]['unfolded'],
yerr=alt_unfolded_results[i]['sys_err'],
alpha=0.5,
elinewidth=3,
capsize=4,
color=colors[i],
ls='None', marker='.', ms=10,
label=labelwords)
ax[0].set(xlabel='X bins', ylabel='Counts')
ax[1].errorbar(np.arange(num_bins),alt_closure[i], yerr=alt_closure_err[i], color=colors[i])
ax[1].set(xlabel='X bins', ylabel='unfolded / truth')
fig.legend()
plt.show()
It does converge faster in that case, which improves the unfolding uncertainties.
Uncertainties#
Statistical uncertainties come from:
uncertainties in the data
limited size of the MC sample

In both cases it is usually the case that bootstrapping the statistical uncertainties is the most robust method of determining them.
Systematic uncertainties largely come from the choice of the prior distribution. The choice of how to handle this depends on how much the unfolding is doing (how different are MC_reco and MC_true?) and what is known about the prior.
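As an aside on the statistical part, here is a hedged sketch of the bootstrap idea, using the MC sample from above; the number of toys and the stopping value are arbitrary choices for illustration.
# bootstrap sketch: Poisson-fluctuate the measured histogram, re-unfold each
# toy, and take the bin-by-bin spread of the toys as the statistical
# uncertainty on the unfolded result
rng = np.random.default_rng()
n_toys = 100
toy_unfolded = []
for _ in range(n_toys):
    toy_hist = rng.poisson(MC_reco_hist)
    toy_err = np.sqrt(np.maximum(toy_hist, 1))
    toy_result = iterative_unfold(data=toy_hist,
                                  data_err=toy_err,
                                  response=response,
                                  response_err=response_err,
                                  efficiencies=efficiencies,
                                  efficiencies_err=efficiencies_err,
                                  ts_stopping=0.01)
    toy_unfolded.append(toy_result['unfolded'])
bootstrap_err = np.std(toy_unfolded, axis=0)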
How to set the stopping condition?#
There is a trade-off. The statistical uncertainties (correlated between the unfolded points) grow with every iteration, but unfolding further does allow more time to reach a stable working point. Usually, one takes a minimum in: $$\sum_{bins} \left(\sigma^2_{stat} + \sigma^2_{conv}\right)$$ where \(\sigma_{stat}\) is the statistical uncertainty from a bin and \(\sigma_{conv}\) is the difference between the unfolded results for \(n\) and \(n-1\) iterations.
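A hedged sketch of that figure of merit, assuming you have the unfolded results from iterations \(n\) and \(n-1\) and the per-bin statistical uncertainties in hand (the variable names are ours):
def stopping_metric(unfolded_n, unfolded_n_minus_1, stat_err_n):
    # sum over bins of sigma_stat^2 + sigma_conv^2, where sigma_conv is the
    # bin-by-bin change between successive iterations
    sigma_conv = unfolded_n - unfolded_n_minus_1
    return np.sum(stat_err_n**2 + sigma_conv**2)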
Alternatively, PyUnfold implements a variety of test statistics that can be used to determine the stopping condition.
One wants to iterate as few times as necessary because over iterating causes large fluctuations/uncertainties.
What we’ve done so far is set up a nice example, but we have been unfolding our MC_reco sample with a response matrix generated from MC_reco and MC_true. Nowhere has anything like a real measurement come in.
Here we’ll go back to the data_true sample we generated above and generate data_reco. This sample was generated independently from MC_true and so this is the situation that one would have with a real measurement scenario. We will also generate an independent noise sample (but with the same parameters as above).
random_noise2 = np.random.normal(loc=noise_shift, scale=noise_width, size=num_samples)
data_reco = data_true * random_noise2
data_reco_hist, _ = np.histogram(data_reco, bins=bins)
data_reco_err_hist = np.sqrt(data_reco_hist)
data_unfolded_results = []
data_closure = []
data_closure_err = []
for i in range(len(ts_values)):
tmp_results = iterative_unfold(data=data_reco_hist,
data_err=data_reco_err_hist,
response=response,
response_err=response_err,
efficiencies=efficiencies,
ts_stopping=ts_values[i],
prior=alt_prior_hist,
#ts='rmd',
efficiencies_err=efficiencies_err,
callbacks=[Logger()])
data_unfolded_results.append(tmp_results)
tmp_closure = tmp_results['unfolded'] / data_true_hist
tmp_closure_err = tmp_results['sys_err'] / data_true_hist
data_closure.append(tmp_closure)
data_closure_err.append(tmp_closure_err)
colors = cm.rainbow(np.linspace(0, 1, len(ts_values)))
print(type(colors))
fig, ax = plt.subplots(1,2,figsize=(15, 10))
ax[0].step(np.arange(num_bins), data_true_hist, where='mid', lw=3,
alpha=0.7, label='data True distribution')
ax[0].step(np.arange(num_bins), data_reco_hist, where='mid', lw=3,
alpha=0.7, label='data reco distribution')
for i in range(len(ts_values)):
labelwords = 'unfolded, stop: ' + str(ts_values[i])
ax[0].errorbar(np.arange(num_bins), data_unfolded_results[i]['unfolded'],
yerr=data_unfolded_results[i]['sys_err'],
alpha=0.7,
elinewidth=3,
capsize=4,
color=colors[i],
ls='None', marker='.', ms=10,
label=labelwords)
ax[0].set(xlabel='X bins', ylabel='Counts')
ax[1].errorbar(np.arange(num_bins),data_closure[i], yerr=data_closure_err[i], color=colors[i])
ax[1].set(xlabel='X bins', ylabel='unfolded / truth')
fig.legend()
plt.show()
More than one dimension#
It’s straightforward to conceptualize a single dimensional unfolding. The same procedure works in multiple dimensions, though it becomes more difficult to visualize and understand the uncertainty on a multi-dimensional unfolding.
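One common bookkeeping approach (an assumption about the workflow, not something prescribed above) is to flatten the multi-dimensional bins into a single index, build the response between the flattened cause and effect bins exactly as in 1D, and reshape at the end. A minimal sketch:
nx, ny = 6, 4                                    # toy 2D binning
truth_2d = np.random.poisson(20, size=(nx, ny))  # toy 2D histogram
truth_flat = truth_2d.ravel()                    # shape (nx*ny,), ready for the 1D machinery
# ... build the (nx*ny) x (nx*ny) response and unfold the flattened data ...
unfolded_flat = truth_flat.astype(float)         # placeholder for the unfolded result
unfolded_2d = unfolded_flat.reshape(nx, ny)      # back to 2D for plotting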
Convolution#
A convolution is an operation that takes place on two functions, \(f\) and \(g\), producing a third function \(f * g\): $$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t-\tau) d\tau$$ Intuitively, this involves flipping one of the functions and sliding it past the other. The value of the convolution at a particular point is then the integral of the overlap of the two functions.
(image from wikipedia)
There are a few properties of convolutions that are useful to know:
\((f * g) = (g * f)\), commutativity
\(f * (g * h) = (f * g) * h\), associativity
\(f * (g + h) = f * g + f * h\), distributivity
\( f * \delta = f\)
Convolutions show up in lots of places, particularly signal processing, responses, physics, etc. This is what is going on in kernel density estimation as well (from earlier in the semester).
Of course there are multiple implementations of convolutions in python (we’ll use the scipy one) but it is easy to lose intuition of what’s going on without working through some simple examples.
Here’s a simple way to think about a discrete convolution. Let’s say that we have \(f = [1, 2, 3, 4]\) and \(g = [1, 2]\). In order to evaluate the convolution, we flip \(g\) and slide it across \(f\). At each step, we take the product of the elements of \(f\) and \(g\) that are lined up; those products are summed together to create an element of the convolution. For example, the first element of the result is \(1 \times 1 = 1\), the next is \(1 \times 2 + 2 \times 1 = 4\), and so on.
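Here is a hand-rolled version of that “flip and slide” picture, just for intuition (it is not how scipy implements it):
def convolve_by_hand(f, g):
    # output length is len(f) + len(g) - 1 for a "full" convolution
    out = []
    for t in range(len(f) + len(g) - 1):
        total = 0
        for tau in range(len(f)):
            # the flipped g contributes g[t - tau] wherever that index exists
            if 0 <= t - tau < len(g):
                total += f[tau] * g[t - tau]
        out.append(total)
    return out

print(convolve_by_hand([1, 2, 3, 4], [1, 2]))   # [1, 4, 7, 10, 8]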
Let’s verify that this is the same as one gets using scipy’s convolution function…
from scipy import signal
sig = [1, 2, 3, 4]
win = [1, 2]
convolution = signal.convolve(sig, win, mode='full')
print(convolution)
Now we can move on to more complicated functions. Here is a \(\delta\) function convolved with a Gaussian.
sig = np.repeat([0],100)
sig = np.append(sig,np.repeat([1],1))
sig = np.append(sig,np.repeat([0],100))
win = signal.windows.gaussian(60,10,)
filtered = signal.convolve(sig, win, mode='same') / sum(win)
fig, (ax_orig, ax_win, ax_filt) = plt.subplots(3, 1, sharex=True)
ax_orig.plot(sig)
ax_orig.set_title('Original pulse')
ax_orig.margins(0, 0.1)
ax_win.plot(win)
ax_win.set_title('Filter impulse response')
ax_win.margins(0, 0.1)
ax_filt.plot(filtered)
ax_filt.set_title('Filtered signal')
ax_filt.margins(0, 0.1)
fig.tight_layout()
Broadening this out to a wider flat distribution…
sig = np.repeat([0],100)
sig = np.append(sig,np.repeat([1],10))
sig = np.append(sig,np.repeat([0],100))
win = signal.windows.gaussian(60,10,)
filtered = signal.convolve(sig, win, mode='same') / sum(win)
fig, (ax_orig, ax_win, ax_filt) = plt.subplots(3, 1, sharex=True)
ax_orig.plot(sig)
ax_orig.set_title('Original pulse')
ax_orig.margins(0, 0.1)
ax_win.plot(win)
ax_win.set_title('Filter impulse response')
ax_win.margins(0, 0.1)
ax_filt.plot(filtered)
ax_filt.set_title('Filtered signal')
ax_filt.margins(0, 0.1)
fig.tight_layout()
sig = np.repeat([0., 1., 0.], 100)
win = signal.windows.gaussian(60,10,)
filtered = signal.convolve(sig, win, mode='same') / sum(win)
fig, (ax_orig, ax_win, ax_filt) = plt.subplots(3, 1, sharex=True)
ax_orig.plot(sig)
ax_orig.set_title('Original pulse')
ax_orig.margins(0, 0.1)
ax_win.plot(win)
ax_win.set_title('Filter impulse response')
ax_win.margins(0, 0.1)
ax_filt.plot(filtered)
ax_filt.set_title('Filtered signal')
ax_filt.margins(0, 0.1)
fig.tight_layout()
fig.show()
It’s worth knowing that a Gaussian convolved with a Gaussian remains a Gaussian (the widths add in quadrature).
import scipy.stats as stats
x = np.linspace(-4, 4, 100)
mean = 0
std_dev = 1
sig = stats.norm.pdf(x, mean, std_dev)
win = signal.windows.gaussian(20,10,)
filtered = signal.convolve(sig, win, mode='same') / sum(win)
fig, (ax_orig, ax_win, ax_filt) = plt.subplots(3, 1, sharex=True)
ax_orig.plot(sig)
ax_orig.set_title('Original pulse')
ax_orig.margins(0, 0.1)
ax_win.plot(win)
ax_win.set_title('Filter impulse response')
ax_win.margins(0, 0.1)
ax_filt.plot(filtered)
ax_filt.set_title('Filtered signal')
ax_filt.margins(0, 0.1)
fig.tight_layout()
fig.show()
We saw that convolving something with sharp edges with a Gaussian smooths out the edges, as we expect. One use for this is in image processing. Here we’ll create a 2D Gaussian and blur an image.
import os.path
import subprocess
def wget_data(url):
    local_path = './'
    subprocess.run(["wget", "-nc", "-P", local_path, url])

def locate_data(name, check_exists=True):
    local_path = './'
    path = os.path.join(local_path, name)
    if check_exists and not os.path.exists(path):
        raise RuntimeError('No such data file: {}'.format(path))
    return path
wget_data('https://raw.githubusercontent.com/illinois-dap/DataAnalysisForPhysicists/refs/heads/main/data/belt.png')
locate_data('belt.png')
img = plt.imread("./belt.png")
plt.figure()
plt.imshow(img)
t = np.linspace(-10, 10, 60)
sigma2 = 20
bump = np.exp(- t**2 /sigma2)
bump /= np.sum(bump)
kernel = bump[:, np.newaxis] * bump[np.newaxis, :]
fig, ax = plt.subplots(1,2,figsize=(10, 5))
ax[0].step(t, bump, where='mid', lw=3,
alpha=0.7)
ax[0].set(xlabel='x', ylabel='amplitude')
ax[1].imshow(kernel)
plt.legend(loc='lower center')
plt.show()
#print(kernel)
img3 = sp.signal.convolve(img, kernel[:, :, np.newaxis], mode="same")
plt.figure()
plt.imshow(img3)