
3. Prior information and Bayesian inverse problems

How do we supply prior information about a solution to an inverse problem, in order to make it well defined and stable with respect to data errors? The key idea is that we must identify important properties of the desired solution, and then impose these properties when solving the problem.

One way to impose such prior information is to use a method called regularization [H10, N98], where we minimize a linear combination of a data-fitting term and a regularization term that penalizes solutions that do not adhere to the prior information. As an example we mention Tikhonov regularization, which some readers may already be familiar with, and which takes the form

$$
\min_x \; \| A\, x - b \|_2^2 + \lambda^2 \, \| x \|_2^2 \ .
$$

Here, $\lambda$ is a parameter that balances the weight given to the two objectives of fitting the data and penalizing a large norm. Assuming again iid Gaussian noise in the data vector $b$, the regularized solution follows a Gaussian distribution with covariance matrix $\Sigma = \sigma^2 \bigl( A^TA + \lambda^2 I\bigr)^{-1}$. The flexibility of regularization methods lies in the fact that we can often find regularization terms (such as $\| x \|_2^2$) that correspond to our prior information.
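To make this concrete, here is a minimal NumPy sketch of Tikhonov regularization for a small randomly generated test problem. The matrix size, noise level `sigma`, and regularization parameter `lam` are illustrative choices only, and this is plain NumPy rather than the CUQIpy interface discussed later.

```python
import numpy as np

# Small illustrative test problem; A, x_true, sigma and lam are arbitrary choices.
rng = np.random.default_rng(0)
m, n = 20, 10
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
sigma = 0.1                                   # iid Gaussian noise level
b = A @ x_true + sigma * rng.standard_normal(m)
lam = 0.5                                     # regularization parameter lambda

# Tikhonov solution: minimize ||A x - b||_2^2 + lam^2 ||x||_2^2,
# which amounts to solving (A^T A + lam^2 I) x = A^T b.
M = A.T @ A + lam**2 * np.eye(n)
x_tik = np.linalg.solve(M, A.T @ b)

# Covariance matrix of the regularized solution, as stated above:
# Sigma = sigma^2 (A^T A + lam^2 I)^{-1}.
Sigma = sigma**2 * np.linalg.inv(M)
print(x_tik[:3])
print(np.sqrt(np.diag(Sigma))[:3])   # standard deviations of the first components
```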

A different approach to solving inverse problems - which is the one that underlies uncertainty quantification and CUQIpy - is to formulate the inverse problem in a statistical framework in the form of a Bayesian inverse problem. Here, both the data $b$ and the solution $x$ are considered to be random variables. Unlike traditional regularization methods that seek a single solution, the Bayesian approach treats the inverse problem as a statistical question, producing a full probability distribution for the solution $x$. This is called the posterior distribution and it accounts for observational noise, uses prior information about the solution, and provides a complete quantification of uncertainty in the solution. See, e.g., [CS] for details.

To understand Bayesian inverse problems and the way they are formulated and used in CUQIpy, we need to introduce four important statistical tools, so that they are readily available in the rest of the book. See, e.g., [CS, Chapter 1] for more details. For simplicity we consider the linear forward model $b = A\, x$, where $A$ is a matrix, $b$ is the measured data, and $x$ is the solution.

Data distribution $\pi(b | x)$. This is a statistical distribution that specifies how likely it is to observe the data $b$ given the parameter or parameter vector $x$. A key point is that we assume the noisy data can be explained by a forward model (here $b = A\, x$), and therefore the data distribution is conditioned on the variable $x$. The second key point is that we assume a statistical description of the noise, and this assumption informs the specific expression for $\pi(b | x)$.
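As a small illustration, the sketch below draws one realization of noisy data from the data distribution $\pi(b | x)$ for a fixed $x$, under the assumption of additive iid Gaussian noise, so that $b | x \sim \mathcal{N}(A\,x, \sigma^2 I)$. The problem sizes and the value of `sigma` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative linear forward model with additive iid Gaussian noise, so that
# b | x ~ N(A x, sigma^2 I): the data distribution is centered at the model
# prediction A x and its spread is set by the noise level sigma.
m, n = 30, 8
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)                    # a fixed candidate solution
sigma = 0.05                                  # assumed noise standard deviation

b_sample = A @ x + sigma * rng.standard_normal(m)   # one draw from pi(b | x)
```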

Likelihood function $L(x | b=b_{\mathrm{obs}})$. We obtain the likelihood function from the data distribution by fixing the variable $b$ to be the observed data $b_{\mathrm{obs}}$. Hence, the likelihood function quantifies the probability of observing the noisy data $b_{\mathrm{obs}}$ for each possible value of $x$, according to the predictions made by the forward model. As an example, for our linear forward model and with iid Gaussian noise we have

$$
L(x | b=b_{\mathrm{obs}}) \propto \exp\biggl( - \frac{\|A\, x-b_{\mathrm{obs}}\|_2^2}{2\,\sigma^2} \biggr)\ .
$$
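In code, the logarithm of this likelihood is, up to an additive constant, just a scaled residual norm. The helper below is a hypothetical sketch in plain NumPy, not a CUQIpy function; evaluating it at different candidate $x$ shows which solutions explain the observed data well.

```python
import numpy as np

def log_likelihood(x, A, b_obs, sigma):
    """Gaussian log-likelihood log L(x | b = b_obs), up to an additive constant,
    for the linear model b = A x with iid Gaussian noise of standard deviation sigma."""
    r = A @ x - b_obs
    return -0.5 * np.dot(r, r) / sigma**2
```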

Prior $\pi(x)$. This is a distribution function that expresses our knowledge, or belief, about the solution $x$ to the inverse problem. In Bayesian inverse problems it is used to express a preference for solutions that adhere to our prior knowledge, and it plays the same role as the regularization term in classical regularization methods. The prior may be determined from past information or previous experiments, or it can express a subjective assessment of a domain expert [Rob, Chapter 3].
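For instance, a zero-mean iid Gaussian prior corresponds to the $\| x \|_2^2$ regularization term mentioned earlier. A minimal sketch of its log-density follows; the prior standard deviation `delta` is an assumed, illustrative parameter.

```python
import numpy as np

def log_prior(x, delta):
    """Log-density, up to an additive constant, of a zero-mean iid Gaussian
    prior x ~ N(0, delta^2 I); one common and simple prior choice."""
    return -0.5 * np.dot(x, x) / delta**2
```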

Posterior $\pi(x | b_{\mathrm{obs}})$. This function expresses the probability of $x$ given the measured data as well as our prior information. Hence, it is conditioned on $b_{\mathrm{obs}}$ and it expresses what is known about the solution after the data has been collected and the prior is imposed. The aim of a Bayesian inverse problem is to characterize and compute this probability function.

We note that some presentations of Bayesian inverse problems skip the introduction of the data distribution $\pi(b | x)$. We include it in CUQIpy and in this book, because we believe this makes it easier and more intuitive to formulate and understand how the forward model and the noise model enter the likelihood function.

Given these important tools, Bayes’ theorem from statistics (see, e.g., [CS, § 1.2.2]) gives the following expression for the posterior:

$$
\pi(x|b_{\mathrm{obs}}) = \frac{L(x | b=b_{\mathrm{obs}}) \, \pi(x)}{\pi(b)} \ .
$$

Here, the denominator $\pi(b)$ (called the marginal probability or marginal distribution) is a constant with respect to $x$; it can always be determined such that the fraction integrates to 1. Throughout, we will therefore write

$$
\pi(x|b_{\mathrm{obs}}) \propto L(x | b=b_{\mathrm{obs}}) \, \pi(x) \ .
$$
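Combining the two sketches above, the unnormalized log-posterior is simply the sum of the log-likelihood and the log-prior; the constant $\log \pi(b)$ can be dropped. Again, this is a plain NumPy illustration under the Gaussian assumptions used above, not the CUQIpy interface.

```python
import numpy as np

def log_posterior(x, A, b_obs, sigma, delta):
    """Unnormalized log-posterior log L(x | b_obs) + log pi(x), dropping the
    constant log pi(b).  Assumes the Gaussian likelihood and Gaussian prior
    sketched above; other noise and prior models change these two terms."""
    r = A @ x - b_obs
    return -0.5 * np.dot(r, r) / sigma**2 - 0.5 * np.dot(x, x) / delta**2
```

Maximizing this function gives the maximum a posteriori (MAP) estimate, which in this Gaussian case coincides with the Tikhonov solution with $\lambda = \sigma/\delta$.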

The important lesson here is that in order to quantify the uncertainties in $x$ in a Bayesian inverse problem, we need to specify the data distribution (i.e., the forward model and the distribution of the noise) as well as the prior for the solution. A key point here is that the prior, when properly chosen, helps take care of the violation of the Hadamard conditions in an ill-posed problem. In the following chapters we have much more to say about all this.

In Bayesian inverse problems and uncertainty quantification the goal is thus to determine the posterior distribution for $x$. We need to know its “shape” and characteristics because it tells us everything we want to know about the random variable $x$. For example, we can produce point estimates of $x$, and we can compute variances, covariances, and higher-order moments.
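For the linear forward model with iid Gaussian noise and a zero-mean Gaussian prior, the posterior is itself Gaussian and its mean and covariance are available in closed form, which makes it easy to illustrate such point estimates and variances. The sketch below uses made-up problem sizes and parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative linear-Gaussian problem (sizes and parameters are arbitrary).
m, n = 30, 8
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
sigma, delta = 0.05, 1.0                        # noise std and prior std
b_obs = A @ x_true + sigma * rng.standard_normal(m)

# Closed-form Gaussian posterior: covariance and mean.
Sigma_post = np.linalg.inv(A.T @ A / sigma**2 + np.eye(n) / delta**2)
mu_post = Sigma_post @ (A.T @ b_obs) / sigma**2   # posterior mean (point estimate)
post_std = np.sqrt(np.diag(Sigma_post))           # componentwise uncertainty

print(mu_post[:3])
print(post_std[:3])
```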