Open access peer-reviewed chapter

# Survey of Computational Methods for Inverse Problems

By Sergey Voronin and Christophe Zaroli

Submitted: August 14th 2017Reviewed: December 21st 2017Published: May 30th 2018

DOI: 10.5772/intechopen.73332

## Abstract

Inverse problems occur in a wide range of scientific applications, such as in the fields of signal processing, medical imaging, or geophysics. This work aims to present to the field practitioners, in an accessible and concise way, several established and newer cutting-edge computational methods used in the field of inverse problems—and when and how these techniques should be employed.

### Keywords

• inverse problems
• matrix factorizations
• regularization
• parameter estimation
• model appraisal
• seismic tomography

## 1. Introduction

In this work, we aim to survey several techniques useful to a practitioner in the field of inverse problems, where the solution to a vector of interest is given through a linear system Ax=bor through a set of nonlinear equations Fx=0. In our presentation below, we review both classical results and newer approaches, which the reader may not be familiar with. In particular, this chapter offers entries on the following material:

• Matrix factorizations and sparse matrices

• Direct solves and pivoted factorizations

• Least squares problems and regularization

• Nonlinear least squares problems

• Low-rank matrix factorizations and randomized algorithms

• An introduction to Backus-Gilbert inversion

## 2. Notation

In this chapter, we use the norm to refer to the spectral or operator norm, and pto refer to the pnorm. We make frequent use of the QR decomposition and the SVD (singular value decomposition). For any M×Nmatrix A and (ordered) subindex sets Jr and Jc, A(Jr, Jc) denotes the submatrix of A obtained by extracting the rows and columns of A indexed by Jr and Jc, respectively; and A:Jcdenotes the submatrix of A obtained by extracting the columns of A indexed by Jc. We will make use of the covariance and the variance matrix, which we define as follows in terms of the expected value:

covxy=ExExyEyT]andvarx=covxx.

By “Diag,” we refer to a diagonal matrix with nonzeros only on the diagonal. We make use of the so-called GIID matrices. These are matrices with independent and identically distributed draws from the Gaussian distribution. In the Octave environment (which we frequently reference), these can be obtained with the “randn” command. We assume the use of real matrices, although most techniques we describe extend to the complex case.

## 3. Matrix factorizations and sparse matrices

Let A be an M×Nmatrix with real or complex entries, and set r=minMN. We will make use of the singular value decomposition (SVD) of a real matrix ARM×N: if A is of rank r, then there exist URm×r, VRN×rand Rr×rsuch that

1. UTU=I, VTV=I,

2. =Diagσ1σ2σrRr×ris a diagonal matrix with σ1σ2σr>0, and

3. A=UΣVT.

This is known as the economic form of the SVD [1]. For 1iminMN, the i-th largest singular value of A is defined to be σi, with σj=0for j=r+1,,minMNwhenever r<minMN. The generalized inverse of ARm×Nwith SVDA=UΣVT, is defined as A+=VΣ1UT(and 1=Diagσ11σ21σr1Rr×r). By a rank deficient matrix, we imply a nonlinear decay of singular values σi. In this case, the numerical rank of A may be smaller than the actual rank due to the use of finite precision arithmetic.

The (compact) QR-factorization of A takes the form

AP=QR,m×nn×nm×rr×nE1

where P is a permutation matrix, Q has orthonormal columns, and R is upper triangular. The permutation matrix P can more efficiently be represented via a vector JcZ+nof indices such that P=I:Jcwhere I is the n×nidentity matrix. The factorization (1) can then be written as:

A:Jc=QRm×nm×rr×nE2

Another commonly used decomposition is the pivoted LU:

A:Jc=LU.m×nm×rr×nE3

with L a lower triangular and U an upper triangular matrix. In the Octave environment, these decompositions can be constructed, respectively, via the commands (QRI=qrA0;LUI=luA. The matrix P does not need to be explicitly formed. Instead, vector I gives the permutation information. The relation between the two in Octave is given by the command P:I=eyelengthI.

Many matrices in applications turn out to be sparse. They can be stored more efficiently, without the need to store all m×nelements. The simplest sparse format is the so called coordinate sparse format, common to, e.g., the Octave environment. In this format, we store the integers row, column, and floating point value for each nonzero of the sparse matrix A: a set of triplets of the form ijv. However, we do not need to store all the row and column indices of the nonzero elements. Below, we summarize the two commonly used sparse formats for an example matrix.

A=063010807002

The compressed column and row formats for this matrix are given by the vectors:

ic=120012,pc=02356,dc=176382,

and

ir=210230,pr=0246,dr=361827.

In the compressed column format, the dc array stores the nonzero elements, scanned row by row. The array ic stores the row index of the corresponding data element, and the array pc stores the index of the start of each column in the data array dc. Similarly, for the compressed row format, all the column indices of nonzeros are given, but the row information is compressed by giving in pr, the index of the start of each row in dr. Moreover, if needed, the three vectors for the sparse representation above can be further compressed, with, e.g., lossless compression techniques, such as arithmetic coding [2]. BLAS operations on sparse matrices can be performed directly using these storage formats. Below, we list the pseudocode for the operations y1=Ax1and y2=ATx2for a m×nsparse matrix A stored with compressed row format.

function y = mat_mult (A, x)function y = mat_trans_mult (A, x)
y = zeros (m, 1); y = zeros (n, 1);
for i = 1:mfor i = 1:m
for j = pr (i): pr (i + 1)  for j = pr (i): pr (i + 1)
y (i) = y (i) + dr (j) * × (i r (j));    y (i r (j)) = y (i r (j)) + dr (j) * × (i);
end  end
endend
endend

## 4. Direct solves

Given a linear system Ax=bwith a square matrix A which is invertible detA0, the solution x can be constructed through the inverse of A, built up using Gaussian elimination. For relatively small systems, the construction of such solutions is often desired over least squares formulations, when a solution is known to exist. Typically elimination is used to construct the factorization of A into a QR or LU decomposition. The construction of factorizations (QR, LU) with column pivoting can be applied to system solves involving rank deficient matrices. As an example, consider the pivoted QR factorization AP=QR. Here AP=A:I, a rearrangement of the columns of A upon multiplication with permutation matrix P. Plugging into Ax=byields QRPTx=bQRy=bRy=QTb, which is an upper triangular system, and can be solved by back substitution. A simple permutation PTx=yx=Pyyields the solution x.

Similarly, suppose we have the pivoted LU factorization AP=LU. Then, plugging into Ax=byields LUPTx=b. Next, set z=UPTx=Uywith y=PTx. Then Lz=b(which is a lower triangular system) can be solved by forward substitution for z, while Uy=zcan be solved by back substitution for y. Again applying a permutation matrix to x=Pyyields the result. Notice that multiplying by P can be done efficiently, simply be re-arranging the elements of y. The implementations of the back substitution and forward substitution algorithms are given below.

% Solve Lz = b% Solve Uy = z
function z = fwd_sub (L, b)function y = back_sub (U, z)
n = length (b); n = length (z);
z = zeros (n, 1); y = zeros (n, 1);
for i = 1:nfor i = n: −1:1
z (i) = (b (i) − L(i,:) * z) /L(i, i);  y (i) = (z (i) − U(i,:) y) /U(i, i);
endend
endend

## 5. Regularization

### 5.1. Least squares

Prior to discussing two-norm or what is more commonly known as Tikhonov regularization, we mention the least squares problem:

x¯lsq=argminxAxb22E4

This formulation arises due to the noise in the right hand side b, in which case it does not make sense to attempt to solve the system Ax=bdirectly. Instead, if we have an estimate for the noise norm (that is, b=b¯+ewith unknown noise vector e, but we can estimate e2), then we could seek a solution x such that Axb2e2. Let us now look at the solution of (4) in more detail. As the minimization problem is convex and the functional quadratic, we obtain the minimum by setting the gradient of the functional to zero:

xAxb22=0ATAx=ATbE5

A common choice of solution to the quadratic Eq. (5) would be directly through the generalized inverse:

x=ATA+ATb=VΣ2VTVΣUTb=A+b,E6

because of all the solutions to ATAx=ATb,A+bhas the smallest 2-norm: ATAx=ATbif and only if x=A+b+dfor some dkerATA=rangeATA=rangeA+, and

A+b+d2=A+b2+2dTA+b+d2=A+b2+d2A+b2.

Typically, the least squares problem is solved by an iterative method such as conjugate gradients (CG) or related techniques such as LSQR. Whichever way the solution to the normal equations in (5) is obtained, it will be close A+b and share its properties.

First, if A has small singular values, the norm of the solution A+b=VΣ1UTbwill be very large because of the Σ1matrix. Another disadvantage of (4) is that the solution A+b will be very sensitive to any noise in b or even in A (if any approximations in the calculations are used). Suppose the noise vector e behaves like white noise. Its different entries are uncorrelated, each having mean 0 and standard deviation ν. If, in addition, the elements of b¯and e are uncorrelated, we have:

vare=EeEeeEeT=EeeT=ν2I,

and

varb=EbEbbEbT=EeeT=ν2I.

We may then estimate the norm of the variance of the solution vector:

varx=varA+b=varVΣ1UTb=VΣ1UTvarbUΣ1TVT=ν2VΣ2VTvarx2=ν2VΣ2VT2=ν2σmin2,

where σminis the smallest magnitude singular value of A. We can clearly see that when A has an appreciable decay of singular values (such that σminis small relative to σ1), the solution x=A+bwill be sensitive to data errors. For these reasons, adding additional terms (regularization) to the optimization problem is often necessary.

### 5.2. Tikhonov regularization

Having discussed the least squares approach we turn our attention to the simplest form of Tikhonov Regularization:

x¯tik=argminxAxb22+λx22E7

where λ > 0 is a scalar regularization parameter, which controls the tradeoff between the solution norm x2and the residual fit Axb2. Since the functional in brackets is convex, we can get the solution by again setting the gradient to zero:

xAxb22+λx22=02ATAx¯b+2λx¯=0ATA+λIx¯=ATbE8

If we plug in the SVD of A=UΣVT, we get:

x¯=UΣVTTUΣVT+λI1ATb=VΣTΣ+λI1VTVΣTUTb=VΣTΣ+λI1ΣTUTb=VDiagσiσi2+λUTb=VDUTb

We see that the effect of the regularization is to filter the small singular values σi, by replacing each σiby σiσi2+λ, which prevents the singular values smaller than λfrom dominating the solution. If we now compute the norm of the solution variance, we obtain:

varx¯=varVDUTb=VDUTvarbUDVT=v2VD2VTvarx¯2=v2VD2VT2=v2D22v24λ.

The result follows because the function httt2+λsatisfies ht=0at t=±λwith hλ=12λ. Thus, the solution to the system from (8) (even if obtained from a CG type scheme after a finite number of iterations) relieves the problems due to small singular values and noise, which affect the solution from (5).

In practice, a slight generalization of (7) is performed by adding a regularization matrix L:

x¯tikL=argminxAxb22+λ1x22+λ2Lx22E9

Generally, L is some kind of sharpening (difference) operator. The penalization of the term Lx, thus results in smoothing. If the coordinate system x is one-dimensional, it could take the form:

L1=110001000011

When the coordinate system corresponding to the model vector x is higher dimensional, L will be correspondingly more complicated and will generally have block structure. Note that the solution to (9) can be obtained through the linear equations:

ATA+λ1I+λ2LTLx¯=ATbE10

and can also be cast as an augmented least squares system and solved through its corresponding augmented normal equations:

x¯=argminxAλ1Iλ2Lxb0022Aλ1Iλ2LTAλ1Iλ2Lx¯=Aλ1Iλ2LTb00

This is an important point with regards to implementation, as it means that codes which solve the normal equations for standard least squares can be readily modified to solve (9). If L is a smoothing operator, the parameters λ1and λ2effect the degree of norm damping and model smoothing, respectively. Notice that increasing λ2from zero also changes the solution norm, so λ1may need to be altered to compensate.

### 5.3. Sparse regularization and generalized functional

The 2penalty in (7) has the effect of penalizing the solution norm in a way which encourages all coefficients to be small (as λ is increased). For very large λ, only x = 0 would result in the required minimum but for modest values (below, e.g., ATb), the effect would be to force all solution coefficients to have smaller magnitudes with increasing λ, but would not make any of them explicitly zero. In many applications, a sparse solution is sought (where, a significant percentage of the coefficients of x are zero). A so-called 0measure is an indicator function for the number of nonzeros of x. This measure is not a norm, as it does not satisfy, e.g., the basic triangle inequality. Constraining the 0measure (e.g., Axb<ϵ, minx0), leads to a combinatorially difficult optimization problem.

A key insight of compressive sensing is that the minimization with respect to the 0measure and the convex 1norm (the sum of the absolute values of the components of a vector) produce the same result (the sparsest solution) under some strong conditions (i.e., RIP, restricted isometry property) on A [3]. In practice, a nonrandom A from a physical application would not satisfy the RIP conditions. On the other hand, the minimization with respect to the 1norm (and the p-norms for 0 < p < 2, convex only for p1) produce sparse solutions (but not-necessarily the sparsest one at a given residual level). Sample illustrations for the 2d case are given in Figure 2, where we can observe that in two dimensions, the minimization of the pnorm for p1results in one of the two components equal to zero. To account for the possibility of employing a sparse promoting penalty and also for more general treatment of the residual term, which we discuss more below, we will consider the two-parameter functional [4]:

Fl,px=Axbll+λxpp=i=1mj=1mAijxjbil+λi=1nxip,E11

with F˜p=F2,p(with l = 2 being the most-common residual-based penalty). For p < 2, the functional F˜pis not differentiable. This means that the minimum value cannot be obtained by setting the gradient equal to zero as for 2-based minimization. A particularly well-studied example is the 1case, which is the closest convex problem to the 0penalty. Convexity of F˜1xguarantees that any local minimizer is necessarily a global one. In this case, an algorithm can be constructed which decreases the functional value and tends to the (global) minimizer of F˜1x. One such method is called the iterative soft thresholding algorithm (ISTA) and relies on the soft thresholding function Sτx, defined as:

Sτxk=sgnxkmax0xkτ,k=1,,n,xRn.

The benefit of this function is two-fold: it explicitly sets small components of x to zero (promoting sparsity) and is continuous (unlike the related hard thresholding function which simply zeros out all components smaller than τ in absolute value). The soft thresholding function satisfies a useful identity:

Sτb=argminxxb22+2τx1,

which is utilized with a surrogate functional approach to construct the ISTA scheme:

xn+1=Sτxn+ATbATAxn

This algorithm converges to the 2minimizer for any initial guess and with A2(the spectral norm of A) being less than 1 (which can be accomplished simply by rescaling). The spectral norm of a matrix Acan easily be estimated using so called power iteration [1]. Let us assume that Ais square. If it is not we can take the matrix ATAin it’s place and take the square root of the eigenvalue found to be the estimate for the spectral norm. If we take a vector x0and write it as a linear combination of eigenvectors of A, then x0=α1v1++αnvn. It follows that the iterative computation xm=Axm1yields (plugging in Avk=λkvk):

α1λ1mv1++αnλnmvn=λ1m[α1v1+α2(λ2λ1)mv2++λn(λnλ1)mvn]limmxmλ1m=α1v1,

a scalar multiple of the dominant eigenvector. A simple computation yields the dominant eigenvalue. In practice, a much faster converging scheme called FISTA (Fast ISTA) [5] is utilized, which is a slight reformulation of ISTA applied to a linear combination of two previous iterates:

xn+1=Sτyn+ATbATAyn,yn=xn+tn11tnxnxn1,tn+1=1+14tn22,E12

where t1 = 1. This algorithm is simple to implement and readily adapts to parallel architectures. For challenging problems (such as when the decay of singular values of A is rapid), the thresholding function Sτcan be slightly altered (see Figure 3), but the same iterative scheme (12) can be utilized. Two possible approaches are either to vary the thresholding, starting from soft thresholding and slowly approaching the (discontinuous) hard thresholding function, or to use a function which better mimics hard thresholding away from zero. The use of different thresholding functions alters the optimization problem being solved. Thresholding-based techniques are simple to implement but are not effective in all situations, particularly when only a few iterations are feasible (for example, when A is large). In this case, two interesting approaches are iteratively re-weighted least squares (IRLS) and convolution smoothing [4]. Both techniques replace the nonsmooth part of the functional (namely, the absolute value function |x|) by a smooth approximation. Moreover, both techniques have the particular advantage of being able to employ gradient-based methods (such as CG) at each iteration, considerably increasing the per-iteration performance. The IRLS approach is based on the approximation:

xk=xk2xk=xk2xk2xk2xk2+ϵ2

where in the rightmost term, a small ϵ0is used, to insure the denominator is finite, regardless of the value of xk. The resulting algorithm [4] for the minimization of (11) can be written as:

ATRnA+DnTDnxn+1=ATRnb

with two diagonal iteration dependent matrices Dn and Rn. The diagonal matrix Dn has elements 12λpwknand Rn has diagonal elements lrinl2(for i where rin<ϵ, we can set the entry to lϵl2with the choice of ϵ user controllable, tuned for a given application). Here, the residuals rin=Axnbiand the iteration dependent weights are given by:

wkn=1xkn2+ϵn22p2.

The diagonal matrices (or simply the vectors holding the diagonal elements) are updated at each iteration and the system in (6.10) can be solved approximately via a few iterations of CG- or LSQR-based algorithms. Another advantage of the IRLS approach is that the powers p can be made component dependent. This then allows for better inversion of partially sparse signals (if of course, the location of the sparse part can be estimated with some accuracy). An example is illustrated in Figure 4 and further discussed in [8]. Another approach discussed in [4] is based on a smooth approximation of the absolute value function f(t) = |t| obtained via convolution with a sequence of Gaussian kernels, which have approximately shrinking support. The resulting “conv-CG” method is suitable especially for rapid warm start acceleration.

### 5.4. Alternate penalty functions and regularization parameter selection

We mention here the classical image deconvolution problem. Given a blurring source g, such as a 2D Gaussian function, we can produce a blurry image s from an unaltered source f via convolution s=fg+n, where n is some additive noise component. For such situations, a TV (total variation) norm penalty is frequently used, for purposes of noise removal [9]. For a 2-D signal (such as an image), the TV penalty can be written as Vs=i,jsi+1jsi,j2+si,j+1si,j2. Sometimes, the alternate approximation i,jsi+1,jsi,j+si,j+1si,jis utilized. Various iterative schemes have been developed for such penalty functions [9].

Both the 2-based approaches and sparsity promoting regularization schemes (as well as TV-norm penalty functionals) utilize one or more regularization parameters. In the case of Tikhonov regularization with smoothing (as in (9)) more than one parameter is present. In this case, the second (smoothing) parameter can generally be set according to the desired smoothing effect, once the first parameter λ1is chosen (with a fixed value of λ2) and then, λ1can be adjusted to achieve a desired solution norm. Thus, we focus here on techniques to adjust λ1, which we simply refer to as λ.

The standard way to choose the parameter is to use the L-curve technique starting at a large λ(generally a value close to ATbis a good choice) and decreasing down in logarithmic fashion using the logarithmically spaced sequence:

S=logλmaxlogλminN1;λi=explogλmaxSi1,i=1,,N.

The parameter N can vary by application but is typically in the range [5, 10]. Two typical Strategies for parameter selection are employed.

The first is based on a target residual value, typically determined by the estimate of the noise norm. At every λafter the initial value we reuse the previous solution as the initial guess for the CG scheme at the current λ. We can use the solution for which Axλbis closest to the desired residual level (or refine further the solution at this λwith more CG iterations).

If however, the target residual norm is not available, other techniques must be used. We discuss a method using the so-called L-curve where for the norm damping problem (7), we plot a curve composed of points logAxλblogxλwhich we can obtain using the same continuation procedure previously discussed. The curve represents the tradeoff between the residual value Axλband the solution norm xλ. In practice, neither of these quantities should dominate over the other. Hence, an established strategy is to look for the point of maximum curvature along the L-curve [11]. If we set:

ϵ¯=logxλpandρ¯=logAxλbl,E13

where is the solution of (11) at the particular value of λ. We can then compute the curvature by the formula:

c¯λ=2ρ¯ϵ¯ρ¯ϵ¯ρ¯2+ϵ¯232,E14

where the derivative quantities can be approximated via finite differences. We illustrate various plots for a synthetic example in Figure 5. In the residual plot, the target residual is taken to be the magnitude of the noise vector norm. We can also see that the lowest percent error between x¯λand the true x occurs at a value of λ roughly corresponding to the highest curvature of the L-curve. In fact, for this example, the curvature method gives a better estimate of good λ than the residual curve technique.

## 6. Nonlinear least squares (NLS) problems

In many cases, the inverse problem may be posed in terms of a nonlinear function F (x, t) with x a vector of variables, which may be time dependent with parameter t. We first describe, here, the popular Newton-Gauss method for NLS [1]. Let gx=12rx2with rix=yiFxti. Then, the NLS problem takes the form: x¯=argminxgx. Setting gx=0, yields with Newton’s method:

xn+1=xn2gxn1gxn

Expanding the gradient and Hessian of g yields:

gx=i=1mrixrix=JTrx whereJ=Jrx2gx=i=1mrixrixT+i=1mrix2rix=JTJ+TxJTJ.

where Tx=i=1mrix2rixandJrxi:=rixT=FxtiT.The approximation Tx0is used in the expression for the Hessian in the Gauss-Newton method, yielding a simple iterative scheme:

xn+1=xnJnTJn1JnTrn.

Unfortunately, this method is not stable and will typically not converge if initialized far away from a minimum solution [1]. Improvements include the introduction of a step size parameter:

xn+1=xnαnJnTJn1JnTrnαn=argminαgxnαsnwithJnTJnsn=JnTrn.

and of the use of a regularizer (e.g., Levenberg-Marquardt method [1]): where the system JnTJny=JnTrnis replaced by an 2-norm penalty regularized system, JnTJn+λIy˜=JnTrn.

## 7. Low-rank matrix factorizations

In many applications, there are large matrices with rapidly decaying singular values. In such cases, low-rank matrix approximations like the low-rank SVD are useful for compression, speed gains, and data analysis purposes. For ARm×n, the low-rank SVD of rank k (with k < min(m, n)) is the optimal matrix approximation of A in the spectral and Frobenius norms. Taking p = min(m, n), we define the low-rank SVD of rank k by Ak by taking into account only the first k < p singular values and vectors: that is, with UkRm×kconsisting of the first k columns of U,Σk=Diagσ1σkRk×kconsisting of k rows and columns of Σ, and VkRn×kconsisting of the first k columns of V:

Ak=j=1kσjujvjT=UkkVkT,E15
Uk=u1u2uk,Vk=v1v2vk,and=σ10000σ20000σ300000σk

By the Eckart-Young theorem [12]:

AAk=σk+1,

when the error is measured in the spectral norm, and

AAkF=j=k+1pσj21/2

in the Frobenius norm. When k ≪ p, the matrices Uk, Σkand Vkare significantly smaller (cost of storage of all nonzeros is mk + nk + k) than the corresponding full SVD matrices U, Σ, and V (cost of storage is mp + np + p) and that of A (cost of storage is mn, but only some fraction of this if A is sparse). While the construction of Ak is expensive (requiring in most cases the SVD of A), it can be approximated very accurately via randomized algorithms, which requires only the SVD of a smaller matrix. Various randomized algorithms for constructing the low-rank SVD and related factorizations are described in [13]. Techniques for computing the low-rank SVD of a matrix rely on a simple principal. An orthonormal matrix QRm×r(with r = k + l where l is a small oversampling parameter, e.g., l = 5), is computed such that QQTAA. If in fact r is large enough so that QQTA=A, the range of A is a subset of the range of Q. Thus, when QQTAA, we expect the range of Q to capture a good portion of the range of A, a statement which can be made rigorous with some analysis. In this case, we form the smaller matrix B=QTA, where BRr×n, possibly much smaller than the m × n matrix A. Instead of performing the SVD on A, we can obtain the SVD of B=UΣVT. If AQQTA=QB, then AQUΣVTand the later will form a low-rank SVD approximation for A (if we only take the first k singular vectors and values of the corresponding factorization). Notice that when A is rectangular, the eigen-decomposition of the BBT or BT B matrices can be used to construct the approximate low-rank approximation of A.

A separate problem is the construction of a suitable matrix Q from A. Again, the idea is to construct as small (in terms of column number) as possible Q with orthonormal columns, such that Q captures a good chunk of the range of A. When A is a matrix of known rank k then (in MATLAB notation), simply setting Q=qrAΩ0where ΩRn×k+lis a GIID matrix, with l a small over-sampling parameter, produces a valid matrix Q for projection. When the tail singular values (those smaller than σk) are still expected to be significant, a power sampling scheme turns out to be effective. Instead of setting Y = AΩ and performing QR of Y, we use the matrix Y=AATqAΩwith q ≥ 1. Plugging in the SVD of A, we obtain AATqA=UΣ2q+1VT, which has the same eigenvectors as A but much faster decaying singular values. Care must be taken when taking powers of matrices, to prevent multiplying matrices whose singular values are greater than one in magnitude. However, when the rank of the matrix A is not known, it is hard to use this approach, since the optimal size of the random matrix Ω to use would not be known. In this situation, a blocked algorithm can be employed [14], where on output with user supplied ϵ>0parameter, an orthonormal matrix Q and matrix B are produced such that QBA<ϵwhere B = QT A. Then, any number of standard low-rank matrix factorizations can be computed by operating on the matrix B instead of A. The basic steps of the proposed algorithm are given in Figure 6. We note that the resulting Q matrix can be utilized also for purposes of model reduction (e.g., one can use the reduced linear system QT Ax = QT b as an approximation to the full system Ax = b). That is, one can use the reduced linear system QTAx=QTbas an approximation to the full system Ax=b(or to replace Aand bwith the projected values in the least squares formulation); which has applications to e.g. accelerate image deblurring. The construction of Qfor large matrices can in practice be done in parallel by employing the algorithm in [13] over row blocks of the matrix, as illustrated in Figure 6.

The rank-k SVDAk=UkkVkTof a general m×nmatrix A yields an optimal approximation of rank k to A, both in the operator (spectral) and Frobenius norms. On the other hand, even if A is a sparse matrix, the m×kand n×kfactors Uk and Vk are typically dense. This means that if the matrix is approximately p percent filled, the matrix will have approximately N=p100m×nnonzeros. On the other hand, the rank k SVD will consist of approximately mk + k + nk nonzeros. For growing rank k, this quantity will quickly approach and even exceed N. Thus, even though the low-rank SVD is optimal for a given rank k, the choice of rank may be limited to relatively low values with respect to min(m, n) for sparse matrices, in order to achieve any useful compression ratios. (Of course, the usefulness of low-rank SVD representation is not simply limited to compression; indeed they are useful, e.g., for low-dimensional data projections; but the utility of a low-rank approximation is greatly reduced once the storage size of the factors exceeds that of the original matrix). Yet another aspect of the SVD which may be problematic is the difficulty in interpreting the eigenvectors present in Uk and Vk. While in many applications these have distinct meanings, they are not often easy to interpret for a particular data set.

It is thus plausible, in the above context, to look for factorizations which may not be optimal for rank k, but which may preserve useful properties of A such as sparsity and non-negativity, as well as allow easier interpretation of its components. Such properties may be found in the one- and two-sided interpolative decompositions and the CUR decomposition based on the pivoted QR decomposition. If we stop the QR procedure after the first k iterations, we obtain:

A:Jc=krkmQ1Q2×nkrkS1S2=Q1S1+Q2S2.E16
S1=knkkS11S12 and S2=knkk0S22E17
i.e.S=knkkrkS11S120S22E18
A:JC=Q1S11S12+Q20S22=mQ1S11kQ1S12+nkQ2S22.

From this formulation, we set

CA:JC1:k=Q1S11.Q1S1=Q1S11Q1S12=Q1S11IkS111S12=CIkTl,

where Tl is the solution to the matrix equation S11Tl = S12 (which if solved for Tl a column at a time, is simply a set of linear systems). It follows that we can write:

ACVT,whereVT=IkTlPT.E19

The one-sided ID of (rank k) is the approximate factorization:

AA:Jc1:kVT,m×nm×kk×nE20

where we use a partial column skeleton C=A:Jc1:kof a subset of the columns of A and V is a well-conditioned matrix. Clearly, C simply represents a subset of the columns of A chosen based on the pivoting strategy used in the QR factorization. The typical pivoting strategy is to choose the permutation matrix P (which simply dictates the re-arrangement of the columns of A) such that if S22 above is omitted, yielding:

APQ1S11S12+Q200=Q1S˜,

then the components of S˜satisfy s˜11s˜22s˜nm. Several other pivoting strategies can be employed and each will yield a somewhat different re-arrangement of the columns of A.

Once the single-sided ID is defined, the two-sided ID can be constructed simply by obtaining a one-sided ID of A and that of AT. A set of select columns of AT obtained by this procedure, will be the same as the set of select rows of A. Thus, we can write the two-sided ID of (rank k) as:

AWAJr1:kJc1:kVT,m×nm×kk×kk×nE21

The procedure for the construction of the interpolative decompositions can be accelerated by means of randomization, just like for the low-rank SVD. This is possible by virtue of the result below [13].

Lemma 1 Let Ω˜Rl×mbe a matrix with GIID entries. Then, for any aRm, we have that EΩ˜a2a2=land VarΩ˜a2a2=2l.

Suppose, A is m×nand we draw a l×mGIID matrix Ω˜. Suppose, we then form the l×nmatrix Z=Ω˜A. Then, EZ:j2A:j2=l. As the pivoting result depends heavily on the ratio of the individual column norms of A with respect to one another, the above result tells us that the ratio of column norms is roughly preserved in a matrix resulting from the multiplication of the original matrix by a Gaussian random matrix from the left. As the product matrix consists of fewer rows than the original matrix, the pivoted QR factorization is correspondingly cheaper to perform on the product matrix Z than on A, while the resulting permutation matrix (really the re-arrangement vector) will be similar for both cases.

The two-sided ID allows us to construct the popular Column/Row skeleton CUR (rank k) decomposition:

ACURm×nm×kk×kk×nE22

Suppose, we compute a two-sided rank k ID factorization forming the k×kcolumn/row skeleton A(Jr(1: k), Jc(1: k)). Set:

C=A:JC1:kandR=AJr1:k:

We then set this to equal the factors C and R in CUR:

CUR=A:Jc1:kUAJr1:k:A:Jc1:kVTE23

where we take U to satisfy the system UR = V T: In Figure 7, we compare the relative errors obtained with different approximations at the same rank. For matrices with mild singular value decay, the low-rank SVD obtained via a randomized scheme (with oversampling) gives significantly closer to optimal performance (to true truncated SVD) than other decompositions.

## 8. An introduction to Backus-Gilbert inversion

As previously mentioned, damped least-squares (DLS) techniques are commonly exploited to solve linear, discrete inversion problems, such as those encountered in seismic tomography [15, 16]. To break the nonuniqueness of the least-squares solution, DLS inversion schemes often rely on ad hoc regularization strategies (e.g., model norm damping or smoothing), designed to subjectively favor the model simplicity. However, in regions of poor seismic data coverage, DLS methods may lead to locally biased model solutions—potentially causing model misinterpretations [17]. In other words, DLS models may represent “biased averages” over the true-model parameters. Most geotomographical studies suffer from uneven data coverages, and thus are concerned by these averaging bias effects. For example, teleseismic body-wave ray-paths irregularly sample the Earth’s interior, because earthquakes typically are concentrated along oceanic ridges or subduction zones and seismometers are located over continental areas or oceanic islands.

A fundamentally different approach is that of linear Backus-Gilbert inversion [18, 19, 20], which belongs to the class of optimally localized averages (OLA) methods. In the discrete version of the Backus-Gilbert theory, one aims at evaluating (weighted) averages of the true-model parameters. That is, the Backus-Gilbert method seeks to determine unbiased model estimates. Over the past half century, many authors have considered that, in addition to being computationally very intensive, it could be a clumsy affair in the presence of data errors to practically implement the Backus-Gilbert method to (large-scale) tomographic applications [15, 21, 22, 23]. In the following, we aim to describe a recently developed—and (finally!) computationally tractable—tomographic approach [10] based on the Backus-Gilbert philosophy.

The SOLA (subtractive optimally localized averages) method [24, 25] is a variant of the original Backus-Gilbert approach, which has been exploited to solve helioseismic inversion problems [26, 27]. As a remark, Pijpers and Thompson [24] termed this alternative the SOLA method, though it may have been rediscovered independently by different authors [28, 29]. SOLA retains all the advantages of the original Backus-Gilbert method, but is much more computationally efficient and versatile in the construction of resolving (averaging) kernels. Recently, SOLA has been introduced and adapted to large-scale, linear and discrete “tomographic” problems by Zaroli [10]. We now briefly review the SOLA inversion scheme, tailored to seismic tomography.

In this section, let us slightly change the notations about linear inverse problems, to keep closer with those preferred in the geosciences community [10, 17]. Let us consider linear, discrete forward problems of the form:

d=Gm+n,E24

where d=di1iNdenotes the data, G=Gij1i,jN,Mthe sensitivity matrix, m=mj1jMthe true-model parameters, and n=ni1iNthe noise. The sensitivity matrix elements are the partial derivatives of the data with respect to the model parameters: Gij=di/mj. Typically, in “large-scale” tomographic studies, one may have to deal with M105model parameters and N106data. Let us consider, without loss of generality, that the data are time-residuals, the model parameters are velocity anomalies, the model space is parametrized using regular-size cells (local and “orthonormal” parameterization), the noise is randomly drawn from a normal distribution N0σn, and the data covariance matrix is Cd=σn2IN. For local and “irregular” parametrizations, the reader is referred to [10]. It is a common practice to normalize both the data and sensitivity matrix by the data errors; thus Cd=IN.

One aims to find a model estimate, m̂, that can be expressed as a linear combination of the data:

m̂=Ĝd,

where the matrix Ĝdenotes some generalized inverse. The model estimate can be decomposed as

m̂model estimate=R̂mfiltered true model+Ĝnpropagated noise,E25

where

R̂=ĜG,E26

is often referred to as the model resolution matrix. The first term in right member of (25), R̂m, represents the filtered true model, and shows our inability, if R̂IM, to perfectly recover the true model. Here, we refer to the k-th row of the resolution matrix, R̂k.=R̂kj1jM, as the resolving kernel that linearly relates the k-th parameter estimate, m̂k, to the true-model parameters:

m̂k=j=1MR̂kjmj,ignoring the term of propagated noise.E27

Therefore, we wish that R̂mrepresents an unbiased averaging over the true model parameters, m. This means that, for any parameter index k1M, we wish that R̂k. is non-negative and satisfies to

j=1MR̂kj=1.E28

The second term in right member of (25), Ĝn, denotes the propagated noise (i.e. the propagation of data errors) into the model estimate. Robust model interpretations require accurate appraisals of model estimates, that is to compute and carefully analyze both R̂and the model covariance matrix

Cm̂=ĜCdĜT.E29

As a remark, for DLS models this would also mean to quantify averaging bias effects (if any)—see [17]. The model estimate m̂, resolution R̂, and covariance Cm̂can be inferred from the generalized inverse Ĝ; efficiently computing the full generalized inverse is then crucial for any linear inverse problem. As we shall see, in the “SOLA Backus-Gilbert” approach the generalized inverse is directly determined.

The original Backus-Gilbert scheme consists in constructing the most peak-shaped resolving kernel (peaked around each model parameter location), while moderating at most the propagated noise into the model estimate. The key idea in the SOLA method is to specify an a priori “target form” for each resolving (averaging) kernel. One needs to specify M target resolving-kernels (hereafter, target kernels) such that their spatial extent represents some a priori estimate of the spatial resolving-length (around each parameter location). As an example, for 2-D tomographic studies the simplest target form could be circular (isotropic resolving-length); each target kernel would be constant inside such a circle and zero outside. Rather than minimizing the spread of each resolving kernel, as in the original Backus-Gilbert formulation, in the SOLA approach one aims at minimizing the integrated squared difference between each resolving kernel and its associated target kernel. Each row of the SOLA generalized inverse is individually computed by solving a specific minimization problem—the full computation of Ĝis then extremely parallel. The k-th row, Ĝk.=Ĝki1iN, is found such that:

minĜk.j=1MR̂kjTjk2resolution misfit + ηk2σm̂k2model variance,s.t.j=1MR̂kj=1,E30

where ηkand tk=Tjk1jMare the k-th tradeoff parameter (resolution misfit versus model variance) and target resolving-kernel vector, respectively; k is the index of considered model parameter. Because of the additional constraint in (30), the k-th parameter estimate, m̂k, is expected to be unbiased (provided that its corresponding resolving kernel is (mostly) non-negative)—so for the model estimate m̂. Though not strictly necessary, here all M target kernels are imposed to be unimodular:

j=1MTjk=1,k1M.E31

The system to be solved for the k-th row of the SOLA generalized inverse then writes as follows:

GGT+ηk2INĜk.=Gtk,s.t.j=1Mi=1NĜkiGij=1.E32

As a remark, since only a single (k-th) parameter index is treated at a time in (32), it could be difficult to ensure that all M selected values for the tradeoff parameters (ηk) would lead to “globally coherent” model solutions. However, it seems [10, 17] that globally coherent tomographic images can be obtained when using: (1) target kernels whose size is tuned to the spatially irregular data coverage (for instance using seismic ray-paths density as a proxy for the spatial variations of the local resolving-length); and (2) constant-valued tradeoff parameters, that is:

ηk=η,k1M.E33

In practice, it seems that ηmay (roughly) be determined from analyzing a few curves of tradeoff between jR̂kjTjk2and σm̂k2, for some randomly chosen parameter index (k). Let us now define the following quantities [10, 17, 30]:

xk=xik1iN,xik=Ĝkix̂k=xik2iNci=j=1MGijc=ci1iN,ĉ=ci/c12iNe1=δi11iNB=ĉTIN1Qη=GTBηĉTykη=tkc11GTe1c11η,E34

where c1 is assumed to be nonzero and δdenotes the Kronecker symbol. Solving (32) therefore consists in solving for x̂kthe following normal equations:

QηηIN1X̂k=ykη0N1,E35

using for instance the LSQR algorithm [31], and then to infer the final solution x(k) (i.e., the k-th row of the SOLA generalized inverse) from x̂ksuch that:

xk=Bx̂k+c11e1.E36

Last, but not least, we now aim to discuss about the computational efficiency of the SOLA approach for computing the full generalized inverse (see [10]). First, the rows of the generalized inverse matrix can be computed in parallel on P processors, so that computing all M rows would take t×M/PCPU-time, where t is the average CPU-time to numerically solve (35). A crucial point is that the matrix Qη, of size (M + 1) × (N − 1), does not depend on the parameter index (k), so that it does not need to be recomputed M times—as it was required in the original Backus-Gilbert approach (see [10]). The vector ykηhas to be recomputed M times, but that task is computationally cheap. Qηand ykηcan easily be reconstructed if one aims at investigating different ηvalues (only the last row of Qηand last element of ykηdepend on η). Finally, simply re-ordering the rows of the sensitivity matrix G (and corresponding data), such that the first row of G is the sparsest one, allows the matrix Qηto be almost as sparse as G—this sparsity property is very useful when solving (35), in terms of storage, efficiency of the LSQR algorithm, and memory footprint.

Figure 8 shows an example of the SOLA method applied to global-scale seismic tomography [10], for which there are M = 38, 125 model parameters and N = 79, 765 data (teleseismic shear-wave time-residuals). Tomographic images represent isotropic, 3–D shear-wave velocity perturbations within the whole Earth’s mantle (with respect to some reference, radial absolute velocity model). Figure 8a and b displays the tomographic model m̂, at about 600 km depth, and its uncertainty σm̂computed as

σm̂=σm̂k1kM,σm̂k=i=1NĜki2,E37

(since the data are normalized by their errors), respectively. The form of each target kernel is that of a 3-D spheroid, corresponding to a priori lateral and radial resolving lengths that may be expected locally, at best, given the data coverage. Let us focus on the k′-th model parameter, marked by a green dot in Figure 8; a zoom-in on the tomographic model is shown in Figure 8c. Horizontal (600 km depth) and vertical cross-sections through the k′-th target kernel are displayed in Figure 8d and e, respectively. The corresponding k′-th resolving (averaging) kernel is similarly displayed in Figure 8f and g.

Finally, the “SOLA Backus-Gilbert” approach, introduced and adapted to large-scale, linear, discrete tomographic problems by Zaroli [10], allows to efficiently compute unbiased models, including their full resolution and covariance—enabling quantitative model interpretations [17].

## 9. Conclusion

In this work, we have presented several techniques useful to the practitioner in the field of inverse problems, with the aim to give an idea of when and how these techniques should be employed for various linear and nonlinear applications. We have discussed techniques such as sparse matrix storage, the use of pivoted factorizations for direct solves, 2, 1and intermediate penalty-based regularization strategies, nonlinear least squares problems, the construction and use of low-rank factorizations, and an application of the Backus-Gilbert inversion approach tailored to seismic tomography.

chapter PDF
Citations in RIS format
Citations in bibtex format

## More

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## How to cite and reference

### Cite this chapter Copy to clipboard

Sergey Voronin and Christophe Zaroli (May 30th 2018). Survey of Computational Methods for Inverse Problems, Recent Trends in Computational Science and Engineering, M. Serdar Çelebi, IntechOpen, DOI: 10.5772/intechopen.73332. Available from:

### chapter statistics

2Crossref citations

### Related Content

#### Recent Trends in Computational Science and Engineering

Edited by Serdar Celebi

Next chapter

#### Evaluating the Hypersonic Leading-Edge Phenomena at High Reynolds and Mach Numbers

By Frederick Ferguson, Julio Mendez and David Dodoo-Amoo

#### Numerical Simulations of Physical and Engineering Processes

Edited by Jan Awrejcewicz

First chapter

#### Numerical Solution of Many-Body Wave Scattering Problem for Small Particles and Creating Materials with Desired Refraction Coefficient

By M. I. Andriychuk and A. G. Ramm

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

View all Books