
Survey of Computational Methods for Inverse Problems

Written By

Sergey Voronin and Christophe Zaroli

Submitted: 14 August 2017 Reviewed: 21 December 2017 Published: 30 May 2018

DOI: 10.5772/intechopen.73332

From the Edited Volume

Recent Trends in Computational Science and Engineering

Edited by M. Serdar Çelebi


Abstract

Inverse problems occur in a wide range of scientific applications, such as in the fields of signal processing, medical imaging, or geophysics. This work aims to present to the field practitioners, in an accessible and concise way, several established and newer cutting-edge computational methods used in the field of inverse problems—and when and how these techniques should be employed.

Keywords

  • inverse problems
  • matrix factorizations
  • regularization
  • parameter estimation
  • model appraisal
  • seismic tomography

1. Introduction

In this work, we aim to survey several techniques useful to a practitioner in the field of inverse problems, where a vector of interest is obtained from a linear system Ax = b or from a set of nonlinear equations F(x) = 0. In our presentation below, we review both classical results and newer approaches with which the reader may not be familiar. In particular, this chapter offers entries on the following material:

  • Matrix factorizations and sparse matrices

  • Direct solves and pivoted factorizations

  • Least squares problems and regularization

  • Nonlinear least squares problems

  • Low-rank matrix factorizations and randomized algorithms

  • An introduction to Backus-Gilbert inversion


2. Notation

In this chapter, we use ‖·‖ to refer to the spectral or operator norm, and ‖·‖_p to refer to the ℓ_p norm. We make frequent use of the QR decomposition and the SVD (singular value decomposition). For any M×N matrix A and (ordered) subindex sets J_r and J_c, A(J_r, J_c) denotes the submatrix of A obtained by extracting the rows and columns of A indexed by J_r and J_c, respectively; and A(:, J_c) denotes the submatrix of A obtained by extracting the columns of A indexed by J_c. We will make use of the covariance and variance matrices, which we define as follows in terms of the expected value:

cov(x, y) = E[(x − E[x])(y − E[y])^T]   and   var(x) = cov(x, x).

By “Diag,” we refer to a diagonal matrix with nonzeros only on the diagonal. We make use of the so-called GIID matrices: matrices with independent and identically distributed entries drawn from the Gaussian distribution. In the Octave environment (which we frequently reference), these can be obtained with the “randn” command. We assume the use of real matrices, although most techniques we describe extend to the complex case.


3. Matrix factorizations and sparse matrices

Let A be an M×N matrix with real or complex entries. We will make use of the singular value decomposition (SVD) of a real matrix A ∈ R^{M×N}: if A is of rank r ≤ min(M, N), then there exist U ∈ R^{M×r}, V ∈ R^{N×r}, and Σ ∈ R^{r×r} such that

  1. U^T U = I, V^T V = I,

  2. Σ = Diag(σ_1, σ_2, …, σ_r) ∈ R^{r×r} is a diagonal matrix with σ_1 ≥ σ_2 ≥ … ≥ σ_r > 0, and

  3. A = UΣV^T.

This is known as the economic form of the SVD [1]. For 1 ≤ i ≤ min(M, N), the i-th largest singular value of A is defined to be σ_i, with σ_j = 0 for j = r+1, …, min(M, N) whenever r < min(M, N). The generalized inverse of A ∈ R^{M×N} with SVD A = UΣV^T is defined as A^+ = VΣ^{−1}U^T (with Σ^{−1} = Diag(σ_1^{−1}, σ_2^{−1}, …, σ_r^{−1}) ∈ R^{r×r}). By a rank deficient matrix, we mean one whose singular values σ_i exhibit appreciable decay; in this case, the numerical rank of A may be smaller than the actual rank due to the use of finite precision arithmetic.

The (compact) QR-factorization of A takes the form

AP = QR,   where A is m×n, P is n×n, Q is m×r, and R is r×n,   (1)

where P is a permutation matrix, Q has orthonormal columns, and R is upper triangular. The permutation matrix P can more efficiently be represented via a vector J_c ∈ Z_+^n of indices such that P = I(:, J_c), where I is the n×n identity matrix. The factorization (1) can then be written as:

A(:, J_c) = QR,   with Q of size m×r and R of size r×n.   (2)

Another commonly used decomposition is the pivoted LU:

A(:, J_c) = LU,   with L of size m×r and U of size r×n,   (3)

with L a lower triangular and U an upper triangular matrix. In the Octave environment, these decompositions can be constructed, respectively, via the commands [Q, R, I] = qr(A, 0) and [L, U, I] = lu(A, 'vector') (note that Octave's lu pivots rows rather than columns, so that A(I, :) = LU). The matrix P does not need to be explicitly formed; instead, the vector I gives the permutation information. For the column-pivoted QR, the relation between the two in Octave is P = E(:, I) with E = eye(length(I)), so that A*P = Q*R.
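A small Octave check of these pivoted factorizations and of the permutation bookkeeping is sketched below (a hypothetical example on a random matrix; the variable names are ours):

A = randn(5, 5);
[Q, R, I] = qr(A, 0);          % column-pivoted economic QR: A(:,I) = Q*R
norm(A(:,I) - Q*R)             % ~0, up to roundoff
E = eye(length(I));
P = E(:, I);                   % rebuild the permutation matrix P = I(:,Jc)
norm(A*P - Q*R)                % ~0
[L, U, Pl] = lu(A);            % row-pivoted LU: Pl*A = L*U
norm(Pl*A - L*U)               % ~0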

Many matrices in applications turn out to be sparse. They can be stored more efficiently, without the need to store all m×n elements. The simplest sparse format is the so-called coordinate format, common to, e.g., the Octave environment. In this format, we store the integer row and column indices and the floating point value for each nonzero of the sparse matrix A: a set of triplets of the form (i, j, v). However, we do not need to store all the row and column indices of the nonzero elements. Below, we summarize the two commonly used compressed sparse formats for an example matrix.

A = [ 0  6  3  0
      1  0  8  0
      7  0  0  2 ]

The compressed column and row formats for this matrix are given by the vectors:

i_c = [1, 2, 0, 0, 1, 2],   p_c = [0, 2, 3, 5, 6],   d_c = [1, 7, 6, 3, 8, 2],

and

i_r = [2, 1, 0, 2, 3, 0],   p_r = [0, 2, 4, 6],   d_r = [3, 6, 1, 8, 2, 7].

In the compressed column format, the d_c array stores the nonzero elements, scanned column by column. The array i_c stores the row index of the corresponding data element, and the array p_c stores the (zero-based) index of the start of each column in the data array d_c. Similarly, for the compressed row format, all the column indices of nonzeros are given in i_r, but the row information is compressed by giving in p_r the index of the start of each row in d_r. Moreover, if needed, the three vectors of such a sparse representation can be further compressed with, e.g., lossless compression techniques such as arithmetic coding [2]. Matrix-vector (BLAS-type) operations on sparse matrices can be performed directly using these storage formats. Below, we list Octave code for the operations y_1 = A x_1 and y_2 = A^T x_2 for an m×n sparse matrix A stored in compressed row format.

function y = mat_mult (A, x)        % y = A*x, with A a struct holding the
  y = zeros (A.m, 1);               % compressed row arrays dr, ir, pr (zero-based)
  for i = 1:A.m
    for j = A.pr(i)+1 : A.pr(i+1)   % nonzeros of row i
      y(i) = y(i) + A.dr(j) * x(A.ir(j)+1);
    end
  end
end
function y = mat_trans_mult (A, x)  % y = A'*x, same storage
  y = zeros (A.n, 1);
  for i = 1:A.m
    for j = A.pr(i)+1 : A.pr(i+1)
      y(A.ir(j)+1) = y(A.ir(j)+1) + A.dr(j) * x(i);
    end
  end
end
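As a quick (hypothetical) usage example, the 3×4 matrix above can be packed into such a struct and the two products checked against the dense version:

A_dense = [0 6 3 0; 1 0 8 0; 7 0 0 2];
A.m = 3;  A.n = 4;
A.dr = [3 6 1 8 2 7];  A.ir = [2 1 0 2 3 0];  A.pr = [0 2 4 6];
x1 = randn(4, 1);  x2 = randn(3, 1);
norm(mat_mult(A, x1) - A_dense * x1)          % ~0
norm(mat_trans_mult(A, x2) - A_dense' * x2)   % ~0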


4. Direct solves

Given a linear system Ax = b with a square matrix A which is invertible (det(A) ≠ 0), the solution x can be constructed through the inverse of A, built up using Gaussian elimination. For relatively small systems, the construction of such solutions is often preferred over least squares formulations, when a solution is known to exist. Typically, elimination is used to construct the factorization of A into a QR or LU decomposition. The construction of factorizations (QR, LU) with column pivoting can be applied to system solves involving rank deficient matrices. As an example, consider the pivoted QR factorization AP = QR. Here AP = A(:, I), a rearrangement of the columns of A upon multiplication with the permutation matrix P. Plugging into Ax = b yields QRP^T x = b; setting y = P^T x gives QRy = b and hence Ry = Q^T b, which is an upper triangular system and can be solved by back substitution. The simple permutation P^T x = y ⟹ x = Py then yields the solution x.

Similarly, suppose we have the pivoted LU factorization AP = LU. Then, plugging into Ax = b yields LUP^T x = b. Next, set z = UP^T x = Uy with y = P^T x. Then Lz = b (which is a lower triangular system) can be solved by forward substitution for z, while Uy = z can be solved by back substitution for y. Again applying the permutation x = Py yields the result. Notice that multiplying by P can be done efficiently, simply by re-arranging the elements of y. Implementations of the forward and back substitution algorithms are given below.

function z = fwd_sub (L, b)     % solve the lower triangular system L z = b
  n = length (b);
  z = zeros (n, 1);
  for i = 1:n
    z(i) = (b(i) - L(i,:) * z) / L(i,i);
  end
end
function y = back_sub (U, z)    % solve the upper triangular system U y = z
  n = length (z);
  y = zeros (n, 1);
  for i = n:-1:1
    y(i) = (z(i) - U(i,:) * y) / U(i,i);
  end
end
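A minimal (hypothetical) driver tying this to the pivoted QR solve described above:

n = 6;
A = randn(n);  b = randn(n, 1);
[Q, R, I] = qr(A, 0);          % A(:,I) = Q*R
y = back_sub(R, Q' * b);       % solve R y = Q' b
x = zeros(n, 1);  x(I) = y;    % undo the column permutation, x = P y
norm(A * x - b)                % ~0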


5. Regularization

5.1. Least squares

Prior to discussing two-norm regularization, more commonly known as Tikhonov regularization, we mention the least squares problem:

x̄_lsq = argmin_x ‖Ax − b‖_2^2   (4)

This formulation arises due to noise in the right-hand side b, in which case it does not make sense to attempt to solve the system Ax = b exactly. Instead, if we have an estimate for the noise norm (that is, b = b̄ + e with unknown noise vector e, but a known estimate of ‖e‖_2), then we could seek a solution x such that ‖Ax − b‖_2 ≈ ‖e‖_2. Let us now look at the solution of (4) in more detail. As the minimization problem is convex and the functional quadratic, we obtain the minimum by setting the gradient of the functional to zero:

∇_x ‖Ax − b‖_2^2 = 0  ⟹  A^T A x = A^T b   (5)

A common choice of solution to the normal equations (5) is given directly through the generalized inverse:

x = (A^T A)^+ A^T b = VΣ^{−2}V^T VΣU^T b = VΣ^{−1}U^T b = A^+ b,   (6)

because, of all the solutions to A^T A x = A^T b, A^+ b has the smallest 2-norm: A^T A x = A^T b if and only if x = A^+ b + d for some d ∈ ker(A^T A) = range(A^T A)^⊥ = range(A^+)^⊥, and

‖A^+ b + d‖^2 = ‖A^+ b‖^2 + 2 d^T A^+ b + ‖d‖^2 = ‖A^+ b‖^2 + ‖d‖^2 ≥ ‖A^+ b‖^2.

Typically, the least squares problem is solved by an iterative method such as conjugate gradients (CG) or related techniques such as LSQR. Whichever way the solution to the normal equations in (5) is obtained, it will be close to A^+ b and share its properties.

First, if A has small singular values, the norm of the solution A^+ b = VΣ^{−1}U^T b will be very large because of the Σ^{−1} factor. Another disadvantage of (4) is that the solution A^+ b will be very sensitive to any noise in b, or even in A (if any approximations in the calculations are used). Suppose the noise vector e behaves like white noise: its entries are uncorrelated, each having mean 0 and standard deviation ν. If, in addition, the elements of b̄ and e are uncorrelated, we have:

var(e) = E[(e − E[e])(e − E[e])^T] = E[ee^T] = ν^2 I,

and

var(b) = E[(b − E[b])(b − E[b])^T] = E[ee^T] = ν^2 I.

We may then estimate the norm of the variance of the solution vector:

var(x) = var(A^+ b) = var(VΣ^{−1}U^T b) = VΣ^{−1}U^T var(b) UΣ^{−1}V^T = ν^2 VΣ^{−2}V^T  ⟹  ‖var(x)‖_2 = ν^2 ‖VΣ^{−2}V^T‖_2 = ν^2 σ_min^{−2},

where σmin is the smallest magnitude singular value of A. We can clearly see that when A has an appreciable decay of singular values (such that σmin is small relative to σ1), the solution x=A+b will be sensitive to data errors. For these reasons, adding additional terms (regularization) to the optimization problem is often necessary.
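This sensitivity is easy to reproduce in a few lines of Octave (a hypothetical illustration; the synthetic matrix, its decay rate, and the noise level are our own choices):

m = 100;  n = 50;
[U, ~] = qr(randn(m, n), 0);            % orthonormal columns
[V, ~] = qr(randn(n));                  % orthogonal matrix
s = logspace(0, -6, n);                 % singular values decaying to 1e-6
A = U * diag(s) * V';
x_true = randn(n, 1);
b = A * x_true + 1e-4 * randn(m, 1);    % small white noise, nu = 1e-4
x_lsq = pinv(A) * b;                    % x = A^+ b
norm(x_lsq - x_true) / norm(x_true)     % typically >> 1 despite the small noise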

5.2. Tikhonov regularization

Having discussed the least squares approach, we turn our attention to the simplest form of Tikhonov regularization:

x̄_tik = argmin_x { ‖Ax − b‖_2^2 + λ‖x‖_2^2 }   (7)

where λ > 0 is a scalar regularization parameter, which controls the tradeoff between the solution norm ‖x‖_2 and the residual fit ‖Ax − b‖_2. Since the functional in brackets is convex, we can get the solution by again setting the gradient to zero:

∇_x ( ‖Ax − b‖_2^2 + λ‖x‖_2^2 ) = 0  ⟹  2A^T(Ax̄ − b) + 2λx̄ = 0  ⟹  (A^T A + λI) x̄ = A^T b   (8)

If we plug in the SVD of A = UΣV^T, we get:

x̄ = ( (UΣV^T)^T UΣV^T + λI )^{−1} A^T b = V(Σ^T Σ + λI)^{−1} V^T VΣ^T U^T b = V(Σ^T Σ + λI)^{−1} Σ^T U^T b = V Diag( σ_i/(σ_i^2 + λ) ) U^T b = V D U^T b

We see that the effect of the regularization is to filter the small singular values σ_i, by replacing each 1/σ_i in the generalized inverse by σ_i/(σ_i^2 + λ), which prevents the singular values smaller than about √λ from dominating the solution. If we now compute the norm of the solution variance, we obtain:

var(x̄) = var(VDU^T b) = VDU^T var(b) UDV^T = ν^2 VD^2V^T  ⟹  ‖var(x̄)‖_2 = ν^2 ‖VD^2V^T‖_2 = ν^2 ‖D‖_2^2 ≤ ν^2/(4λ).

The result follows because the function h(t) = t/(t^2 + λ) satisfies h′(t) = 0 at t = ±√λ, with h(√λ) = 1/(2√λ). Thus, the solution to the system from (8) (even if obtained from a CG-type scheme after a finite number of iterations) relieves the problems due to small singular values and noise which affect the solution from (5).

In practice, a slight generalization of (7) is performed by adding a regularization matrix L:

x̄_tikL = argmin_x { ‖Ax − b‖_2^2 + λ_1‖x‖_2^2 + λ_2‖Lx‖_2^2 }   (9)

Generally, L is some kind of sharpening (difference) operator; the penalization of the term Lx thus results in smoothing. If the coordinate system of x is one-dimensional, it could take the form:

L_1 = [ −1   1   0   ⋯   0
         0  −1   1   ⋯   0
         ⋮         ⋱   ⋱
         0   ⋯   0  −1   1 ]

When the coordinate system corresponding to the model vector x is higher dimensional, L will be correspondingly more complicated and will generally have block structure. Note that the solution to (9) can be obtained through the linear equations:

(A^T A + λ_1 I + λ_2 L^T L) x̄ = A^T b   (10)

and can also be cast as an augmented least squares system and solved through its corresponding augmented normal equations:

x̄ = argmin_x ‖ [A; √λ_1 I; √λ_2 L] x − [b; 0; 0] ‖_2^2  ⟹  [A; √λ_1 I; √λ_2 L]^T [A; √λ_1 I; √λ_2 L] x̄ = [A; √λ_1 I; √λ_2 L]^T [b; 0; 0],

where [·; ·; ·] denotes vertical stacking. This is an important point with regards to implementation, as it means that codes which solve the normal equations for standard least squares can be readily modified to solve (9). If L is a difference (sharpening) operator, the parameters λ_1 and λ_2 control the degree of norm damping and model smoothing, respectively. Notice that increasing λ_2 from zero also changes the solution norm, so λ_1 may need to be altered to compensate.
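A minimal Octave sketch of this implementation point (a hypothetical example: the sizes, the values of λ_1 and λ_2, and the use of the 1-D difference operator L_1 are our own choices; backslash performs the least squares solve):

n = 200;
A = randn(100, n);  b = randn(100, 1);
lambda1 = 1e-2;  lambda2 = 1e-1;
L = diff(eye(n));                       % (n-1) x n first-difference operator L_1
Aaug = [A; sqrt(lambda1) * eye(n); sqrt(lambda2) * L];
baug = [b; zeros(n, 1); zeros(n - 1, 1)];
x = Aaug \ baug;   % same minimizer as (A'A + lambda1*I + lambda2*L'L) x = A'b in (10)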

5.3. Sparse regularization and generalized functional

The ℓ_2 penalty in (7) has the effect of penalizing the solution norm in a way which encourages all coefficients to be small (as λ is increased). For very large λ, only x = 0 would result in the required minimum, but for modest values (below, e.g., ‖A^T b‖), the effect is to force all solution coefficients to have smaller magnitudes with increasing λ, without making any of them explicitly zero. In many applications, a sparse solution is sought (where a significant percentage of the coefficients of x are zero). The so-called ℓ_0 measure counts the number of nonzeros of x. This measure is not a norm, as it does not satisfy, e.g., the basic triangle inequality. Constraining the ℓ_0 measure (e.g., min ‖x‖_0 subject to ‖Ax − b‖ < ϵ) leads to a combinatorially difficult optimization problem.

Figure 1.

Comparison of geophysical models recovered via (10) with λ_2 = 0 and λ_2 > 0.

A key insight of compressive sensing is that minimization with respect to the ℓ_0 measure and with respect to the convex ℓ_1 norm (the sum of the absolute values of the components of a vector) produce the same result (the sparsest solution) under some strong conditions on A (i.e., the RIP, restricted isometry property) [3]. In practice, a nonrandom A from a physical application would typically not satisfy the RIP conditions. On the other hand, minimization with respect to the ℓ_1 norm (and the ℓ_p norms for 0 < p < 2, convex only for p ≥ 1) produces sparse solutions (but not necessarily the sparsest one at a given residual level). Sample illustrations for the 2-D case are given in Figure 2, where we can observe that in two dimensions, the minimization of the ℓ_p norm for p ≤ 1 results in one of the two components equal to zero. To account for the possibility of employing a sparsity-promoting penalty and also for a more general treatment of the residual term, which we discuss more below, we will consider the two-parameter functional [4]:

F_{l,p}(x) = ‖Ax − b‖_l^l + λ‖x‖_p^p = Σ_{i=1}^m | Σ_{j=1}^n A_ij x_j − b_i |^l + λ Σ_{i=1}^n |x_i|^p,   (11)

with F̃_p = F_{2,p} (l = 2 being the most common residual-based penalty). For p < 2, the functional F̃_p is not differentiable. This means that the minimum value cannot be obtained by setting the gradient equal to zero, as for ℓ_2-based minimization. A particularly well-studied example is the ℓ_1 case, which is the closest convex problem to the ℓ_0 penalty. Convexity of F̃_1(x) guarantees that any local minimizer is necessarily a global one. In this case, an algorithm can be constructed which decreases the functional value and tends to the (global) minimizer of F̃_1(x). One such method is called the iterative soft thresholding algorithm (ISTA) and relies on the soft thresholding function S_τ(x), defined componentwise as:

S_τ(x)_k = sgn(x_k) max(0, |x_k| − τ),   k = 1, …, n,   x ∈ R^n.

Figure 2.

Illustration of the family of functions y = |x|^p for p ∈ (0, 2] and sample solutions to a x_1 + b x_2 = c subject to min ‖x‖_p for p = 2, 1, 0.5.

The benefit of this function is two-fold: it explicitly sets small components of x to zero (promoting sparsity) and is continuous (unlike the related hard thresholding function which simply zeros out all components smaller than τ in absolute value). The soft thresholding function satisfies a useful identity:

S_τ(b) = argmin_x { ‖x − b‖_2^2 + 2τ‖x‖_1 },

which is utilized with a surrogate functional approach to construct the ISTA scheme:

x^{n+1} = S_τ( x^n + A^T b − A^T A x^n )

This algorithm converges to a minimizer of F̃_1 for any initial guess, provided ‖A‖_2 (the spectral norm of A) is less than 1 (which can be accomplished simply by rescaling). The spectral norm of a matrix A can easily be estimated using the so-called power iteration [1]. Let us assume that A is square; if it is not, we can use A^T A in its place and take the square root of the computed eigenvalue as the estimate of the spectral norm. If we take a vector x_0 and write it as a linear combination of eigenvectors of A, then x_0 = α_1 v_1 + … + α_n v_n. It follows that the iterative computation x_m = A x_{m−1} yields (plugging in A v_k = λ_k v_k):

x_m = α_1 λ_1^m v_1 + … + α_n λ_n^m v_n = λ_1^m [ α_1 v_1 + α_2 (λ_2/λ_1)^m v_2 + … + α_n (λ_n/λ_1)^m v_n ]  ⟹  lim_{m→∞} x_m/λ_1^m = α_1 v_1,

a scalar multiple of the dominant eigenvector. A simple computation yields the dominant eigenvalue. In practice, a much faster converging scheme called FISTA (Fast ISTA) [5] is utilized, which is a slight reformulation of ISTA applied to a linear combination of two previous iterates:

x^{n+1} = S_τ( y^n + A^T b − A^T A y^n ),   y^n = x^n + ((t_{n−1} − 1)/t_n)(x^n − x^{n−1}),   t_{n+1} = (1 + √(1 + 4 t_n^2))/2,   (12)

where t1 = 1. This algorithm is simple to implement and readily adapts to parallel architectures. For challenging problems (such as when the decay of singular values of A is rapid), the thresholding function Sτ can be slightly altered (see Figure 3), but the same iterative scheme (12) can be utilized. Two possible approaches are either to vary the thresholding, starting from soft thresholding and slowly approaching the (discontinuous) hard thresholding function, or to use a function which better mimics hard thresholding away from zero. The use of different thresholding functions alters the optimization problem being solved. Thresholding-based techniques are simple to implement but are not effective in all situations, particularly when only a few iterations are feasible (for example, when A is large). In this case, two interesting approaches are iteratively re-weighted least squares (IRLS) and convolution smoothing [4]. Both techniques replace the nonsmooth part of the functional (namely, the absolute value function |x|) by a smooth approximation. Moreover, both techniques have the particular advantage of being able to employ gradient-based methods (such as CG) at each iteration, considerably increasing the per-iteration performance. The IRLS approach is based on the approximation:

|x_k| = x_k^2/|x_k| = x_k^2/√(x_k^2) ≈ x_k^2/√(x_k^2 + ϵ^2)

where in the rightmost term a small ϵ ≠ 0 is used, to ensure the denominator remains nonzero regardless of the value of x_k. The resulting algorithm [4] for the minimization of (11) can be written as:

(A^T R_n A + D_n^T D_n) x^{n+1} = A^T R_n b

with two diagonal, iteration-dependent matrices D_n and R_n. The diagonal matrix D_n has elements such that D_n^T D_n = (λp/2) Diag(w_k^n), and R_n has diagonal elements l |r_i^n|^{l−2} (for indices i where |r_i^n| < ϵ, we can set the entry to l ϵ^{l−2}, with the choice of ϵ user controllable, tuned for a given application). Here, the residuals r_i^n = (A x^n − b)_i and the iteration-dependent weights are given by:

w_k^n = 1/( (x_k^n)^2 + ϵ_n^2 )^{(2−p)/2}.

Figure 3.

Illustrations of different thresholding functions [6, 7].

The diagonal matrices (or simply the vectors holding the diagonal elements) are updated at each iteration, and the system above can be solved approximately via a few iterations of CG- or LSQR-based algorithms. Another advantage of the IRLS approach is that the powers p can be made component dependent. This then allows for better inversion of partially sparse signals (if, of course, the location of the sparse part can be estimated with some accuracy). An example is illustrated in Figure 4 and further discussed in [8]. Another approach discussed in [4] is based on a smooth approximation of the absolute value function f(t) = |t| obtained via convolution with a sequence of Gaussian kernels of approximately shrinking support. The resulting “conv-CG” method is especially suitable for rapid warm-start acceleration.
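To make the thresholding iterations concrete, below is a minimal Octave sketch of the FISTA scheme (12) with soft thresholding (a hypothetical example: the problem sizes, the sparse test signal, the fixed threshold τ, and the iteration count are our own choices; A is rescaled so that ‖A‖_2 < 1):

A = randn(50, 200);  A = A / (norm(A) * 1.01);     % ensure ||A||_2 < 1
idx = randperm(200);  xs = zeros(200, 1);  xs(idx(1:10)) = randn(10, 1);  % sparse truth
b = A * xs + 1e-3 * randn(50, 1);
S = @(x, tau) sign(x) .* max(abs(x) - tau, 0);     % soft thresholding S_tau
tau = 1e-3;  x = zeros(200, 1);  xold = x;  t = 1;
for n = 1:500
  tnew = (1 + sqrt(1 + 4 * t^2)) / 2;
  y = x + ((t - 1) / tnew) * (x - xold);           % momentum combination of iterates
  xold = x;
  x = S(y + A' * (b - A * y), tau);                % thresholded gradient step
  t = tnew;
end
% x now approximates the sparse signal xs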

Figure 4.

Illustration of half-sparse, half-dense signal recovery with different algorithms.

5.4. Alternate penalty functions and regularization parameter selection

We mention here the classical image deconvolution problem. Given a blurring source g, such as a 2-D Gaussian function, we can produce a blurry image s from an unaltered source f via convolution, s = f ∗ g + n, where n is some additive noise component. For such situations, a TV (total variation) norm penalty is frequently used for purposes of noise removal [9]. For a 2-D signal (such as an image), the TV penalty can be written as V(s) = Σ_{i,j} √( (s_{i+1,j} − s_{i,j})^2 + (s_{i,j+1} − s_{i,j})^2 ). Sometimes, the alternate approximation Σ_{i,j} ( |s_{i+1,j} − s_{i,j}| + |s_{i,j+1} − s_{i,j}| ) is utilized. Various iterative schemes have been developed for such penalty functions [9].

Both the ℓ_2-based approaches and the sparsity-promoting regularization schemes (as well as TV-norm penalty functionals) utilize one or more regularization parameters. In the case of Tikhonov regularization with smoothing (as in (9)), more than one parameter is present. In this case, the second (smoothing) parameter λ_2 can generally be set according to the desired smoothing effect, and λ_1 can then be adjusted to achieve a desired solution norm. Thus, we focus here on techniques to adjust λ_1, which we simply refer to as λ.

The standard way to choose the parameter is to use a continuation strategy, starting at a large λ (generally a value close to ‖A^T b‖ is a good choice) and decreasing in logarithmic fashion using the logarithmically spaced sequence:

S = ( log λ_max − log λ_min )/(N − 1);   λ_i = exp( log λ_max − S(i − 1) ),   i = 1, …, N.

The parameter N can vary by application but is typically between 5 and 10. Two typical strategies for parameter selection are employed.

The first is based on a target residual value, typically determined by the estimate of the noise norm. At every λ after the initial value, we reuse the previous solution as the initial guess for the CG scheme at the current λ. We then use the solution x_λ for which ‖Ax_λ − b‖ is closest to the desired residual level (or refine further the solution at this λ with more CG iterations).

If, however, the target residual norm is not available, other techniques must be used. We discuss a method using the so-called L-curve, where for the norm damping problem (7) we plot a curve composed of the points ( log‖Ax_λ − b‖, log‖x_λ‖ ), which we can obtain using the same continuation procedure previously discussed. The curve represents the tradeoff between the residual value ‖Ax_λ − b‖ and the solution norm ‖x_λ‖. In practice, neither of these quantities should dominate over the other. Hence, an established strategy is to look for the point of maximum curvature along the L-curve [11]. If we set:

ϵ̄ = log ‖x_λ‖_p   and   ρ̄ = log ‖Ax_λ − b‖_l,   (13)

where xλ is the solution of (11) at the particular value of λ. We can then compute the curvature by the formula:

c̄(λ) = 2 ( ρ̄′ ϵ̄″ − ρ̄″ ϵ̄′ ) / ( (ρ̄′)^2 + (ϵ̄′)^2 )^{3/2},   (14)

where the derivative quantities (with respect to λ) can be approximated via finite differences. We illustrate various plots for a synthetic example in Figure 5. In the residual plot, the target residual is taken to be the magnitude of the noise vector norm. We can also see that the lowest percent error between x̄_λ and the true x occurs at a value of λ roughly corresponding to the highest curvature of the L-curve. In fact, for this example, the curvature method gives a better estimate of a good λ than the residual curve technique.
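A minimal Octave sketch of this parameter sweep and curvature computation (a hypothetical example: A and b are assumed given, solve_tikhonov is a placeholder for any solver of (8) at a fixed λ, and the derivatives in (14) are approximated by crude finite differences with respect to log λ):

lmax = norm(A' * b);  lmin = 1e-4 * lmax;  N = 20;
lambdas = exp(linspace(log(lmax), log(lmin), N));
for i = 1:N
  x = solve_tikhonov(A, b, lambdas(i));   % placeholder solver for (8)
  rho(i) = log(norm(A * x - b));
  eps_(i) = log(norm(x));
end
t = log(lambdas);                         % parametrize the curve by log(lambda)
d1r = gradient(rho, t);   d2r = gradient(d1r, t);
d1e = gradient(eps_, t);  d2e = gradient(d1e, t);
kappa = 2 * (d1r .* d2e - d2r .* d1e) ./ (d1r.^2 + d1e.^2).^1.5;
[~, ibest] = max(kappa);                  % lambda of maximum curvature
lambda_star = lambdas(ibest);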

Figure 5.

Regularization parameter picking. Set 1: Residuals and percent errors vs. λ fraction (fraction of ‖A^T b‖). Set 2: L-curve and curvature curve as a function of λ fraction.


6. Nonlinear least squares (NLS) problems

In many cases, the inverse problem may be posed in terms of a nonlinear function F(x, t), with x a vector of variables and t a parameter (e.g., time). We first describe here the popular Gauss-Newton method for NLS [1]. Let g(x) = (1/2)‖r(x)‖^2 with r_i(x) = y_i − F(x, t_i). Then, the NLS problem takes the form x̄ = argmin_x g(x). Setting ∇g(x) = 0, Newton's method yields:

x^{n+1} = x^n − ( ∇^2 g(x^n) )^{−1} ∇g(x^n)

Expanding the gradient and Hessian of g yields:

∇g(x) = Σ_{i=1}^m r_i(x) ∇r_i(x) = J^T r(x),   where J = J_r(x),
∇^2 g(x) = Σ_{i=1}^m ∇r_i(x) ∇r_i(x)^T + Σ_{i=1}^m r_i(x) ∇^2 r_i(x) = J^T J + T(x) ≈ J^T J,

where T(x) = Σ_{i=1}^m r_i(x) ∇^2 r_i(x) and (J_r(x))_{i,:} = ∇r_i(x)^T = −∇_x F(x, t_i)^T. The approximation T(x) ≈ 0 is used in the expression for the Hessian in the Gauss-Newton method, yielding a simple iterative scheme:

x^{n+1} = x^n − ( J_n^T J_n )^{−1} J_n^T r^n.

Unfortunately, this method is not stable and will typically not converge if initialized far away from a minimum solution [1]. Improvements include the introduction of a step size parameter:

x^{n+1} = x^n − α_n ( J_n^T J_n )^{−1} J_n^T r^n,   α_n = argmin_α g( x^n − α s^n )   with   J_n^T J_n s^n = J_n^T r^n,

and the use of a regularizer (e.g., the Levenberg-Marquardt method [1]), where the system J_n^T J_n y = J_n^T r^n is replaced by an ℓ_2-norm penalty regularized system, ( J_n^T J_n + λI ) ỹ = J_n^T r^n.
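Below is a minimal Octave sketch of such a damped (Levenberg-Marquardt style) Gauss-Newton iteration for a small exponential fitting example; the model F(x, t) = x_1 exp(x_2 t), the synthetic data, the initial guess, and the fixed damping λ are all our own illustrative assumptions:

t = linspace(0, 1, 50)';  y = 2 * exp(-1.5 * t) + 0.01 * randn(50, 1);
F  = @(x, t) x(1) * exp(x(2) * t);                             % model
Jr = @(x, t) [-exp(x(2) * t), -x(1) * t .* exp(x(2) * t)];     % Jacobian of r = y - F
x = [1; -1];  lambda = 1e-3;                                   % initial guess, damping
for n = 1:50
  r  = y - F(x, t);
  J  = Jr(x, t);
  dx = (J' * J + lambda * eye(2)) \ (J' * r);                  % damped Gauss-Newton step
  x  = x - dx;
end
% x should approach the true parameters [2; -1.5]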


7. Low-rank matrix factorizations

In many applications, one encounters large matrices with rapidly decaying singular values. In such cases, low-rank matrix approximations like the low-rank SVD are useful for compression, speed gains, and data analysis purposes. For A ∈ R^{m×n}, the low-rank SVD of rank k (with k < min(m, n)) is the optimal rank-k approximation of A in the spectral and Frobenius norms. Taking p = min(m, n), we define the low-rank SVD A_k of rank k by taking into account only the first k < p singular values and vectors: that is, with U_k ∈ R^{m×k} consisting of the first k columns of U, Σ_k = Diag(σ_1, …, σ_k) ∈ R^{k×k} consisting of the first k rows and columns of Σ, and V_k ∈ R^{n×k} consisting of the first k columns of V:

A_k = Σ_{j=1}^k σ_j u_j v_j^T = U_k Σ_k V_k^T,   (15)
with U_k = [u_1, u_2, …, u_k],  V_k = [v_1, v_2, …, v_k],  and  Σ_k = Diag(σ_1, σ_2, …, σ_k).

By the Eckart-Young theorem [12]:

‖A − A_k‖ = σ_{k+1},

when the error is measured in the spectral norm, and

‖A − A_k‖_F = ( Σ_{j=k+1}^p σ_j^2 )^{1/2}

in the Frobenius norm. When k ≪ p, the matrices U_k, Σ_k, and V_k are significantly smaller (cost of storage of all nonzeros is mk + nk + k) than the corresponding full SVD matrices U, Σ, and V (cost of storage is mp + np + p) and than A itself (cost of storage is mn, or some fraction of this if A is sparse). While the construction of A_k is expensive (requiring in most cases the SVD of A), it can be approximated very accurately via randomized algorithms, which require only the SVD of a smaller matrix. Various randomized algorithms for constructing the low-rank SVD and related factorizations are described in [13]. Techniques for computing the low-rank SVD of a matrix rely on a simple principle. An orthonormal matrix Q ∈ R^{m×r} (with r = k + l, where l is a small oversampling parameter, e.g., l = 5) is computed such that QQ^T A ≈ A. If in fact r is large enough so that QQ^T A = A, the range of A is a subset of the range of Q. Thus, when QQ^T A ≈ A, we expect the range of Q to capture a good portion of the range of A, a statement which can be made rigorous with some analysis. In this case, we form the smaller matrix B = Q^T A, where B ∈ R^{r×n}, possibly much smaller than the m×n matrix A. Instead of performing the SVD on A, we can obtain the SVD of B = ŨΣV^T. If A ≈ QQ^T A = QB, then A ≈ (QŨ)ΣV^T, and the latter forms a low-rank SVD approximation of A (if we only take the first k singular vectors and values of the corresponding factorization). Notice that when A is rectangular, the eigendecomposition of the matrices BB^T or B^T B can be used to construct the approximate low-rank approximation of A.

A separate problem is the construction of a suitable matrix Q from A. Again, the idea is to construct a Q with orthonormal columns that is as small (in terms of column count) as possible, such that Q captures a good portion of the range of A. When A is a matrix of known rank k, then (in Octave notation) simply setting [Q, ~] = qr(AΩ, 0), where Ω ∈ R^{n×(k+l)} is a GIID matrix with l a small oversampling parameter, produces a valid matrix Q for projection. When the tail singular values (those smaller than σ_k) are still expected to be significant, a power sampling scheme turns out to be effective. Instead of setting Y = AΩ and performing a QR factorization of Y, we use the matrix Y = (AA^T)^q AΩ with q ≥ 1. Plugging in the SVD of A, we obtain (AA^T)^q A = UΣ^{2q+1}V^T, which has the same singular vectors as A but much faster decaying singular values. Care must be taken when taking powers of matrices, to prevent loss of accuracy in finite precision arithmetic (in practice, intermediate orthonormalizations are used). However, when the rank of the matrix A is not known, it is hard to use this approach, since the optimal size of the random matrix Ω would not be known. In this situation, a blocked algorithm can be employed [14], where on output, with a user-supplied parameter ϵ > 0, an orthonormal matrix Q and a matrix B are produced such that ‖QB − A‖ < ϵ, where B = Q^T A. Then, any number of standard low-rank matrix factorizations can be computed by operating on the matrix B instead of A. The basic steps of the proposed algorithm are given in Figure 6. We note that the resulting Q matrix can also be utilized for purposes of model reduction: one can use the reduced linear system Q^T Ax = Q^T b as an approximation to the full system Ax = b (or replace A and b with the projected values in the least squares formulation), which has applications to, e.g., accelerating image deblurring. The construction of Q for large matrices can in practice be done in parallel by employing the algorithm in [13] over row blocks of the matrix, as illustrated in Figure 6.
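A minimal Octave sketch of the randomized low-rank SVD just described, with a fixed target rank k, oversampling l, and q power iterations (a hypothetical helper function, not the blocked, adaptive algorithm of Figure 6 / [14]):

function [Uk, Sk, Vk] = rsvd (A, k, l, q)
  % Randomized low-rank SVD sketch: A is approximated by Uk * Sk * Vk'.
  n = size(A, 2);
  Om = randn(n, k + l);              % GIID sampling matrix
  [Q, ~] = qr(A * Om, 0);
  for i = 1:q                        % power iterations with re-orthonormalization
    [Q, ~] = qr(A' * Q, 0);
    [Q, ~] = qr(A * Q, 0);
  end
  B = Q' * A;                        % small (k+l) x n matrix
  [Ub, Sb, Vb] = svd(B, 'econ');     % SVD of the small matrix
  Uk = Q * Ub(:, 1:k);  Sk = Sb(1:k, 1:k);  Vk = Vb(:, 1:k);
end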

Figure 6.

A blocked and adaptive version of the accuracy enhanced QB algorithm proposed in [14].

The rank-k SVD A_k = U_k Σ_k V_k^T of a general m×n matrix A yields an optimal approximation of rank k to A, both in the operator (spectral) and Frobenius norms. On the other hand, even if A is a sparse matrix, the m×k and n×k factors U_k and V_k are typically dense. This means that if the matrix is approximately p percent filled, it will have approximately N = (p/100) m n nonzeros, while the rank-k SVD will consist of approximately mk + k + nk nonzeros. For growing rank k, this quantity will quickly approach and even exceed N. Thus, even though the low-rank SVD is optimal for a given rank k, the choice of rank may be limited to relatively low values with respect to min(m, n) for sparse matrices, in order to achieve any useful compression ratios. (Of course, the usefulness of the low-rank SVD representation is not limited to compression; it is useful, e.g., for low-dimensional data projections; but the utility of a low-rank approximation is greatly reduced once the storage size of the factors exceeds that of the original matrix.) Yet another aspect of the SVD which may be problematic is the difficulty of interpreting the singular vectors in U_k and V_k. While in many applications these have distinct meanings, they are often not easy to interpret for a particular data set.

It is thus plausible, in the above context, to look for factorizations which may not be optimal for rank k, but which may preserve useful properties of A such as sparsity and non-negativity, as well as allow easier interpretation of its components. Such properties may be found in the one- and two-sided interpolative decompositions and the CUR decomposition based on the pivoted QR decomposition. If we stop the QR procedure after the first k iterations, we obtain:

A(:, J_c) = [ Q_1 | Q_2 ] [ S_1 ; S_2 ] = Q_1 S_1 + Q_2 S_2,   with Q_1 of size m×k and Q_2 of size m×(r−k),   (16)
S_1 = [ S_11 | S_12 ]   and   S_2 = [ 0 | S_22 ],   with S_11 of size k×k, S_12 of size k×(n−k), and S_22 of size (r−k)×(n−k),   (17)
i.e.,   S = [ S_11  S_12 ; 0  S_22 ],   (18)
so that   A(:, J_c) = Q_1 [ S_11 | S_12 ] + Q_2 [ 0 | S_22 ] = [ Q_1 S_11 | Q_1 S_12 + Q_2 S_22 ].

From this formulation, we set

C ≡ A(:, J_c(1:k)) = Q_1 S_11.   Then   Q_1 S_1 = [ Q_1 S_11 | Q_1 S_12 ] = Q_1 S_11 [ I_k | S_11^{−1} S_12 ] = C [ I_k | T_l ],

where T_l is the solution to the matrix equation S_11 T_l = S_12 (which, if solved for T_l a column at a time, is simply a set of linear systems). It follows that we can write:

A ≈ CV^T,   where   V^T = [ I_k | T_l ] P^T.   (19)

The one-sided ID of rank k is the approximate factorization:

A ≈ A(:, J_c(1:k)) V^T,   with A of size m×n, A(:, J_c(1:k)) of size m×k, and V^T of size k×n,   (20)

where we use a partial column skeleton C = A(:, J_c(1:k)), a subset of the columns of A, and V is a well-conditioned matrix. Clearly, C simply represents a subset of the columns of A chosen based on the pivoting strategy used in the QR factorization. The typical pivoting strategy is to choose the permutation matrix P (which simply dictates the re-arrangement of the columns of A) such that, if S_22 above is omitted, yielding:

AP ≈ Q_1 [ S_11 | S_12 ] + Q_2 [ 0 | 0 ] = Q_1 S̃,

then the diagonal components of S̃ satisfy |s̃_11| ≥ |s̃_22| ≥ … ≥ |s̃_kk|. Several other pivoting strategies can be employed, and each will yield a somewhat different re-arrangement of the columns of A.
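A minimal Octave sketch of this rank-k one-sided ID via the pivoted QR, following (19)-(20) (a hypothetical example: A and the target rank k are assumed given, with k below the numerical rank of A):

[Q, S, J] = qr(A, 0);                % column-pivoted QR: A(:,J) = Q*S
C  = A(:, J(1:k));                   % column skeleton, C = Q1*S11
Tl = S(1:k, 1:k) \ S(1:k, k+1:end);  % solve S11*Tl = S12
Vt = zeros(k, size(A, 2));
Vt(:, J) = [eye(k), Tl];             % V' = [I_k  Tl] P', applied via the index vector
norm(A - C * Vt) / norm(A)           % relative approximation error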

Once the single-sided ID is defined, the two-sided ID can be constructed simply by obtaining a one-sided ID of A and one of A^T. The set of selected columns of A^T obtained by this procedure corresponds to a set of selected rows of A. Thus, we can write the two-sided ID of rank k as:

A ≈ W A(J_r(1:k), J_c(1:k)) V^T,   with W of size m×k, A(J_r(1:k), J_c(1:k)) of size k×k, and V^T of size k×n.   (21)

The procedure for the construction of the interpolative decompositions can be accelerated by means of randomization, just like for the low-rank SVD. This is possible by virtue of the result below [13].

Lemma 1. Let Ω̃ ∈ R^{l×m} be a matrix with GIID entries. Then, for any a ∈ R^m, we have E[ ‖Ω̃a‖^2/‖a‖^2 ] = l and Var[ ‖Ω̃a‖^2/‖a‖^2 ] = 2l.

Suppose A is m×n and we draw an l×m GIID matrix Ω̃, and then form the l×n matrix Z = Ω̃A. Then E[ ‖Z(:, j)‖^2/‖A(:, j)‖^2 ] = l. As the pivoting result depends heavily on the ratios of the individual column norms of A with respect to one another, the above result tells us that these ratios are roughly preserved in the matrix resulting from the multiplication of the original matrix by a Gaussian random matrix from the left. As the product matrix consists of fewer rows than the original matrix, the pivoted QR factorization is correspondingly cheaper to perform on the product matrix Z than on A, while the resulting permutation matrix (really, the re-arrangement vector) will be similar in both cases.

The two-sided ID allows us to construct the popular column/row skeleton CUR decomposition of rank k:

A ≈ CUR,   with C of size m×k, U of size k×k, and R of size k×n.   (22)

Suppose we compute a two-sided rank-k ID factorization, forming the k×k column/row skeleton A(J_r(1:k), J_c(1:k)). Set:

C = A(:, J_c(1:k))   and   R = A(J_r(1:k), :).

We then take these as the factors C and R in the CUR decomposition:

CUR = A(:, J_c(1:k)) U A(J_r(1:k), :) ≈ A(:, J_c(1:k)) V^T,   (23)

where we take U to satisfy the system UR = V^T. In Figure 7, we compare the relative errors obtained with different approximations at the same rank. For matrices with mild singular value decay, the low-rank SVD obtained via a randomized scheme (with oversampling) gives performance significantly closer to optimal (i.e., to the truncated SVD) than the other decompositions.

Figure 7.

Relative errors for RSVD, ID, and CUR decompositions of rank k for matrices with two different rates of singular value decay (slower, faster).


8. An introduction to Backus-Gilbert inversion

As previously mentioned, damped least-squares (DLS) techniques are commonly exploited to solve linear, discrete inversion problems, such as those encountered in seismic tomography [15, 16]. To break the nonuniqueness of the least-squares solution, DLS inversion schemes often rely on ad hoc regularization strategies (e.g., model norm damping or smoothing), designed to subjectively favor model simplicity. However, in regions of poor seismic data coverage, DLS methods may lead to locally biased model solutions, potentially causing model misinterpretations [17]. In other words, DLS models may represent “biased averages” over the true-model parameters. Most geotomographical studies suffer from uneven data coverage and thus are affected by these averaging bias effects. For example, teleseismic body-wave ray-paths irregularly sample the Earth's interior, because earthquakes typically are concentrated along oceanic ridges or subduction zones while seismometers are located over continental areas or on oceanic islands.

A fundamentally different approach is that of linear Backus-Gilbert inversion [18, 19, 20], which belongs to the class of optimally localized averages (OLA) methods. In the discrete version of the Backus-Gilbert theory, one aims at evaluating (weighted) averages of the true-model parameters. That is, the Backus-Gilbert method seeks to determine unbiased model estimates. Over the past half century, many authors have considered that, in addition to being computationally very intensive, it could be a clumsy affair in the presence of data errors to practically apply the Backus-Gilbert method to (large-scale) tomographic applications [15, 21, 22, 23]. In the following, we aim to describe a recently developed, and (finally!) computationally tractable, tomographic approach [10] based on the Backus-Gilbert philosophy.

The SOLA (subtractive optimally localized averages) method [24, 25] is a variant of the original Backus-Gilbert approach, which has been exploited to solve helioseismic inversion problems [26, 27]. As a remark, Pijpers and Thompson [24] termed this alternative the SOLA method, though it may have been rediscovered independently by different authors [28, 29]. SOLA retains all the advantages of the original Backus-Gilbert method, but is much more computationally efficient and versatile in the construction of resolving (averaging) kernels. Recently, SOLA has been introduced and adapted to large-scale, linear and discrete “tomographic” problems by Zaroli [10]. We now briefly review the SOLA inversion scheme, tailored to seismic tomography.

In this section, let us slightly change the notation for linear inverse problems, to stay closer to that preferred in the geosciences community [10, 17]. Let us consider linear, discrete forward problems of the form:

d = Gm + n,   (24)

where d = (d_i)_{1≤i≤N} denotes the data, G = (G_ij)_{1≤i≤N, 1≤j≤M} the sensitivity matrix, m = (m_j)_{1≤j≤M} the true-model parameters, and n = (n_i)_{1≤i≤N} the noise. The sensitivity matrix elements are the partial derivatives of the data with respect to the model parameters: G_ij = ∂d_i/∂m_j. Typically, in “large-scale” tomographic studies, one may have to deal with M ∼ 10^5 model parameters and N ∼ 10^6 data. Let us consider, without loss of generality, that the data are time-residuals, the model parameters are velocity anomalies, the model space is parametrized using regular-size cells (local and “orthonormal” parameterization), the noise is randomly drawn from a normal distribution N(0, σ_n), and the data covariance matrix is C_d = σ_n^2 I_N. For local and “irregular” parametrizations, the reader is referred to [10]. It is common practice to normalize both the data and the sensitivity matrix by the data errors; thus C_d = I_N.

One aims to find a model estimate, m̂, that can be expressed as a linear combination of the data:

m̂ = Ĝd,

where the matrix Ĝ denotes some generalized inverse. The model estimate can be decomposed as

m̂ (model estimate) = R̂m (filtered true model) + Ĝn (propagated noise),   (25)

where

R̂ = ĜG,   (26)

is often referred to as the model resolution matrix. The first term in the right member of (25), R̂m, represents the filtered true model, and shows our inability, if R̂ ≠ I_M, to perfectly recover the true model. Here, we refer to the k-th row of the resolution matrix, R̂_{k·} = (R̂_kj)_{1≤j≤M}, as the resolving kernel that linearly relates the k-th parameter estimate, m̂_k, to the true-model parameters:

m̂_k = Σ_{j=1}^M R̂_kj m_j   (ignoring the term of propagated noise).   (27)

Therefore, we wish R̂m to represent an unbiased averaging over the true-model parameters, m. This means that, for any parameter index k ∈ [1, M], we wish R̂_{k·} to be non-negative and to satisfy

Σ_{j=1}^M R̂_kj = 1.   (28)

The second term in the right member of (25), Ĝn, denotes the propagated noise (i.e., the propagation of data errors) into the model estimate. Robust model interpretations require accurate appraisals of model estimates, that is, computing and carefully analyzing both R̂ and the model covariance matrix

C_m̂ = Ĝ C_d Ĝ^T.   (29)

As a remark, for DLS models this would also mean quantifying averaging bias effects (if any), see [17]. The model estimate m̂, resolution R̂, and covariance C_m̂ can all be inferred from the generalized inverse Ĝ; efficiently computing the full generalized inverse is then crucial for any linear inverse problem. As we shall see, in the “SOLA Backus-Gilbert” approach the generalized inverse is directly determined.

The original Backus-Gilbert scheme consists in constructing the most peak-shaped resolving kernel (peaked around each model-parameter location), while moderating as much as possible the propagated noise into the model estimate. The key idea in the SOLA method is to specify an a priori “target form” for each resolving (averaging) kernel. One needs to specify M target resolving-kernels (hereafter, target kernels) such that their spatial extent represents some a priori estimate of the spatial resolving-length (around each parameter location). As an example, for 2-D tomographic studies the simplest target form could be circular (isotropic resolving-length); each target kernel would be constant inside such a circle and zero outside. Rather than minimizing the spread of each resolving kernel, as in the original Backus-Gilbert formulation, in the SOLA approach one aims at minimizing the integrated squared difference between each resolving kernel and its associated target kernel. Each row of the SOLA generalized inverse is individually computed by solving a specific minimization problem; the full computation of Ĝ is therefore embarrassingly parallel. The k-th row, Ĝ_{k·} = (Ĝ_ki)_{1≤i≤N}, is found such that:

min_{Ĝ_{k·}}   Σ_{j=1}^M ( R̂_kj − T_j^(k) )^2  (resolution misfit)  +  η_k^2 σ_{m̂_k}^2  (model variance),   s.t.   Σ_{j=1}^M R̂_kj = 1,   (30)

where η_k and t^(k) = (T_j^(k))_{1≤j≤M} are the k-th tradeoff parameter (resolution misfit versus model variance) and target resolving-kernel vector, respectively; k is the index of the considered model parameter. Because of the additional constraint in (30), the k-th parameter estimate, m̂_k, is expected to be unbiased (provided that its corresponding resolving kernel is (mostly) non-negative), and so is the model estimate m̂. Though not strictly necessary, here all M target kernels are imposed to be unimodular:

Σ_{j=1}^M T_j^(k) = 1,   ∀ k ∈ [1, M].   (31)

The system to be solved for the k-th row of the SOLA generalized inverse then reads as follows:

( GG^T + η_k^2 I_N ) Ĝ_{k·} = G t^(k),   s.t.   Σ_{j=1}^M Σ_{i=1}^N Ĝ_ki G_ij = 1.   (32)

As a remark, since only a single (k-th) parameter index is treated at a time in (32), it could be difficult to ensure that all M selected values of the tradeoff parameters (η_k) lead to “globally coherent” model solutions. However, it seems [10, 17] that globally coherent tomographic images can be obtained when using: (1) target kernels whose size is tuned to the spatially irregular data coverage (for instance, using seismic ray-path density as a proxy for the spatial variations of the local resolving-length); and (2) constant-valued tradeoff parameters, that is:

η_k = η,   ∀ k ∈ [1, M].   (33)

In practice, it seems that η may (roughly) be determined from analyzing a few tradeoff curves between Σ_j ( R̂_kj − T_j^(k) )^2 and σ_{m̂_k}^2, for some randomly chosen parameter indices k. Let us now define the following quantities [10, 17, 30]:

x^(k) = (x_i^(k))_{1≤i≤N}   with   x_i^(k) = Ĝ_ki,   x̂^(k) = (x_i^(k))_{2≤i≤N},
c_i = Σ_{j=1}^M G_ij,   c = (c_i)_{1≤i≤N},   ĉ = (c_i/c_1)_{2≤i≤N},   e_1 = (δ_{i1})_{1≤i≤N},
B = [ −ĉ^T ; I_{N−1} ],   Q_η = [ G^T B ; η ĉ^T ],   y^(k)(η) = [ t^(k) − c_1^{−1} G^T e_1 ; c_1^{−1} η ],   (34)

where c_1 is assumed to be nonzero, δ denotes the Kronecker symbol, and [ · ; · ] denotes vertical stacking. Solving (32) therefore consists in solving for x̂^(k), in the least-squares sense, the system:

[ Q_η ; η I_{N−1} ] x̂^(k) = [ y^(k)(η) ; 0_{N−1} ],   (35)

using, for instance, the LSQR algorithm [31], and then inferring the final solution x^(k) (i.e., the k-th row of the SOLA generalized inverse) from x̂^(k) such that:

x^(k) = B x̂^(k) + c_1^{−1} e_1.   (36)
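To fix ideas, below is a minimal Octave sketch of this SOLA machinery for a single row index k of the generalized inverse (a hypothetical example under our own simplifying assumptions: G, the data d, the target vector tk, and the tradeoff parameter eta are given; B is formed densely, which is only reasonable for small synthetic sizes; and the least-squares system (35) is solved with backslash rather than LSQR [31]):

[N, M] = size(G);
c  = sum(G, 2);                   % c_i = sum_j G_ij
ch = c(2:N) / c(1);               % chat
e1 = [1; zeros(N - 1, 1)];
B  = [-ch'; eye(N - 1)];          % N x (N-1)
Qe = [G' * B; eta * ch'];         % (M+1) x (N-1) matrix Q_eta
yk = [tk - G' * e1 / c(1); eta / c(1)];
xh = [Qe; eta * eye(N - 1)] \ [yk; zeros(N - 1, 1)];  % least-squares solve of (35)
xk = B * xh + e1 / c(1);          % k-th row of the generalized inverse, as in (36)
mk = xk' * d;                     % k-th parameter estimate
Rk = (G' * xk)';                  % k-th resolving kernel; sum(Rk) ~ 1 by construction
sk = norm(xk);                    % model uncertainty for parameter k, cf. (37)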

Last but not least, we now discuss the computational efficiency of the SOLA approach for computing the full generalized inverse (see [10]). First, the rows of the generalized inverse matrix can be computed in parallel on P processors, so that computing all M rows takes roughly t × M/P CPU-time, where t is the average CPU-time needed to numerically solve (35). A crucial point is that the matrix Q_η, of size (M + 1) × (N − 1), does not depend on the parameter index k, so that it does not need to be recomputed M times, as was required in the original Backus-Gilbert approach (see [10]). The vector y^(k)(η) has to be recomputed M times, but that task is computationally cheap. Q_η and y^(k)(η) can easily be reconstructed if one aims at investigating different η values (only the last row of Q_η and the last element of y^(k)(η) depend on η). Finally, simply re-ordering the rows of the sensitivity matrix G (and the corresponding data), such that the first row of G is the sparsest one, allows the matrix Q_η to be almost as sparse as G; this sparsity property is very useful when solving (35), in terms of storage, efficiency of the LSQR algorithm, and memory footprint.

Figure 8 shows an example of the SOLA method applied to global-scale seismic tomography [10], for which there are M = 38,125 model parameters and N = 79,765 data (teleseismic shear-wave time-residuals). Tomographic images represent isotropic, 3-D shear-wave velocity perturbations within the whole Earth's mantle (with respect to some reference, radial absolute velocity model). Figure 8a and b displays the tomographic model m̂, at about 600 km depth, and its uncertainty σ_m̂, computed as

σ_m̂ = (σ_{m̂_k})_{1≤k≤M},   σ_{m̂_k} = ( Σ_{i=1}^N Ĝ_ki^2 )^{1/2},   (37)

(since the data are normalized by their errors), respectively. The form of each target kernel is that of a 3-D spheroid, corresponding to a priori lateral and radial resolving lengths that may be expected locally, at best, given the data coverage. Let us focus on the k′-th model parameter, marked by a green dot in Figure 8; a zoom-in on the tomographic model is shown in Figure 8c. Horizontal (600 km depth) and vertical cross-sections through the k′-th target kernel are displayed in Figure 8d and e, respectively. The corresponding k′-th resolving (averaging) kernel is similarly displayed in Figure 8f and g.

Figure 8.

Example of a global geotomographical model and its associated resolution and uncertainty, obtained from using a “SOLA Backus-Gilbert” inversion approach [10]. (a) Model estimate, m̂, shown at 600 km depth; (c) model uncertainty, σm̂, shown at 600 km depth; (b) zoom-in on m̂ (600 km depth) around the k′-th parameter location, i.e., the green dot; (d, e) and (f, g) horizontal (600 km depth) and vertical cross-sections through the k′-th target (spheroid shape) and averaging kernels, respectively.

Finally, the “SOLA Backus-Gilbert” approach, introduced and adapted to large-scale, linear, discrete tomographic problems by Zaroli [10], allows one to efficiently compute unbiased models, including their full resolution and covariance, enabling quantitative model interpretations [17].


9. Conclusion

In this work, we have presented several techniques useful to the practitioner in the field of inverse problems, with the aim of giving an idea of when and how these techniques should be employed for various linear and nonlinear applications. We have discussed techniques such as sparse matrix storage, the use of pivoted factorizations for direct solves, ℓ_2, ℓ_1, and intermediate penalty-based regularization strategies, nonlinear least squares problems, the construction and use of low-rank factorizations, and an application of the Backus-Gilbert inversion approach tailored to seismic tomography.

References

  1. Golub GH, Van Loan CF. Matrix Computations. Vol. 3. JHU Press; 2012
  2. Bell T, McKenzie B. Compression of sparse matrices by arithmetic coding. In: Data Compression Conference, 1998 (DCC'98) Proceedings; IEEE; 1998. pp. 23-32
  3. Candès EJ, Wakin MB. An introduction to compressive sampling. IEEE Signal Processing Magazine. 2008;25(2):21-30
  4. Voronin S, Zaroli C, Cuntoor NP. Conjugate gradient based acceleration for inverse problems. GEM-International Journal on Geomathematics. 2017;8(2):219-239
  5. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences. 2009;2(1):183-202
  6. Voronin S, Woerdeman HJ. A new iterative firm-thresholding algorithm for inverse problems with sparsity constraints. Applied and Computational Harmonic Analysis. 2013;35(1):151-164
  7. Voronin S, Chartrand R. A new generalized thresholding algorithm for inverse problems with sparsity constraints. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013; IEEE; 2013. pp. 1636-1640
  8. Voronin S, Daubechies I. An iteratively reweighted least squares algorithm for sparse regularization. arXiv preprint: arXiv:1511.08970. 2015
  9. Wang Y, Yang J, Yin W, Zhang Y. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences. 2008;1(3):248-272
  10. Zaroli C. Global seismic tomography using Backus-Gilbert inversion. Geophysical Journal International. 2016;207(2):876-888
  11. Hansen PC. The L-Curve and its Use in the Numerical Treatment of Inverse Problems. IMM, Department of Mathematical Modelling, Technical University of Denmark; 1999
  12. Eckart C, Young G. A principal axis transformation for non-Hermitian matrices. Bulletin of the American Mathematical Society. 1939;45(2):118-121
  13. Voronin S, Martinsson P-G. RSVDPACK: Subroutines for computing partial singular value decompositions via randomized sampling on single core, multi core, and GPU architectures. arXiv preprint: arXiv:1502.05366. 2015
  14. Martinsson P-G, Voronin S. A randomized blocked algorithm for efficiently computing rank-revealing factorizations of matrices. SIAM Journal on Scientific Computing. 2016;38(5):S485-S507
  15. Aster RC, Borchers B, Thurber C. Parameter Estimation and Inverse Problems. Revised ed. Elsevier; 2012
  16. Nolet G. A Breviary of Seismic Tomography. Cambridge, UK: Cambridge University Press; 2008
  17. Zaroli C, Koelemeijer P, Lambotte S. Toward seeing the Earth's interior through unbiased tomographic lenses. Geophysical Research Letters. 2017;44(22):11399-11408. DOI: 10.1002/2017GL074996
  18. Backus G, Gilbert JF. Numerical applications of a formalism for geophysical inverse problems. Geophysical Journal of the Royal Astronomical Society. 1967;13:247-276
  19. Backus G, Gilbert JF. The resolving power of gross earth data. Geophysical Journal of the Royal Astronomical Society. 1968;16:169-205
  20. Backus G, Gilbert JF. Uniqueness in the inversion of inaccurate gross earth data. Philosophical Transactions of the Royal Society A. 1970;266(1173)
  21. Menke W. Geophysical Data Analysis: Discrete Inverse Theory. Revised ed. San Diego: Academic Press; 1989
  22. Parker RL. Geophysical Inverse Theory. Princeton: Princeton University Press; 1994
  23. Trampert J. Global seismic tomography: The inverse problem and beyond. Inverse Problems. 1998;14:371-385
  24. Pijpers FP, Thompson MJ. Faster formulations of the optimally localized averages method for helioseismic inversions. Astronomy and Astrophysics. 1992;262:L33-L36
  25. Pijpers FP, Thompson MJ. The SOLA method for helioseismic inversion. Astronomy and Astrophysics. 1994;281:231-240
  26. Jackiewicz J, Birch AC, Gizon L, Hanasoge SM, Hohage T, Ruffio J-B, Svanda M. Multichannel three-dimensional SOLA inversion for local helioseismology. Solar Physics. 2012;276:19-33
  27. Rabello-Soares MC, Basu S, Christensen-Dalsgaard J. On the choice of parameters in solar-structure inversion. Monthly Notices of the Royal Astronomical Society. 1999;309:35-47
  28. Larsen RM, Hansen PC. Efficient implementations of the SOLA mollifier method. Astronomy and Astrophysics Supplement Series. 1997;121:587-598
  29. Louis AK, Maass P. A mollifier method for linear operator equations of the first kind. Inverse Problems. 1990;6:427-490
  30. Nolet G. Solving or resolving inadequate and noisy tomographic systems. Journal of Computational Physics. 1985;61:463-482
  31. Paige CC, Saunders MA. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software. 1982;8:43-71
