Inverse problems occur in a wide range of scientific applications, such as in the fields of signal processing, medical imaging, or geophysics. This work aims to present to the field practitioners, in an accessible and concise way, several established and newer cutting-edge computational methods used in the field of inverse problems—and when and how these techniques should be employed.
- inverse problems
- matrix factorizations
- parameter estimation
- model appraisal
- seismic tomography
In this work, we aim to survey several techniques useful to a practitioner in the field of inverse problems, where the solution to a vector of interest is given through a linear system or through a set of nonlinear equations . In our presentation below, we review both classical results and newer approaches, which the reader may not be familiar with. In particular, this chapter offers entries on the following material:
Matrix factorizations and sparse matrices
Direct solves and pivoted factorizations
Least squares problems and regularization
Nonlinear least squares problems
Low-rank matrix factorizations and randomized algorithms
An introduction to Backus-Gilbert inversion
In this chapter, we use the norm to refer to the spectral or operator norm, and to refer to the norm. We make frequent use of the QR decomposition and the SVD (singular value decomposition). For any matrix A and (ordered) subindex sets Jr and Jc, A(Jr, Jc) denotes the submatrix of A obtained by extracting the rows and columns of A indexed by Jr and Jc, respectively; and denotes the submatrix of A obtained by extracting the columns of A indexed by Jc. We will make use of the covariance and the variance matrix, which we define as follows in terms of the expected value:
By “Diag,” we refer to a diagonal matrix with nonzeros only on the diagonal. We make use of the so-called GIID matrices. These are matrices with independent and identically distributed draws from the Gaussian distribution. In the Octave environment (which we frequently reference), these can be obtained with the “randn” command. We assume the use of real matrices, although most techniques we describe extend to the complex case.
3. Matrix factorizations and sparse matrices
Let A be an matrix with real or complex entries, and set . We will make use of the singular value decomposition (SVD) of a real matrix : if A is of rank r, then there exist , and such that
is a diagonal matrix with , and
This is known as the economic form of the SVD . For , the i-th largest singular value of A is defined to be , with for whenever . The generalized inverse of with , is defined as (and ). By a rank deficient matrix, we imply a nonlinear decay of singular values . In this case, the numerical rank of A may be smaller than the actual rank due to the use of finite precision arithmetic.
The (compact) QR-factorization of A takes the form
where P is a permutation matrix, Q has orthonormal columns, and R is upper triangular. The permutation matrix P can more efficiently be represented via a vector of indices such that where I is the identity matrix. The factorization (1) can then be written as:
Another commonly used decomposition is the pivoted LU:
with L a lower triangular and U an upper triangular matrix. In the Octave environment, these decompositions can be constructed, respectively, via the commands . The matrix P does not need to be explicitly formed. Instead, vector I gives the permutation information. The relation between the two in Octave is given by the command .
Many matrices in applications turn out to be sparse. They can be stored more efficiently, without the need to store all elements. The simplest sparse format is the so called coordinate sparse format, common to, e.g., the Octave environment. In this format, we store the integers row, column, and floating point value for each nonzero of the sparse matrix A: a set of triplets of the form . However, we do not need to store all the row and column indices of the nonzero elements. Below, we summarize the two commonly used sparse formats for an example matrix.
The compressed column and row formats for this matrix are given by the vectors:
In the compressed column format, the dc array stores the nonzero elements, scanned row by row. The array ic stores the row index of the corresponding data element, and the array pc stores the index of the start of each column in the data array dc. Similarly, for the compressed row format, all the column indices of nonzeros are given, but the row information is compressed by giving in pr, the index of the start of each row in dr. Moreover, if needed, the three vectors for the sparse representation above can be further compressed, with, e.g., lossless compression techniques, such as arithmetic coding . BLAS operations on sparse matrices can be performed directly using these storage formats. Below, we list the pseudocode for the operations and for a sparse matrix A stored with compressed row format.
|function y = mat_mult (A, x)||function y = mat_trans_mult (A, x)|
|y = zeros (m, 1);||y = zeros (n, 1);|
|for i = 1:m||for i = 1:m|
|for j = pr (i): pr (i + 1)||for j = pr (i): pr (i + 1)|
|y (i) = y (i) + dr (j) * × (i r (j));||y (i r (j)) = y (i r (j)) + dr (j) * × (i);|
4. Direct solves
Given a linear system with a square matrix A which is invertible , the solution x can be constructed through the inverse of A, built up using Gaussian elimination. For relatively small systems, the construction of such solutions is often desired over least squares formulations, when a solution is known to exist. Typically elimination is used to construct the factorization of A into a QR or LU decomposition. The construction of factorizations (QR, LU) with column pivoting can be applied to system solves involving rank deficient matrices. As an example, consider the pivoted QR factorization . Here , a rearrangement of the columns of A upon multiplication with permutation matrix P. Plugging into yields , which is an upper triangular system, and can be solved by back substitution. A simple permutation yields the solution x.
Similarly, suppose we have the pivoted LU factorization . Then, plugging into yields . Next, set with . Then (which is a lower triangular system) can be solved by forward substitution for z, while can be solved by back substitution for y. Again applying a permutation matrix to yields the result. Notice that multiplying by P can be done efficiently, simply be re-arranging the elements of y. The implementations of the back substitution and forward substitution algorithms are given below.
|% Solve Lz = b||% Solve Uy = z|
|function z = fwd_sub (L, b)||function y = back_sub (U, z)|
|n = length (b);||n = length (z);|
|z = zeros (n, 1);||y = zeros (n, 1);|
|for i = 1:n||for i = n: −1:1|
|z (i) = (b (i) − L(i,:) * z) /L(i, i);||y (i) = (z (i) − U(i,:) y) /U(i, i);|
5.1. Least squares
Prior to discussing two-norm or what is more commonly known as Tikhonov regularization, we mention the least squares problem:
This formulation arises due to the noise in the right hand side b, in which case it does not make sense to attempt to solve the system directly. Instead, if we have an estimate for the noise norm (that is, with unknown noise vector e, but we can estimate ), then we could seek a solution x such that . Let us now look at the solution of (4) in more detail. As the minimization problem is convex and the functional quadratic, we obtain the minimum by setting the gradient of the functional to zero:
A common choice of solution to the quadratic Eq. (5) would be directly through the generalized inverse:
because of all the solutions to has the smallest -norm: if and only if for some , and
Typically, the least squares problem is solved by an iterative method such as conjugate gradients (CG) or related techniques such as LSQR. Whichever way the solution to the normal equations in (5) is obtained, it will be close A+b and share its properties.
First, if A has small singular values, the norm of the solution will be very large because of the matrix. Another disadvantage of (4) is that the solution A+b will be very sensitive to any noise in b or even in A (if any approximations in the calculations are used). Suppose the noise vector e behaves like white noise. Its different entries are uncorrelated, each having mean 0 and standard deviation . If, in addition, the elements of and e are uncorrelated, we have:
We may then estimate the norm of the variance of the solution vector:
where is the smallest magnitude singular value of A. We can clearly see that when A has an appreciable decay of singular values (such that is small relative to σ1), the solution will be sensitive to data errors. For these reasons, adding additional terms (regularization) to the optimization problem is often necessary.
5.2. Tikhonov regularization
Having discussed the least squares approach we turn our attention to the simplest form of Tikhonov Regularization:
where λ > 0 is a scalar regularization parameter, which controls the tradeoff between the solution norm and the residual fit . Since the functional in brackets is convex, we can get the solution by again setting the gradient to zero:
If we plug in the SVD of , we get:
We see that the effect of the regularization is to filter the small singular values , by replacing each by , which prevents the singular values smaller than from dominating the solution. If we now compute the norm of the solution variance, we obtain:
The result follows because the function satisfies at with . Thus, the solution to the system from (8) (even if obtained from a CG type scheme after a finite number of iterations) relieves the problems due to small singular values and noise, which affect the solution from (5).
In practice, a slight generalization of (7) is performed by adding a regularization matrix L:
Generally, L is some kind of sharpening (difference) operator. The penalization of the term , thus results in smoothing. If the coordinate system x is one-dimensional, it could take the form:
When the coordinate system corresponding to the model vector x is higher dimensional, L will be correspondingly more complicated and will generally have block structure. Note that the solution to (9) can be obtained through the linear equations:
and can also be cast as an augmented least squares system and solved through its corresponding augmented normal equations:
This is an important point with regards to implementation, as it means that codes which solve the normal equations for standard least squares can be readily modified to solve (9). If L is a smoothing operator, the parameters and effect the degree of norm damping and model smoothing, respectively. Notice that increasing from zero also changes the solution norm, so may need to be altered to compensate.
5.3. Sparse regularization and generalized functional
The penalty in (7) has the effect of penalizing the solution norm in a way which encourages all coefficients to be small (as λ is increased). For very large λ, only x = 0 would result in the required minimum but for modest values (below, e.g., ), the effect would be to force all solution coefficients to have smaller magnitudes with increasing λ, but would not make any of them explicitly zero. In many applications, a sparse solution is sought (where, a significant percentage of the coefficients of x are zero). A so-called measure is an indicator function for the number of nonzeros of x. This measure is not a norm, as it does not satisfy, e.g., the basic triangle inequality. Constraining the measure (e.g., , ), leads to a combinatorially difficult optimization problem.
A key insight of compressive sensing is that the minimization with respect to the measure and the convex norm (the sum of the absolute values of the components of a vector) produce the same result (the sparsest solution) under some strong conditions (i.e., RIP, restricted isometry property) on A . In practice, a nonrandom A from a physical application would not satisfy the RIP conditions. On the other hand, the minimization with respect to the norm (and the -norms for 0 < p < 2, convex only for ) produce sparse solutions (but not-necessarily the sparsest one at a given residual level). Sample illustrations for the 2d case are given in Figure 2, where we can observe that in two dimensions, the minimization of the norm for results in one of the two components equal to zero. To account for the possibility of employing a sparse promoting penalty and also for more general treatment of the residual term, which we discuss more below, we will consider the two-parameter functional :
with (with l = 2 being the most-common residual-based penalty). For p < 2, the functional is not differentiable. This means that the minimum value cannot be obtained by setting the gradient equal to zero as for -based minimization. A particularly well-studied example is the case, which is the closest convex problem to the penalty. Convexity of guarantees that any local minimizer is necessarily a global one. In this case, an algorithm can be constructed which decreases the functional value and tends to the (global) minimizer of . One such method is called the iterative soft thresholding algorithm (ISTA) and relies on the soft thresholding function , defined as:
The benefit of this function is two-fold: it explicitly sets small components of x to zero (promoting sparsity) and is continuous (unlike the related hard thresholding function which simply zeros out all components smaller than τ in absolute value). The soft thresholding function satisfies a useful identity:
which is utilized with a surrogate functional approach to construct the ISTA scheme:
This algorithm converges to the minimizer for any initial guess and with (the spectral norm of A) being less than 1 (which can be accomplished simply by rescaling). The spectral norm of a matrix can easily be estimated using so called power iteration . Let us assume that is square. If it is not we can take the matrix in it’s place and take the square root of the eigenvalue found to be the estimate for the spectral norm. If we take a vector and write it as a linear combination of eigenvectors of , then . It follows that the iterative computation yields (plugging in ):
a scalar multiple of the dominant eigenvector. A simple computation yields the dominant eigenvalue. In practice, a much faster converging scheme called FISTA (Fast ISTA)  is utilized, which is a slight reformulation of ISTA applied to a linear combination of two previous iterates:
where t1 = 1. This algorithm is simple to implement and readily adapts to parallel architectures. For challenging problems (such as when the decay of singular values of A is rapid), the thresholding function can be slightly altered (see Figure 3), but the same iterative scheme (12) can be utilized. Two possible approaches are either to vary the thresholding, starting from soft thresholding and slowly approaching the (discontinuous) hard thresholding function, or to use a function which better mimics hard thresholding away from zero. The use of different thresholding functions alters the optimization problem being solved. Thresholding-based techniques are simple to implement but are not effective in all situations, particularly when only a few iterations are feasible (for example, when A is large). In this case, two interesting approaches are iteratively re-weighted least squares (IRLS) and convolution smoothing . Both techniques replace the nonsmooth part of the functional (namely, the absolute value function |x|) by a smooth approximation. Moreover, both techniques have the particular advantage of being able to employ gradient-based methods (such as CG) at each iteration, considerably increasing the per-iteration performance. The IRLS approach is based on the approximation:
with two diagonal iteration dependent matrices Dn and Rn. The diagonal matrix Dn has elements and Rn has diagonal elements (for i where , we can set the entry to with the choice of ϵ user controllable, tuned for a given application). Here, the residuals and the iteration dependent weights are given by:
The diagonal matrices (or simply the vectors holding the diagonal elements) are updated at each iteration and the system in (6.10) can be solved approximately via a few iterations of CG- or LSQR-based algorithms. Another advantage of the IRLS approach is that the powers p can be made component dependent. This then allows for better inversion of partially sparse signals (if of course, the location of the sparse part can be estimated with some accuracy). An example is illustrated in Figure 4 and further discussed in . Another approach discussed in  is based on a smooth approximation of the absolute value function f(t) = |t| obtained via convolution with a sequence of Gaussian kernels, which have approximately shrinking support. The resulting “conv-CG” method is suitable especially for rapid warm start acceleration.
5.4. Alternate penalty functions and regularization parameter selection
We mention here the classical image deconvolution problem. Given a blurring source g, such as a 2D Gaussian function, we can produce a blurry image s from an unaltered source f via convolution , where n is some additive noise component. For such situations, a TV (total variation) norm penalty is frequently used, for purposes of noise removal . For a 2-D signal (such as an image), the TV penalty can be written as . Sometimes, the alternate approximation is utilized. Various iterative schemes have been developed for such penalty functions .
Both the -based approaches and sparsity promoting regularization schemes (as well as TV-norm penalty functionals) utilize one or more regularization parameters. In the case of Tikhonov regularization with smoothing (as in (9)) more than one parameter is present. In this case, the second (smoothing) parameter can generally be set according to the desired smoothing effect, once the first parameter is chosen (with a fixed value of ) and then, can be adjusted to achieve a desired solution norm. Thus, we focus here on techniques to adjust , which we simply refer to as .
The standard way to choose the parameter is to use the L-curve technique starting at a large (generally a value close to is a good choice) and decreasing down in logarithmic fashion using the logarithmically spaced sequence:
The first is based on a target residual value, typically determined by the estimate of the noise norm. At every after the initial value we reuse the previous solution as the initial guess for the CG scheme at the current . We can use the solution xλ for which is closest to the desired residual level (or refine further the solution at this with more CG iterations).
If however, the target residual norm is not available, other techniques must be used. We discuss a method using the so-called L-curve where for the norm damping problem (7), we plot a curve composed of points which we can obtain using the same continuation procedure previously discussed. The curve represents the tradeoff between the residual value and the solution norm . In practice, neither of these quantities should dominate over the other. Hence, an established strategy is to look for the point of maximum curvature along the L-curve . If we set:
where xλ is the solution of (11) at the particular value of . We can then compute the curvature by the formula:
where the derivative quantities can be approximated via finite differences. We illustrate various plots for a synthetic example in Figure 5. In the residual plot, the target residual is taken to be the magnitude of the noise vector norm. We can also see that the lowest percent error between and the true x occurs at a value of λ roughly corresponding to the highest curvature of the L-curve. In fact, for this example, the curvature method gives a better estimate of good λ than the residual curve technique.
6. Nonlinear least squares (NLS) problems
In many cases, the inverse problem may be posed in terms of a nonlinear function F (x, t) with x a vector of variables, which may be time dependent with parameter t. We first describe, here, the popular Newton-Gauss method for NLS . Let with . Then, the NLS problem takes the form: . Setting , yields with Newton’s method:
Expanding the gradient and Hessian of g yields:
where The approximation is used in the expression for the Hessian in the Gauss-Newton method, yielding a simple iterative scheme:
Unfortunately, this method is not stable and will typically not converge if initialized far away from a minimum solution . Improvements include the introduction of a step size parameter:
and of the use of a regularizer (e.g., Levenberg-Marquardt method ): where the system is replaced by an -norm penalty regularized system, .
7. Low-rank matrix factorizations
In many applications, there are large matrices with rapidly decaying singular values. In such cases, low-rank matrix approximations like the low-rank SVD are useful for compression, speed gains, and data analysis purposes. For , the low-rank SVD of rank k (with k < min(m, n)) is the optimal matrix approximation of A in the spectral and Frobenius norms. Taking p = min(m, n), we define the low-rank SVD of rank k by Ak by taking into account only the first k < p singular values and vectors: that is, with consisting of the first k columns of consisting of k rows and columns of , and consisting of the first k columns of V:
By the Eckart-Young theorem :
when the error is measured in the spectral norm, and
in the Frobenius norm. When k ≪ p, the matrices , and are significantly smaller (cost of storage of all nonzeros is mk + nk + k) than the corresponding full SVD matrices U, Σ, and V (cost of storage is mp + np + p) and that of A (cost of storage is mn, but only some fraction of this if A is sparse). While the construction of Ak is expensive (requiring in most cases the SVD of A), it can be approximated very accurately via randomized algorithms, which requires only the SVD of a smaller matrix. Various randomized algorithms for constructing the low-rank SVD and related factorizations are described in . Techniques for computing the low-rank SVD of a matrix rely on a simple principal. An orthonormal matrix (with r = k + l where l is a small oversampling parameter, e.g., l = 5), is computed such that . If in fact r is large enough so that , the range of A is a subset of the range of Q. Thus, when , we expect the range of Q to capture a good portion of the range of A, a statement which can be made rigorous with some analysis. In this case, we form the smaller matrix , where , possibly much smaller than the m × n matrix A. Instead of performing the SVD on A, we can obtain the SVD of . If , then and the later will form a low-rank SVD approximation for A (if we only take the first k singular vectors and values of the corresponding factorization). Notice that when A is rectangular, the eigen-decomposition of the BBT or BT B matrices can be used to construct the approximate low-rank approximation of A.
A separate problem is the construction of a suitable matrix Q from A. Again, the idea is to construct as small (in terms of column number) as possible Q with orthonormal columns, such that Q captures a good chunk of the range of A. When A is a matrix of known rank k then (in MATLAB notation), simply setting where is a GIID matrix, with l a small over-sampling parameter, produces a valid matrix Q for projection. When the tail singular values (those smaller than σk) are still expected to be significant, a power sampling scheme turns out to be effective. Instead of setting Y = AΩ and performing QR of Y, we use the matrix with q ≥ 1. Plugging in the SVD of A, we obtain , which has the same eigenvectors as A but much faster decaying singular values. Care must be taken when taking powers of matrices, to prevent multiplying matrices whose singular values are greater than one in magnitude. However, when the rank of the matrix A is not known, it is hard to use this approach, since the optimal size of the random matrix Ω to use would not be known. In this situation, a blocked algorithm can be employed , where on output with user supplied parameter, an orthonormal matrix Q and matrix B are produced such that where B = QT A. Then, any number of standard low-rank matrix factorizations can be computed by operating on the matrix B instead of A. The basic steps of the proposed algorithm are given in Figure 6. We note that the resulting Q matrix can be utilized also for purposes of model reduction (e.g., one can use the reduced linear system QT Ax = QT b as an approximation to the full system Ax = b). That is, one can use the reduced linear system as an approximation to the full system (or to replace and with the projected values in the least squares formulation); which has applications to e.g. accelerate image deblurring. The construction of for large matrices can in practice be done in parallel by employing the algorithm in  over row blocks of the matrix, as illustrated in Figure 6.
The rank-k of a general matrix A yields an optimal approximation of rank k to A, both in the operator (spectral) and Frobenius norms. On the other hand, even if A is a sparse matrix, the and factors Uk and Vk are typically dense. This means that if the matrix is approximately p percent filled, the matrix will have approximately nonzeros. On the other hand, the rank k SVD will consist of approximately mk + k + nk nonzeros. For growing rank k, this quantity will quickly approach and even exceed N. Thus, even though the low-rank SVD is optimal for a given rank k, the choice of rank may be limited to relatively low values with respect to min(m, n) for sparse matrices, in order to achieve any useful compression ratios. (Of course, the usefulness of low-rank SVD representation is not simply limited to compression; indeed they are useful, e.g., for low-dimensional data projections; but the utility of a low-rank approximation is greatly reduced once the storage size of the factors exceeds that of the original matrix). Yet another aspect of the SVD which may be problematic is the difficulty in interpreting the eigenvectors present in Uk and Vk. While in many applications these have distinct meanings, they are not often easy to interpret for a particular data set.
It is thus plausible, in the above context, to look for factorizations which may not be optimal for rank k, but which may preserve useful properties of A such as sparsity and non-negativity, as well as allow easier interpretation of its components. Such properties may be found in the one- and two-sided interpolative decompositions and the CUR decomposition based on the pivoted QR decomposition. If we stop the QR procedure after the first k iterations, we obtain:
From this formulation, we set
where Tl is the solution to the matrix equation S11Tl = S12 (which if solved for Tl a column at a time, is simply a set of linear systems). It follows that we can write:
The one-sided ID of (rank k) is the approximate factorization:
where we use a partial column skeleton of a subset of the columns of A and V is a well-conditioned matrix. Clearly, C simply represents a subset of the columns of A chosen based on the pivoting strategy used in the QR factorization. The typical pivoting strategy is to choose the permutation matrix P (which simply dictates the re-arrangement of the columns of A) such that if S22 above is omitted, yielding:
then the components of satisfy . Several other pivoting strategies can be employed and each will yield a somewhat different re-arrangement of the columns of A.
Once the single-sided ID is defined, the two-sided ID can be constructed simply by obtaining a one-sided ID of A and that of AT. A set of select columns of AT obtained by this procedure, will be the same as the set of select rows of A. Thus, we can write the two-sided ID of (rank k) as:
The procedure for the construction of the interpolative decompositions can be accelerated by means of randomization, just like for the low-rank SVD. This is possible by virtue of the result below . Lemma 1 Let be a matrix with GIID entries. Then, for any , we have that and .
Lemma 1 Let be a matrix with GIID entries. Then, for any , we have that and .
Suppose, A is and we draw a GIID matrix . Suppose, we then form the matrix . Then, . As the pivoting result depends heavily on the ratio of the individual column norms of A with respect to one another, the above result tells us that the ratio of column norms is roughly preserved in a matrix resulting from the multiplication of the original matrix by a Gaussian random matrix from the left. As the product matrix consists of fewer rows than the original matrix, the pivoted QR factorization is correspondingly cheaper to perform on the product matrix Z than on A, while the resulting permutation matrix (really the re-arrangement vector) will be similar for both cases.
The two-sided ID allows us to construct the popular Column/Row skeleton CUR (rank k) decomposition:
Suppose, we compute a two-sided rank k ID factorization forming the column/row skeleton A(Jr(1: k), Jc(1: k)). Set:
We then set this to equal the factors C and R in CUR:
where we take U to satisfy the system UR = V T: In Figure 7, we compare the relative errors obtained with different approximations at the same rank. For matrices with mild singular value decay, the low-rank SVD obtained via a randomized scheme (with oversampling) gives significantly closer to optimal performance (to true truncated SVD) than other decompositions.
8. An introduction to Backus-Gilbert inversion
As previously mentioned, damped least-squares (DLS) techniques are commonly exploited to solve linear, discrete inversion problems, such as those encountered in seismic tomography [15, 16]. To break the nonuniqueness of the least-squares solution, DLS inversion schemes often rely on ad hoc regularization strategies (e.g., model norm damping or smoothing), designed to subjectively favor the model simplicity. However, in regions of poor seismic data coverage, DLS methods may lead to locally biased model solutions—potentially causing model misinterpretations . In other words, DLS models may represent “biased averages” over the true-model parameters. Most geotomographical studies suffer from uneven data coverages, and thus are concerned by these averaging bias effects. For example, teleseismic body-wave ray-paths irregularly sample the Earth’s interior, because earthquakes typically are concentrated along oceanic ridges or subduction zones and seismometers are located over continental areas or oceanic islands.
A fundamentally different approach is that of linear Backus-Gilbert inversion [18, 19, 20], which belongs to the class of optimally localized averages (OLA) methods. In the discrete version of the Backus-Gilbert theory, one aims at evaluating (weighted) averages of the true-model parameters. That is, the Backus-Gilbert method seeks to determine unbiased model estimates. Over the past half century, many authors have considered that, in addition to being computationally very intensive, it could be a clumsy affair in the presence of data errors to practically implement the Backus-Gilbert method to (large-scale) tomographic applications [15, 21, 22, 23]. In the following, we aim to describe a recently developed—and (finally!) computationally tractable—tomographic approach  based on the Backus-Gilbert philosophy.
The SOLA (subtractive optimally localized averages) method [24, 25] is a variant of the original Backus-Gilbert approach, which has been exploited to solve helioseismic inversion problems [26, 27]. As a remark, Pijpers and Thompson  termed this alternative the SOLA method, though it may have been rediscovered independently by different authors [28, 29]. SOLA retains all the advantages of the original Backus-Gilbert method, but is much more computationally efficient and versatile in the construction of resolving (averaging) kernels. Recently, SOLA has been introduced and adapted to large-scale, linear and discrete “tomographic” problems by Zaroli . We now briefly review the SOLA inversion scheme, tailored to seismic tomography.
In this section, let us slightly change the notations about linear inverse problems, to keep closer with those preferred in the geosciences community [10, 17]. Let us consider linear, discrete forward problems of the form:
where denotes the data, the sensitivity matrix, the true-model parameters, and the noise. The sensitivity matrix elements are the partial derivatives of the data with respect to the model parameters: . Typically, in “large-scale” tomographic studies, one may have to deal with model parameters and data. Let us consider, without loss of generality, that the data are time-residuals, the model parameters are velocity anomalies, the model space is parametrized using regular-size cells (local and “orthonormal” parameterization), the noise is randomly drawn from a normal distribution , and the data covariance matrix is . For local and “irregular” parametrizations, the reader is referred to . It is a common practice to normalize both the data and sensitivity matrix by the data errors; thus .
One aims to find a model estimate, , that can be expressed as a linear combination of the data:
where the matrix denotes some generalized inverse. The model estimate can be decomposed as
is often referred to as the model resolution matrix. The first term in right member of (25), , represents the filtered true model, and shows our inability, if , to perfectly recover the true model. Here, we refer to the k-th row of the resolution matrix, , as the resolving kernel that linearly relates the k-th parameter estimate, , to the true-model parameters:
Therefore, we wish that represents an unbiased averaging over the true model parameters, m. This means that, for any parameter index , we wish that . is non-negative and satisfies to
The second term in right member of (25), , denotes the propagated noise (i.e. the propagation of data errors) into the model estimate. Robust model interpretations require accurate appraisals of model estimates, that is to compute and carefully analyze both and the model covariance matrix
As a remark, for DLS models this would also mean to quantify averaging bias effects (if any)—see . The model estimate , resolution , and covariance can be inferred from the generalized inverse ; efficiently computing the full generalized inverse is then crucial for any linear inverse problem. As we shall see, in the “SOLA Backus-Gilbert” approach the generalized inverse is directly determined.
The original Backus-Gilbert scheme consists in constructing the most peak-shaped resolving kernel (peaked around each model parameter location), while moderating at most the propagated noise into the model estimate. The key idea in the SOLA method is to specify an a priori “target form” for each resolving (averaging) kernel. One needs to specify M target resolving-kernels (hereafter, target kernels) such that their spatial extent represents some a priori estimate of the spatial resolving-length (around each parameter location). As an example, for 2-D tomographic studies the simplest target form could be circular (isotropic resolving-length); each target kernel would be constant inside such a circle and zero outside. Rather than minimizing the spread of each resolving kernel, as in the original Backus-Gilbert formulation, in the SOLA approach one aims at minimizing the integrated squared difference between each resolving kernel and its associated target kernel. Each row of the SOLA generalized inverse is individually computed by solving a specific minimization problem—the full computation of is then extremely parallel. The k-th row, , is found such that:
where and are the k-th tradeoff parameter (resolution misfit versus model variance) and target resolving-kernel vector, respectively; k is the index of considered model parameter. Because of the additional constraint in (30), the k-th parameter estimate, , is expected to be unbiased (provided that its corresponding resolving kernel is (mostly) non-negative)—so for the model estimate . Though not strictly necessary, here all M target kernels are imposed to be unimodular:
The system to be solved for the k-th row of the SOLA generalized inverse then writes as follows:
As a remark, since only a single (k-th) parameter index is treated at a time in (32), it could be difficult to ensure that all M selected values for the tradeoff parameters () would lead to “globally coherent” model solutions. However, it seems [10, 17] that globally coherent tomographic images can be obtained when using: (1) target kernels whose size is tuned to the spatially irregular data coverage (for instance using seismic ray-paths density as a proxy for the spatial variations of the local resolving-length); and (2) constant-valued tradeoff parameters, that is:
In practice, it seems that may (roughly) be determined from analyzing a few curves of tradeoff between and , for some randomly chosen parameter index (k). Let us now define the following quantities [10, 17, 30]:
where c1 is assumed to be nonzero and denotes the Kronecker symbol. Solving (32) therefore consists in solving for the following normal equations:
using for instance the LSQR algorithm , and then to infer the final solution x(k) (i.e., the k-th row of the SOLA generalized inverse) from such that:
Last, but not least, we now aim to discuss about the computational efficiency of the SOLA approach for computing the full generalized inverse (see ). First, the rows of the generalized inverse matrix can be computed in parallel on P processors, so that computing all M rows would take CPU-time, where t is the average CPU-time to numerically solve (35). A crucial point is that the matrix , of size (M + 1) × (N − 1), does not depend on the parameter index (k), so that it does not need to be recomputed M times—as it was required in the original Backus-Gilbert approach (see ). The vector has to be recomputed M times, but that task is computationally cheap. and can easily be reconstructed if one aims at investigating different values (only the last row of and last element of depend on ). Finally, simply re-ordering the rows of the sensitivity matrix G (and corresponding data), such that the first row of G is the sparsest one, allows the matrix to be almost as sparse as G—this sparsity property is very useful when solving (35), in terms of storage, efficiency of the LSQR algorithm, and memory footprint.
Figure 8 shows an example of the SOLA method applied to global-scale seismic tomography , for which there are M = 38, 125 model parameters and N = 79, 765 data (teleseismic shear-wave time-residuals). Tomographic images represent isotropic, 3–D shear-wave velocity perturbations within the whole Earth’s mantle (with respect to some reference, radial absolute velocity model). Figure 8a and b displays the tomographic model , at about 600 km depth, and its uncertainty computed as
(since the data are normalized by their errors), respectively. The form of each target kernel is that of a 3-D spheroid, corresponding to a priori lateral and radial resolving lengths that may be expected locally, at best, given the data coverage. Let us focus on the k′-th model parameter, marked by a green dot in Figure 8; a zoom-in on the tomographic model is shown in Figure 8c. Horizontal (600 km depth) and vertical cross-sections through the k′-th target kernel are displayed in Figure 8d and e, respectively. The corresponding k′-th resolving (averaging) kernel is similarly displayed in Figure 8f and g.
Finally, the “SOLA Backus-Gilbert” approach, introduced and adapted to large-scale, linear, discrete tomographic problems by Zaroli , allows to efficiently compute unbiased models, including their full resolution and covariance—enabling quantitative model interpretations .
In this work, we have presented several techniques useful to the practitioner in the field of inverse problems, with the aim to give an idea of when and how these techniques should be employed for various linear and nonlinear applications. We have discussed techniques such as sparse matrix storage, the use of pivoted factorizations for direct solves, , and intermediate penalty-based regularization strategies, nonlinear least squares problems, the construction and use of low-rank factorizations, and an application of the Backus-Gilbert inversion approach tailored to seismic tomography.