Survey of computational methods for inverse problems

Inverse problems occur in a wide range of scientiﬁc applications, such as in the ﬁelds of signal processing, medical imaging or geophysics. This work aims to present to the ﬁeld practitioners, in an accessible and concise way, several established and newer cutting-edge computational methods used in the ﬁeld of inverse problems – and when and how these techniques should be employed.


Introduction
In this work, we aim to survey several techniques useful to a practitioner in the field of inverse problems, where the solution to a vector of interest is given through a linear system Ax = b or through a set of non-linear equations F (x) = 0. In our presentation below, we review both classical results and newer approaches, which the reader may not be familiar with. In particular, this chapter offers entries on the following material: • Matrix factorizations and sparse matrices

Notation
In this Chapter, we use the norm · to refer to the spectral or operator norm, and · p to refer to the p norm. We make frequent use of the QR decomposition and the SVD (Singular Value Decomposition). For any M × N matrix A and (ordered) subindex sets J r and J c , A(J r , J c ) denotes the submatrix of A obtained by extracting the rows and columns of A indexed by J r and J c , respectively; and A(:, J c ) denotes the submatrix of A obtained by extracting the columns of A indexed by J c . We will make use of the covariance and the variance matrix, which we define as follows in terms of the expected value: ) T ] and var(x) = cov(x, x).
By 'Diag', we refer to a diagonal matrix with nonzeros only on the diagonal. We make use of so called GIID matrices. These are matrices with independent and identically distributed draws from the Gaussian distribution. In the Octave environment (which we frequently reference), these can be obtained with the 'randn' command. We assume the use of real matrices although most techniques we describe extend to the complex case.

Matrix factorizations and sparse matrices
Let A be an M × N matrix with real or complex entries, and set r = min(M, N ). We will make use of the singular value decomposition (SVD) of a real matrix A ∈ R M ×N : if A is of rank r, then there exist U ∈ R m×r , V ∈ R N ×r and Σ ∈ R r×r such that Σ = Diag(σ 1 , σ 2 , . . . , σ r ) ∈ R r×r is a diagonal matrix with σ 1 ≥ σ 2 ≥ · · · ≥ σ r > 0, and This is known as the economic form of the SVD [9]. For 1 ≤ i ≤ min{M, N }, the i-th largest singular value of A is defined to be σ i , with σ j = 0 for j = r + 1, . . . , min{M, N } whenever r < min{M, N }. The generalized inverse of A ∈ R m×N with SVD A = U ΣV T , is defined as A + = V Σ −1 U T (and Σ −1 = Diag(σ −1 1 , σ −1 2 , . . . , σ −1 r ) ∈ R r×r ). By a rank deficient matrix, we imply a non-linear decay of singular values {σ i }. In this case, the numerical rank of A is may be smaller than the actual rank due to the use of finite precision arithmetic.

The (compact) QR-factorization of A takes the form
where P is a permutation matrix, Q has orthonormal columns, and R is upper triangular. The permutation matrix P can more efficiently be represented via a vector J c ∈ Z n + of indices such that P = I(:, J c ) where I is the n × n identity matrix. The factorization (4.1) can then be written as: Another commonly used decomposition is the pivoted LU: A(:, J c ) = L U. m × n m × r r × n (4. 3) with L a lower triangular and U an upper triangular matrix. In the Octave environment, these decompositions can be constructed, respectively, via the commands ([Q, R, I] = qr(A, 0); [L, U, I] = lu(A). The matrix P does not need to be explicitly formed. Instead, the vector I gives the permutation information. The relation between the two in Octave is given by the command P(:,I) = eye(length(I)).
Many matrices in applications turn out to be sparse. They can be stored more efficiently, without the need to store all m × n elements. The simplest sparse format is the so called coordinate sparse format, common to e.g. the Octave environment. In this format, we store the integers row, column, and floating point value for each nonzero of the sparse matrix A: a set of triplets of the form (i, j, v). However, we do not need to store all the row and column indices of the nonzero elements. Below, we summarize the two commonly used sparse formats for an example matrix.
The compressed column and row formats for this matrix are given by the vectors: In the compressed column format, the d c array stores the nonzero elements, scanned row by row. The array i c stores the row index of the corresponding data element, and the array p c stores the index of the start of each column in the data array d c . Similarly, for the compressed row format, all the column indices of nonzeros are given, but the row information is compressed by giving in p r , the index of the start of each row in d r . Moreover, if needed, the three vectors for the sparse representation above can be further compressed, with e.g. lossless compression techniques, such as arithmetic coding [6]. BLAS operations on sparse matrices can be performed directly using these storage formats. Below, we list the pseudocode for the operations y 1 = Ax 1 and y 2 = A T x 2 for an m × n sparse matrix A stored with compressed row format.
function y=mat mult (A, x ) y = zeros (m, 1 ) ; for i =1:m for j=pr ( i ) : pr ( i +1) y ( i ) = y ( i ) + dr ( j ) * x ( i r ( j ) ) ; end end end function y=m a t t r a n s m u l t (A, x ) y = zeros ( n , 1 ) ; f o r i =1:m f o r j=pr ( i ) : pr ( i +1) y ( i r ( j ) ) = y ( i r ( j ) ) + dr ( j ) * x ( i ) ; end end end 5 Direct solves Given a linear system Ax = b with a square matrix A which is invertible (det(A) = 0), the solution x can be constructed through the inverse of A, built up using Gaussian elimination. For relatively small systems, the construction of such solutions is often desired over least squares formulations, when a solution is known to exist. Typically elimination is used to construct the factorization of A into a QR or LU decomposition. The construction of factorizations (QR, LU) with column pivoting can be applied to system solves involving rank deficient matrices. As an example, consider the pivoted QR factorization AP = QR. Here AP = A(:, I), a rearrangement of the columns of A upon multiplication with permutation matrix P .
which is an upper triangular system, and can be solved by back substitution. A simple permutation P T x = y ⇒ x = P y yields the solution x.
Similarly, suppose we have the pivoted LU factorization AP = LU . Then plugging into Ax = b yields LU P T x = b. Next, set z = U P T x = U y with y = P T x. Then Lz = b (which is a lower triangular system) can be solved by forward substitution for z, while U y = z can be solved by back substitution for y. Again applying a permutation matrix to x = P y yields the result. Notice that multiplying by P can be done efficiently, simply be re-arranging the elements of y. The implementations of the back substitution and forward substitution algorithms are given below.

Least Squares
Prior to discussing two-norm or what is more commonly known as Tikhonov regularization, we mention the least squares problem:x lsq = arg min This formulation arises due to the noise in the right hand side b, in which case it does not make sense to attempt to solve the system Ax = b directly. Instead, if we have an estimate for the noise norm (that is, b =b + e with unknown noise vector e, but we can estimate e 2 ), then we could seek a solution x such that Ax − b 2 ≈ e 2 . Let us now look at the solution of (6.1) in more detail. As the minimization problem is convex and the functional quadratic, we obtain the minimum by setting the gradient of the functional to zero: A common choice of solution to these quadratic equations (6.2) would be directly through the generalized inverse: because of all the solutions to A T Ax = A T b, A + b has the smallest 2 -norm: Typically, the least squares problem is solved by an iterative method such as cojugate gradients (CG) or related techniques such as LSQR. Whichever way the solution to the normal equations in (6.2) is obtained, it will be close A + b and share its properties.
First, if A has small singular values, the norm of the solution A + b = V Σ −1 U T b will be very large because of the Σ −1 matrix. Another disadvantage of (6.1) is that the solution A + b will be very sensitive to any noise in b or even in A (if any approximations in the calculations are used). Suppose the noise vector e behaves like white noise. Its different entries are uncorrelated, each having mean 0 and standard deviation ν. If in addition the elements ofb and e are uncorrelated we have: We may then estimate the norm of the variance of the solution vector: where σ min is the smallest magnitude singular value of A. We can clearly see that when A has an appreciable decay of singular values (such that σ min is small relative to σ 1 ), the solution x = A + b will be sensitive to data errors. For these reasons, adding additional terms (regularization) to the optimization problem is often necessary.

Tikhonov Regularization
Having discussed the least squares approach we turn our attention to the simplest form of Tikhonov Regularization:x tik = arg min where λ > 0 is a scalar regularization parameter which controls the tradeoff between the solution norm x 2 and the residual fit Ax − b 2 . Since the functional in brackets is convex, we can get the solution by again setting the gradient to zero: If we plug in the SVD of A = U ΣV T , we get: We see that the effect of the regularization is to filter the small singular values σ i , by replacing each σ i by σ i σ 2 i +λ , which prevents the singular values smaller than λ from dominating the solution. If we now compute the norm of the solution variance, we obtain: The result follows because the function h(t) : λ . Thus, the solution to the system from (6.5) (even if obtained from a CG type scheme after a finite number of iterations) relieves the problems due to small singular values and noise which effect the solution from (6.2).
In practice, a slight generalization of (6.4) is performed by adding a regularization matrix L: x tikL = arg min Generally L is some kind of sharpening (difference) operator. The penalization of the term Lx thus results in smoothing. If the coordinate system x is one dimensional, it could take the form: When the coordinate system corresponding to the model vector x is higher dimensional, L will be correspondingly more complicated and will generally have block structure. Note that the solution to (6.6) can be obtained through the linear equations: and can also be cast as an augmented least squares system and solved through its corresponding augmented normal equations: This is an important point with regards to implementation, as it means that codes which solve the normal equations for standard least squares can be readily modified to solve (6.6). If L is a smoothing operator, the parameters λ 1 and λ 2 effect the degree of norm damping and model smoothing, respectively. Notice that increasing λ 2 from zero also changes the solution norm, so λ 1 may need to be altered to compensate.

Sparse regularization and generalized functional
The 2 penalty in (6.4) has the effect of penalizing the solution norm in a way which encourages all coefficients to be small (as λ is increased). For very large λ, only x = 0 would result in the required minimum but for modest values (below e.g. A T b ∞ ), the effect would be to force all solution coefficients to have smaller magnitudes with increasing λ, but would not make any of them explicitly zero. In many applications, a sparse solution is sought (where a significant percentage of the coefficients of x are zero). A so-called 0 measure is an indicator function for the number of nonzeros of x. This measure is not a norm, as it doesn't satisfy e.g. the basic triangle inequality. Constraining the 0 measure (e.g. Ax − b < , min x 0 ), leads to a combinatorially difficult optimization problem.
A key insight of Compressive Sensing is that the minimization with respect to the 0 measure and the convex 1 norm (the sum of the absolute values of the components of a vector) produce the same result (the sparsest solution) under some strong conditions (i.e. RIP, Restricted Isometry Property) on A [7]. In practice, a non-random A from a physical application would not satisfy the RIP conditions. On the other hand, the minimization with respect to the 1 norm (and the p -norms for 0 < p < 2, convex only for p ≥ 1) produce sparse solutions (but not-necessarily the sparsest one at a given residual level). Sample illustrations for the 2d case are given in Figure 2, where we can observe that in two dimensions, the minimization of the p norm for p ≤ 1 results in one of the two components equal to zero. To account for the possibility of employing a sparse promoting penalty and also for more general treatment of the residual term, which we discuss more below, we will consider the two parameter functional [28]: withF p = F 2,p (with l = 2 being the most-common residual based penalty). For p < 2, the functional F p is not differentiable. This means that the minimum value cannot be obtained by setting the gradient equal to zero as for 2 based minimization. A particularly well studied example is the 1 case, which is the closest convex problem to the 0 penalty. Convexity ofF 1 (x) guarantees that any local minimizer is necessarily a global one. In this case, an algorithm can be constructed which decreases the functional value and tends to the (global) minimizer ofF 1 (x). One such method is called the iterative soft thresholding algorithm (ISTA) and relies on the soft thresholding function S τ (x), defined as: The benefit of this function is two-fold: it explicitly sets small components of x to zero (promoting sparsity) and is continuous (unlike the related hard thresholding function which simply zeros out all components smaller than τ in absolute value). The soft thresholding function satisfies a useful identity: which is utilized with a surrogate functional approach to construct the ISTA scheme: This algorithm converges to the 1 minimizer for any initial guess and with A 2 (the spectral norm of A) being less than 1 (which can be accomplished simply by rescaling). In practice, a much faster converging scheme called FISTA (Fast ISTA) [5] is utilized, which is a slight reformulation of ISTA applied to a linear combination of two previous iterates: where t 1 = 1. This algorithm is simple to implement and readily adapts to parallel architectures. For challenging problems (such as when the decay of singular values of A is rapid), the thresholding function S τ can be slightly altered (see Figure 3), but the same iterative scheme (6.9) can be utilized. Two possible approaches are either to vary the thresholding, starting from soft thresholding and slowly approaching the (discontinuous) hard thresholding function, or to use a function which better mimics hard thresholding away from zero. The use of different thresholding functions alters the optimization problem being solved. Thresholding based techniques are simple to implement but are Figure 3: Illustrations of different thresholding functions [27,24].
not effective in all situations, particularly when only a few iterations are feasible (for example, when A is large). In this case, two interesting approaches are iteratively re-weighted least squares (IRLS) and convolution smoothing [28]. Both techniques replace the non-smooth part of the functional (namely the absolute value function |x|) by a smooth approximation. Moreover, both techniques have the particular advantage of being able to employ gradient based methods (such as CG) at each iteration, considerably increasing the per-iteration performance. The IRLS approach is based on the approximation: where in the rightmost term, a small = 0 is used, to insure the denominator is finite, regardless of the value of x k . The resulting algorithm [28] for the minimization of (6.8) can be written as: with two diagonal iteration dependent matrices D n and R n . The diagonal matrix D n has elements » 1 2 λpw n k and R n has diagonal elements l|r n i | l−2 (for i where |r n i | < , we can set the entry to l l−2 with the choice of user controllable, tuned for a given application). Here, the residuals r n i = (Ax n − b) i and the iteration dependent weights are given by: The diagonal matrices (or simply the vectors holding the diagonal elements) are updated at each iteration and the system in (6.10) can be solved approximately via a few iterations of CG or LSQR based algorithms. Another advantage of the IRLS approach is that the powers p can be made component dependent. This then allows for better inversion of partially sparse signals (if of course, the location of the sparse part can be estimated with some accuracy). An example is illustrated in Figure 4 below, further discussed in [25]. Another approach discussed in [28] is based on a smooth approximation of the absolute value function f (t) = |t| obtained via convolution with a sequence of Gaussian kernels, which have approximately shrinking support. The resulting 'conv-CG' method is suitable especially for rapid warm start acceleration.

Alternate penalty functions and regularization parameter selection
We mention here the classical image deconvolution problem. Given a blurring source g, such as a 2-D Gaussian function, we can produce a blurry image s from an unaltered source f via convolution s = f g + n, where n is some additive noise component. For such situations, a TV (total variation) norm penalty is frequently used, for purposes of noise removal [29]. For a 2-D signal (such as an image), the TV penalty can be written as V (s) = i,j » |s i+1,j − s i,j | 2 + |s i,j+1 − s i,j | 2 . Sometimes, the alternate approximation i,j |s i+1,j − s i,j | + |s i,j+1 − s i,j | is utilized. Various iterative schemes have been developed for such penalty functions [29].
Both the 2 based approaches and sparsity promoting regularization schemes (as well as TVnorm penalty functionals) utilize one or more regularization parameters. In the case of Tikhonov regularization with smoothing (as in (6.6)) more than one parameter is present. In this case, the second (smoothing) parameter can generally be set according to the desired smoothing effect, once the first parameter λ 1 is chosen (with a fixed value of λ 2 ) and then, λ 1 can be adjusted to achieve a desired solution norm. Thus, we focus here on techniques to adjust λ 1 , which we simply refer to as λ.
The standard way to choose the parameter is to use the L-curve technique starting at a large λ (generally a value close to A T b ∞ is a good choice) and decreasing down in logarithmic fashion using the logarithmically spaced sequence: The parameter N can vary by application but is typically in the range [5,30]. Two typical strategies for parameter select are employed.
The first is based on a target residual value, typically determined by the estimate of the noise norm. At every λ after the initial value we reuse the previous solution as the initial guess for the CG scheme at the current λ. We can use the solution x λ for which Ax λ − b is closest to the desired residual level (or refine further the solution at this λ with more CG iterations).
If however, the target residual norm is not available, other techniques must be used. We discuss a method using the so-called L-curve where for the norm damping problem (6.4), we plot a curve composed of points (log Ax λ − b , log x λ ) which we can obtain using the same continuation procedure previously discussed. The curve represents the trade-off between the residual value Ax λ − b and the solution norm x λ . In practice, neither of these quantities should dominate over the other. Hence, an established strategy is to look for the point of maximum curvature along the L-curve [10]. If we set:¯ = log x λ p andρ = log Ax λ − b l , (6.12) where x λ is the solution of (6.8) at the particular value of λ. We can then compute the curvature by the formula:c where the derivative quantities can be approximated via finite differences. We illustrate various plots for a synthetic example in Figure 5. In the residual plot, the target residual is taken to be the magnitude of the noise vector norm. We can see also, that the lowest percent error betweenx λ and the true x occurs at a value of λ roughly corresponding to the highest curvature of the L-curve. In fact, for this example the curvature method gives a better estimate of good λ than the residual curve technique.

Non-linear least squares (NLS) problems
In many cases the inverse problem may be posed in terms of a nonlinear function F (x, t) with x a vector of variables, which may be time dependent with parameter t. We first describe here the popular Newton-Gauss method for NLS [9]. Let g(x) = 1 2 r(x) 2 with r i (x) = y i − F (x, t i ). Then the NLS problem takes the form:x = arg min x g(x). Setting ∇g(x) = 0, yields with Newton's method: Expanding the gradient and Hessian of g yields: The approximation T (x) ≈ 0 for the Hessian is what characterizes the Gauss-Newton method, yielding a simple iterative scheme: x n+1 = x n − î J T n J n ó −1 J T n r n . Unfortunately, this method is not stable and will typically not converge if initialized far away from a minimum solution [9]. Improvements include the introduction of a step size parameter: and of the use of a regularizer (e.g. Levenberg-Marquard method [9]): where the system J T n J n y = J T n r n is replaced by an 2 -norm penalty regularized system, (J T n J n + λI)ỹ = J T n r n .

Low rank matrix factorizations
In many applications, there are large matrices with rapidly decaying singular values. In such cases, low rank matrix approximations like the low rank SVD are useful for compression, speed gains, and data analysis purposes. For A ∈ R m×n , the low rank SVD of rank k (with k < min(m, n)) is the optimal matrix approximation of A in the spectral and Frobenius norms. Taking p = min(m, n), we define the low rank SVD of rank k by A k by taking into account only the first k < p singular values and vectors: that is, with U k ∈ R m×k consisting of the first k columns of U , Σ k = Diag(σ 1 , . . . , σ k ) ∈ R k×k consisting of k rows and columns of Σ, and V k ∈ R n×k consisting of the first k columns of V : By the Eckart-Young theorem [8]: when the error is measured in the spectral norm, and in the Frobenius norm. When k p, the matrices U k , Σ k , and V k are significantly smaller (cost of storage of all nonzeros is mk + nk + k) than the corresponding full SVD matrices U , Σ, and V (cost of storage is mp + np + p) and that of A (cost of storage is mn, but only some fraction of this if A is sparse). While the construction of A k is expensive (requiring in most cases the SVD of A), it can be approximated very accurately via randomized algorithms, which requires only the SVD of a smaller matrix. Various randomized algorithms for constructing the low rank SVD and related factorizations are described in [26]. Techniques for computing the low rank SVD of a matrix rely on a simple principal. An orthonormal matrix Q ∈ R m×r (with r = k + l where l is a small oversampling parameter, e.g. l = 5), is computed such that QQ T A ≈ A. If in fact r is large enough so that QQ T A = A, the range of A is a subset of the range of Q. Thus, when QQ T A ≈ A, we expect the range of Q to capture a good portion of the range of A, a statement which can be made rigorous with some analysis. In this case, we form the smaller matrix B = Q T A, where B ∈ R r×n , possibly much smaller than the m × n matrix A. Instead of performing the SVD on A, we can obtain the SVD of B = U ΣV T . If A ≈ QQ T A = QB, then A ≈ (QU )ΣV T and the later will form a low rank SVD approximation for A (if we only take the first k singular vectors and values of the corresponding factorization). Notice that when A is rectangular, the eigen-decomposition of the BB T or B T B matrices can be used to construct the approximate low rank approximation of A.
A separate problem is the construction of a suitable matrix Q from A. Again, the idea is to construct as small (in terms of column number) as possible Q with orthonormal columns, such that Q captures a good chunk of the range of A. When A is a matrix of known rank k then (in Matlab notation), simply setting Q = qr(AΩ, 0) where Ω ∈ R n×(k+l) is a GIID matrix, with l a small oversampling parameter, produces a valid matrix Q for projection. When the tail singular values (those smaller than σ k ) are still expected to be significant, a power sampling scheme turns out to be effective. Instead of setting Y = AΩ and performing QR of Y , we use the matrix Y = (AA T ) q AΩ with q ≥ 1. Plugging in the SVD of A, we obtain (AA T ) q A = U Σ 2q+1 V T , which has the same eigenvectors as A but much faster decaying singular values. Care must be taken when taking powers of matrices, to prevent multiplying matrices whose singular values are greater than one in magnitude. However, when the rank of the matrix A is not known, it is hard to use this approach, since the optimal size of the random matrix Ω to use would not be known. In this situation, a blocked algorithm can be employed [14], where on output with user supplied > 0 parameter, an orthonormal matrix Q and matrix B are produced such that QB − A < where B = Q T A. Then any number of standard low rank matrix factorizations can be computed by operating on the matrix B instead of A. The basic steps of the proposed algorithm are given in Figure 6. We note that the resulting Q matrix can be utilized also for purposes of model reduction (e.g. one can use the reduced linear system Q T Ax = Q T b as an approximation to the full system Ax = b).
The rank-k SVD (A k = U k Σ k V T k ) of a general m × n matrix A yields an optimal approximation of rank k to A, both in the operator (spectral) and Frobenius norms. On the other hand, even if A is a sparse matrix, the m × k and n × k factors U k and V k are typically dense. This means that function [Q, B] = randQB pb(A, ε, q, b) (1) for i = 1, 2, 3, . . .
if the matrix is approximately p percent filled, the matrix will have approximately N = [ p 100 m × n] nonzeros. On the other hand, the rank k SVD will consist of approximately mk + k + nk nonzeros. For growing rank k, this quantity will quickly approach and even exceed N . Thus, even though the low rank SVD is optimal for a given rank k, the choice of rank may be limited to relatively low values with respect to min(m, n) for sparse matrices, in order to achieve any useful compression ratios. (Of course, the usefulness of low rank SVD representation is not simply limited to compression; indeed they are useful e.g. for low dimensional data projections; but the utility of a low rank approximation is greatly reduced once the storage size of the factors exceeds that of the original matrix). Yet another aspect of the SVD which may be problematic is the difficulty in interpreting the eigenvectors present in U k and V k . While in many applications these have distinct meanings, they are not often easy to interpret for a particular data set.
It is thus plausible, in the above context, to look for factorizations which may not be optimal for rank k, but which may preserve useful properties of A such as sparsity and non-negativity, as well as allow easier interpretation of its components. Such properties may be found in the one and two sided interpolative decompositions and the CUR decomposition based on the pivoted QR decomposition. If we stop the QR procedure after the first k iterations, we obtain: From this formulation, we set C := A(:, J c (1 : k)) = Q 1 S 11 .
where T l is the solution to the matrix equation S 11 T l = S 12 (which if solved for T l a column at a time, is simply a set of linear systems). It follows that we can write: The one sided ID of (rank k) is the approximate factorization: where we use a partial column skeleton C = A(:, J c (1 : k)) of a subset of the columns of A and V is a well-conditioned matrix. Clearly, C simply represent a subset of the columns of A chosen based on the pivoting strategy used in the QR factorization. The typical pivoting strategy is to choose the permutation matrix P (which simply dictates the re-arrangement of the columns of A) such that if S 22 above is omitted, yielding: then the components ofS satisfy |s 11 | ≥ |s 22 | ≥ · · · ≥ |s nn |. Several other pivoting strategies can be employed and each will yield a somewhat different re-arrangement of the columns of A.
Once the single sided ID is defined, the two sided ID can be constructed simply by obtaining a one sided ID of A and that of A T . A set of select columns of A T obtained by this procedure, will be the same as the set of select rows of A. Thus, we can write the two sided ID of (rank k) as: The procedure for the construction of the interpolative decompositions can be accelerated by means of randomization, just like for the low rank SVD. This is possible by virtue of the result below [26].
Lemma 8.1 LetΩ ∈ R l×m be a matrix with GIID entries. Then for any a ∈ R m we have that E Ω a 2 a 2 = l and V ar Ω a 2 a 2 = 2l.
Suppose A is m × n and we draw an l × m GIID matrixΩ. Suppose we then form the l × n matrix Z =ΩA. Then, E Z(:,j) 2 A(:,j) 2 = l. As the pivoting result depends heavily on the ratio of the individual column norms of A with respect to one another, the above result tells us that the ratio of column norms is roughly preserved in a matrix resulting from the multiplication of the original matrix by a Gaussian random matrix from the left. As the product matrix consists of fewer rows than the original matrix, the pivoted QR factorization is correspondingly cheaper to perform on the product matrix Z than on A, while the resulting permutation matrix (really the re-arrangement vector) will be similar for both cases.
The two sided ID allows us to construct the popular Column/Row skeleton CUR (rank k) decomposition: Suppose we compute a two sided rank k ID factorization forming the k × k column/row skeleton A(J r (1 : k), J c (1 : k)). Set: We then set this to equal the factors C and R in CUR: where we take U to satisfy the system U R = V T : In Figure 7 below, we compare the relative errors obtained with different approximations at the same rank. For matrices with mild singular value decay, the low rank SVD obtained via a randomized scheme (with oversampling) gives significantly closer to optimal performance (to true truncated SVD) than other decompositions.

An introduction to Backus-Gilbert inversion
As previously mentioned, damped least-squares (DLS) techniques are commonly exploited to solve linear, discrete inversion problems, such as those encountered in seismic tomography [17,1]. To break the non-uniqueness of the least-squares solution, DLS inversion schemes often rely on ad hoc regularization strategies (e.g., model norm damping or smoothing), designed to subjectively favour the model simplicity. However, in regions of poor seismic data coverage, DLS methods may lead to locally biased model solutions -potentially causing model misinterpretations [31]. In other words, DLS models may represent 'biased averages' over the true-model parameters. Most geotomographical studies suffer from uneven data coverages, and thus are concerned by these averaging bias effects. For example, teleseismic body-wave ray-paths irregularly sample the Earth's interior, because earthquakes typically are concentrated along oceanic ridges or subduction zones and seismometers are located over continental areas or oceanic islands.
A fundamentally different approach is that of linear Backus-Gilbert inversion [2,3,4], which belongs to the class of Optimally Localized Averages (OLA) methods. In the discrete version of the Backus-Gilbert theory, one aims at evaluating (weighted) averages of the true-model parameters.
That is, the Backus-Gilbert method seeks to determine unbiased model estimates. Over the past half century, many authors have considered that, in addition to being computationally very intensive, it could be a clumsy affair in the presence of data errors to practically implement the Backus-Gilbert method to (large-scale) tomographic applications [15,19,23,1]. In the following, we aim to describe a recently developed -and (finally!) computationally tractable -tomographic approach [30] based on the Backus-Gilbert philosophy.
The SOLA (Subtractive Optimally Localized Averages) method [20,21] is a variant of the original Backus-Gilbert approach, which has been exploited to solve helioseismic inversion problems [22,11]. As a remark, Pijpers and Thompson [20] termed this alternative the SOLA method, though it may have been rediscovered independently by different authors [13,12]. SOLA retains all the advantages of the original Backus-Gilbert method, but is much more computationally efficient and versatile in the construction of resolving (averaging) kernels. Recently, SOLA has been introduced and adapted to large-scale, linear and discrete 'tomographic' problems by Zaroli [30]. We now briefly review the SOLA inversion scheme, tailored to seismic tomography.
In this section, let us change slightly the notations about linear inverse problems, to keep closer with those preferred in the geosciences community [30,31]. Let us consider linear, discrete forward problems of the form: where d = (d i ) 1≤i≤N denotes the data, G = (G ij ) 1≤i,j≤N,M the sensitivity matrix, m = (m j ) 1≤j≤M the true-model parameters, and n = (n i ) 1≤i≤N the noise. The sensitivity matrix elements are the partial derivatives of the data with respect to the model parameters: G ij = ∂d i /∂m j . Typically, in 'large-scale' tomographic studies, one may have to deal with M 10 5 model parameters and N 10 6 data. Let us consider, without loss of generality, that the data are time-residuals, the model parameters are velocity anomalies, the model space is parametrized using regular-size cells (local and 'orthonormal' parameterization), the noise is randomly drawn from a normal distribution N (0, σ n ), and the data covariance matrix is C d = σ 2 n I N . For local and 'irregular' parametrizations, the reader is referred to [30]. It is common practice to normalize both the data and sensitivity matrix by the data errors; thus C d = I N .
One aims to find a model estimate,m, that can be expressed as a linear combination of the data: is often referred to as the model resolution matrix. The first term in right member of (9.3),Rm, represents the filtered true model, and shows our inability, ifR = I M , to perfectly recover the true model. Here, we refer to the k-th row of the resolution matrix,R k. = (R kj ) 1≤j≤M , as the resolving kernel that linearly relates the k-th parameter estimate,m k , to the true-model parameters: kj m j , (ignoring the term of propagated noise) . (9.5) Therefore, we wish thatRm represents an unbiased averaging over the true model parameters, m. This means that, for any parameter index k ∈ [1, · · · , M ], we wish thatR k. is non-negative and satisfies to M j=1R kj = 1 .
The second term in right member of (9.3),Ĝ † n, denotes the propagated noise (i.e. the propagation of data errors) into the model estimate. Robust model interpretations require accurate appraisals of model estimates, that is to compute and carefully analyze bothR and the model covariance matrix As a remark, for DLS models this would also mean to quantify averaging bias effects (if any) -see [31]. The model estimatem, resolutionR, and covariance Cm can be inferred from the generalized inversê G † ; efficiently computing the full generalized inverse is then crucial for any linear inverse problem. As we shall see, in the 'SOLA Backus-Gilbert' approach the generalized inverse is directly determined.
The original Backus-Gilbert scheme consists in constructing the most peak-shaped resolving kernel (peaked around each model parameter location), while moderating at most the propagated noise into the model estimate. The key idea in the SOLA method is to specify an a priori 'target form' for each resolving (averaging) kernel. One needs to specify M target resolving-kernels (hereafter, target kernels) such that their spatial extent represents some a priori estimate of the spatial resolving-length (around each parameter location). As an example, for 2-D tomographic studies the simplest target form could be circular (isotropic resolving-length); each target kernel would be constant inside such a circle and zero outside. Rather than minimizing the spread of each resolving kernel, as in the original Backus-Gilbert formulation, in the SOLA approach one aims at minimizing the integrated squared difference between each resolving kernel and its associated target kernel. Each row of the SOLA generalized inverse is individually computed by solving a specific minimization problem -the full computation ofĜ † is then extremely parallel. The k-th row,Ĝ † k. = (Ĝ † ki ) 1≤i≤N , is found such that: where η k and t (k) = (T (k) j ) 1≤j≤M are the k-th trade-off parameter (resolution misfit versus model variance) and target resolving-kernel vector, respectively; k is the index of considered model parameter. Because of the additional constraint in (9.8), the k-th parameter estimate,m k , is expected to be unbiased (provided that its corresponding resolving kernel is (mostly) non-negative) -so for the model estimatem. Though not strictly necessary, here all M target kernels are imposed to be unimodular: (9.9) The system to be solved for the k-th row of the SOLA generalized inverse then writes as follows: As a remark, since only a single (k-th) parameter index is treated at a time in (9.10), it could be difficult to ensure that all M selected values for the trade-off parameters (η (k) ) would lead to "globally coherent" model solutions. However, it seems [30,31] that globally coherent tomographic images can be obtained when using: 1) Target kernels whose size is tuned to the spatially irregular data coverage (for instance using seismic ray-paths density as a proxy for the spatial variations of the local resolving-length); and 2) Constant-valued trade-off parameters, that is: In practice, it seems that η may (roughly) be determined from analyzing a few curves of trade-off between j (R kj − T (k) j ) 2 and σ 2 m k , for some randomly chosen parameter index (k). Let us now define the following quantities [16,30,31]: where c 1 is assumed to be non-zero and δ denotes the Kronecker symbol. Solving (9.10) therefore consists in solving forx (k) the following normal equations: using for instance the LSQR algorithm [18], and then to infer the final solution x (k) (i.e., the k-th row of the SOLA generalized inverse) fromx (k) such that: x (k) = Bx (k) + c −1 1 e 1 . (9.14) Last, but not least, we now aim to discuss about the computational efficiency of the SOLA approach for computing the full generalized inverse (see [30]). Firstly, the rows of the generalized inverse matrix can be computed in parallel on P processors, so that computing all M rows would take t × M/P CPU-time, where t is the average CPU-time to numerically solve (9.13). A crucial point is that the matrix Q (η) , of size (M + 1) × (N − 1), does not depend on the parameter index (k), so that it does not need to be recomputed M times -as it was required in the original Backus-Gilbert approach (see [30]). The vector y (k,η) has to be recomputed M times, but that task is computationally cheap. Q (η) and y (k,η) can easily be reconstructed if one aims at investigating different η values (only the last row of Q (η) and last element of y (k,η) depend on η). Finally, simply re-ordering the rows of the sensitivity matrix G (and corresponding data), such that the first row of G is the sparsest one, allows the matrix Q (η) to be almost as sparse as G -this sparsity property is very useful when solving (9.13), in terms of storage, efficiency of the LSQR algorithm, and memory footprint. Figure 8 shows an example of the SOLA method applied to global-scale seismic tomography [30], for which there are M = 38, 125 model parameters and N = 79, 765 data (teleseismic shear-wave timeresiduals). Tomographic images represent isotropic, 3-D shear-wave velocity perturbations within the whole Earth's mantle (with respect to some reference, radial absolute velocity model). Figs. 8(a,b) display the tomographic modelm, at about 600 km depth, and its uncertainty σm computed as (since the data are normalized by their errors), respectively. The form of each target kernel is that of a 3-D spheroid, corresponding to a priori lateral and radial resolving lengths that may be expected locally, at best, given the data coverage. Let us focus on the k -th model parameter, marked by a green dot in Fig. 8; a zoom-in on the tomographic model is shown in Fig. 8(c). Horizontal (600 km depth) and vertical cross-sections through the k -th target kernel are displayed in Figs. 8(d,e), respectively. The corresponding k -th resolving (averaging) kernel is similarly displayed in Figs. 8(f,g).
Finally, the 'SOLA Backus-Gilbert' approach, introduced and adapted to large-scale, linear, discrete tomographic problems by Zaroli [30], allows to efficiently compute unbiased models, including their full resolution and covariance -enabling quantitative model interpretations [31].

Conclusion
In this work, we have presented several techniques useful to the practitioner in the field of inverse problems, with the aim to give an idea of when and how these techniques should be employed for various linear and non-linear applications. We have discussed techniques such as sparse matrix storage, the use of pivoted factorizations for direct solves, 2 , 1 and intermediate penalty based regularization strategies, non-linear least squares problems, the construction and use of low rank factorizations, and an application of the Backus-Gilbert inversion approach tailored to seismic tomography.