## Abstract

In this chapter, we describe different methods to estimate an unknown signal from its linear measurements. We focus on the underdetermined case where the number of measurements is less than the dimension of the unknown signal. We introduce the concept of signal sparsity and describe how it could be used as prior information for either regularized least squares or Bayesian signal estimation. We discuss compressed sensing and sparse signal representation as examples where these sparse signal estimation methods could be applied.

### Keywords

- inverse problems
- signal estimation
- regularization
- Bayesian methods
- signal sparsity

## 1. Introduction

In engineering and science, a system typically refers to a physical process whose outputs are generated due to some inputs [1, 2]. Examples of systems include measuring instruments, imaging devices, mechanical and biomedical devices, chemical reactors and others. A system could be abstracted as a block diagram,

where **x** and **y** represent the inputs and outputs of the system, respectively. The block, **A**, formalizes the relation between these inputs and the outputs using mathematical equations [2, 3]. Depending on the nature of the system, the relation between its inputs and outputs could be either linear or nonlinear. For a linear relation, the system is called a *linear system* and it would be represented by a set of linear equations [3, 4]

In this chapter, we will restrict our attention to linear systems, as they could adequately represent many actual systems in a mathematically tractable way.

When dealing with systems, two typical types of problems arise, forward and inverse problems.

### 1.1. Forward problems

In a *forward problem*, one would be interested in obtaining the output of a system due to a particular input [5, 6]. For linear systems, this output is the result of a simple matrix-vector product, **Ax**. Forward problems usually become more difficult as the number of equations increases or as uncertainties about the inputs, or the behavior of the system, are present [6].

### 1.2. Inverse problems

In an *inverse problem*, one would be interested in inferring the inputs to a system **x** that resulted in observed outputs, i.e., measured **y** [5, 6]. Another formulation of an inverse problem is to identify the behavior of the system, i.e., construct **A**, from knowledge of different input and output values. This problem formulation is known as *system identification* [1, 7, 8]. In this chapter, we will only consider the input inference problem. The nature of the input **x** to be inferred further leads to two broad categories of this problem: *estimation*, and *classification*. In input estimation, the input could assume an infinite number of possible values [4, 9], while in input classification the input could assume only a finite number (usually small) of possible values [4, 9]. Accordingly, in input classification, one would like to only assign an input to a predetermined signal class. In this chapter, we will only focus on estimation problems, particularly on restoring an input signal **x** from noisy data **y** that is obtained using a linear measuring system represented by a matrix **A**.

## 2. Signal restoration as example of an inverse problem

To solve the above signal restoration problem, we need to estimate input signal **x** through the inversion of matrix **A**. This could be a hard problem because in many cases the inverse of **A** might not exist, or the measurement data, **y**, might be corrupted by noise. The existence of the inverse of **A** depends on the number of acquired independent measurements relative to the dimension of the unknown signal [5, 10]. The conditions for the existence of a stable solution of any inverse problem, i.e., for an inverse problem to be well-posed, have been addressed by Hadamard as:

*Existence*: for measured output**y**there exists at least one corresponding input**x**.*Uniqueness*: for measured output**y**there exists only one corresponding input**x**.*Continuity*: as the input**x**changes slightly, the output**y**changes slightly, i.e., the relation between**x**and**y**is continuous.

These conditions could be applied to linear systems as conditions on the matrix **A**. Let the matrix *n* × *m* with its elements being real values. The matrix equation, **y**_{n × 1} = **A**_{n × m} **x**_{m × 1}, is equivalent to *n* linear equations with *m* unknowns. The matrix **A** is a linear transformation that maps input signals from its domain **y∈Rn**, we could identify three cases based on the values of *n* and *m*.

### 2.1. Underdetermined linear systems

In this case, *n* < *m*, i.e., the number of equations is less than the number of unknowns,

If these equations are consistent, *Hadamard’s Existence* condition will be satisfied. However, Hadamard’s *Uniqueness* condition is not satisfied because the *Null Space*(**A**) ≠ {**0**}, i.e., there exist **z** ≠ **0** ∈ *Null Space*(**A**) such that,

This linear system is called *under-determined* because its equations, i.e., system constraints, are not enough to uniquely determine **x** [4, 5]. Thus, the inverse of **A** does not exist.

### 2.2. Overdetermined linear systems

In this case, *m* > *n*, the number of equations is more than the number of unknowns,

If these equations are consistent, Hadamard’s *Existence* condition will not be satisfied. However, *Hadamard*’s *Uniqueness* condition will be satisfied, if **A** has full rank. In this case, *Null Space*(**A**) = {**0**}, i.e.,

This linear system is called *over-determined*, because its equations, i.e., system constraints, are too many for **x** to exist [4, 5]. Also, the inverse of **A** does not exist.

### 2.3. Square linear systems

The case where *m* = *n*, the number of equations is equal to the number of unknowns,

If **A** has full rank, its *Null Space*(**A**) = {**0**} and both Hadamard’s *Existence* and *Uniqueness* conditions will be satisfied. In addition, if **A** has a small condition number, the relation between **x**, **y** will be continuous, and Hadamard’s *Continuity* condition will be satisfied [4, 5, 10]. In this case, the inverse problem formulated by this system of linear equations is well-posed.

## 3. Methods for signal estimation

In this section, we will focus on the estimation of an input signal **x** from a noisy measurement **y** of the output of a linear system **A**.

The linear system shown in Figure 1, could be modeled as,

where **v** is additive Gaussian noise. As a consequence of the *Central Limit Theorem*, this assumption of Gaussian distributed noise is valid for many output measurement setups.

*Statistical Estimation Theory* allows one to obtain an estimate **x̂** of a signal **x** that is input to a known system **A** from measurement **y** (see Figure 2) [11, 12]. However, this estimate **x̂** is not unique, as it depends on the choice of the used estimator from the different ones available. In addition to measurement **y**, if other information about the input signal is available, it could be used as prior information to constrain the estimator to produce a better estimate of **x**. Signal estimation for overdetermined systems could be achieved without any prior information about the input signal. However, for underdetermined systems, prior information is necessary to ensure a unique estimate.

### 3.1. Least squares estimation

If there is no information available about the statistics of the measured data,

least squares estimation could be used. The least squares estimate is obtained by minimizing the square of the *L*_{2} norm of the error between the measurement and the linear model, **v** = **y** − **Ax**. It is given by

The *L*_{2} norm is a special case of the *p*-norm of a vector, where *p* = 2, that is defined as **x** is considered deterministic, so its statistics are not required. The noise **v** in this formulation is implicitly assumed to be white noise with variance *σ*^{2} [13, 14]. Least squares estimation is typically used to estimate input signals **x** in overdetermined problems. Since **x̂** is unique in this case, no prior information, additional constraints, for **x** is necessary.

### 3.2. Weighted least squares estimation

If the noise **v** in Eq. (8) is not necessarily white and its second order statistics, i.e., mean and covariance matrix, are known, then weighted least squares estimation could be used to further improve the least squares estimate. In this estimation method, measurement errors are not weighted equally, but a weighting matrix **C** explicitly specifies such the weights. The weighted least squares estimate is given by

We note that the least squares problem, Eq. (9), is a special case of the weighted least squares problem, Eq. (10), when **C** = *σ*^{2}**I**.

### 3.3. Regularized least squares estimation

In underdetermined problems, the introduction of additional constraints on **x**, also known as *regularization*, could ensure the uniqueness of the obtained solution. Standard least squares estimation could be extended, through regularization, to solve underdetermined estimation problems. The regularized least squares estimate is given by

where **L** is a matrix specifying the additional constraints and *λ* is a *regularization parameter* whose value determines the relative weights of the two terms in the objective function. If the combined matrix **AL** has full rank, the regularized least squares estimate **x̂** is unique [4]. In this optimization problem, the unknown **x** is once again considered deterministic, so its statistics are not required. It is worthwhile noting that while *regularization* is necessary to solve underdetermined inverse problems, it could also be used to improve numerical properties, e.g., condition number, of either linear overdetermined or linear square inverse problems.

### 3.4. Maximum likelihood estimation

If the probability distribution function (pdf) of the measurement **y**, parameterized by an unknown deterministic input signal **x**, is available, then the *maximum likelihood* estimate of **x** is given by,

This maximum likelihood estimate **x̂** is obtained by assuming that measurement **y** is the most likely measurement to occur given the input signal **x**. This corresponds to choosing the value of **x** for which the probability of the observed measurement **y** is maximized. In maximum likelihood estimation, the negative log of the likelihood function, *f*(**y**|**x**), is typically used to transform Eq. (12) into a simpler minimization problem. When, *f*(**y**| **x**) is a Gaussian distribution, **N**(**Ax**, **C**), minimizing the negative log of the likelihood function is equivalent to solving the weighted least squares estimation problem.

### 3.5. Bayesian estimation

If the conditional pdf of the measurement **y**, given an unknown random input signal **x**, is known, in addition to the marginal pdf of **x**, representing prior information about **x**, is given, then a Bayesian estimation method would be possible. The first step to obtain one of the many possible Bayesian estimates of **x** is to use Bayes rule to obtain the *a posteriori* pdf,

Once this *a posteriori* pdf is known, different Bayesian estimates **x̂** could be obtained. For example, the *minimum mean square error* estimate is given by,

while the *maximum a priori* (MAP) estimate is given by,

We note that the maximum likelihood estimate, Eq. (12), is a special case of the MAP estimate, when *f*(**x**) is a uniform pdf over the entire domain of **x**. The use of prior information is essential to solve underdetermined inverse problems, but it also improves the numerical properties, e.g., condition number, of either linear overdetermined or linear square inverse problems.

#### 3.5.1. Bayesian least squares estimation

In least squares estimation, the vector **x** is assumed to be an unknown deterministic variable. However, in Bayesian least squares estimation, it is considered a vector of scalar random variables that satisfies statistical properties given by an *a priori* probability distribution function [5]. In addition, in least squares estimation, the *L*_{2} norm of the measurement error is minimized, while in Bayesian least squares estimation, it is the estimation error, **e=x̂-x**, not measurement error, that is used [5]. Since **x** is assumed to be a random vector, the estimation error **e** will also be a random vector. Therefore, the Bayesian least squares estimate could be obtained by minimizing the condtional mean of the square of the estimation error, given measurement, **y**,

When **x** has a Gaussian distribution and **A** represents a linear system, then measurement **y** will also have a Gaussian distribution. In this case, the Bayesian least squares estimate given by Eq. (16) could be reinterpreted as a regularized least squares estimate given by,

where *μ* is the mean of the *a priori* distribution of **x** [5]. Therefore, a least squares Bayesian estimate is analogous to a regularized least squares estimate, where *a priori* information about **x** is expressed as additional constraints on **x** in the form of a *regularization* term.

#### 3.5.2. Advantages of Bayesian estimation over other estimation methods

Bayesian estimation techniques could be used, given that a reliable *a priori* distribution is known, to obtain an accurate estimate of a signal **x**, even if the number available measurements is smaller than the dimension of the signal to estimated. In this underdetermined case, Bayesian estimation could accurately estimate a signal while un-regularized least squares estimation or maximum likelihood estimation could not. The use of prior information in Bayesian estimation could also improves the numerical properties, e.g., condition number, of either linear overdetermined or linear square inverse problems. This could be understood by keeping in mind the mathematical equivalence between obtaining one scalar measurement related to **x**, and specifying one constraint that **x** has to satisfy. Therefore, as the number of available measurements significantly increases, both Bayesian and maximum likelihood estimates would converge to the same estimate.

Bayesian estimation also could be easily adapted to estimate dynamic signals that change over time. This is achieved by sequentially using past estimates of a signal, e.g., **x**_{t − 1}, as prior information to estimate its current value **x**_{t}. More generally, Bayesian estimation could be easily adapted for *data fusion*, i.e., combination of multiple partial measurements to estimate a complete signal in remote sensing, stereo vision and tomographic imaging, e.g., Positron emission tomography (PET), Magnetic resonance imaging (MRI), computed tomography (CT) and optical coherence tomography (OCT). Bayesian methods could also easily fuse all available prior information to provide an estimate based on measurements, in addition to all known information about a signal.

Bayesian estimation techniques could be extended in straight forward ways to estimate output signals of nonlinear systems or signals that have complicated probability distributions. In these cases, numerical Bayesian estimates are typically obtained using Monte Carlo methods.

#### 3.5.3. Sparsity as prior information for underdetermined Bayesian signal estimation

Sparse signal representation means the representation of a signal in a domain where most of its coefficients are zero. Depending on the nature of the signal, one could find an appropriate domain where it would be sparse. This notion could be useful in signal estimation because assuming that the unknown signal **x** is sparse could be used as prior information to obtain an accurate estimate of it, even if only a small number of measurements are available. The rest of this chapter will focus on using signal sparsity as prior information for underdetermined Bayesian signal estimation.

## 4. Sparse signal representation

As shown in Figure 3, a sinusoid is a dense signal in the time domain. However, it could be represented by a single value, i.e., it has a sparse representation, in the frequency domain.

We note that any signal could have a sparse representation in a suitable domain [15]. A sparse signal representation means a representation of the signal in a domain where most of its coefficients are zero. Sparse signal representations have many advantages including:

A sparse signal representation requires less memory for its storage. Therefore, it is a fundamental concept for signal compression.

A sparse signal representation could lead to simpler signal processing algorithms. For example, signal denoising could be achieved by simple thresholding operations in a domain where the signal is known to be sparse.

Sparse signal representations have fewer coefficients than dense signal representations. Therefore, the computational cost for sparse representations would be lower than for dense representations.

### 4.1. Signal representation using a dictionary

A *dictionary* *φ*_{n}}_{nϵΓ}, indexed by a parameter *n ϵ* Γ equal to the dimension of a signal *f*, where we could represent *f* as a linear combination [16],

If the vectors {*φ*_{n}}_{nϵΓ} are linearly independent, then such dictionary is called a *basis*. Representing a signal as a linear combination of sinusoids, i.e., using a *Fourier* dictionary, is very common. *Wavelet* dictionaries and *Chirplet* dictionaries are also common dictionaries for signal representation. Dictionaries could be combined together to obtain a larger dictionary, where *n ϵ* Γ is larger than the dimension the signal *f*, that is called an *overcomplete* dictionary or a frame.

#### 4.1.1. Signal representation using a basis

A set of vectors form a basis for *V* is a set *X* of linearly independent vectors such that every vector in *V* is a linear combination of elements in *X*. A vector space *V* is finite dimensional if it has a finite number of basis vectors [17].

Depending on the properties of {*φ*_{n}}_{nϵΓ}, bases could be classified into different types, e.g., orthogonal basis, orthonormal basis, biorthogonal basis, global basis and local basis. For an orthogonal basis, its basis vectors in the vector space *V* are mutually orthogonal,

For an orthonormal basis, its basis vectors in the vector space *V* are mutually orthogonal and have unit length,

where *δ*(*m* − *n*) is the Kronecker delta function. For a biorthogonal basis, its basis vectors are not orthogonal to each other, but they are orthogonal to vectors in another basis,

In addition, depending on the domain (support) on which these basis vectors are defined, we could also classify a basis as either global or local. Sinusoidal basis vectors used for the discrete Fourier transform are defined on the entire domain (support) of *f*, so they are considered a *global* basis. Many wavelet basis vectors used for the discrete wavelet transform are defined on only part of the domain (support) of *f*, so they are considered a *local* basis.

#### 4.1.2. Signal representation using a frame

A frame is a set of vectors {*φ*_{n}}_{nϵΓ} that spans *f* from the inner products {⟨*f*, *φ*_{n}⟩}_{nϵΓ}. A frame allows the representation of a signal as a set of frame coefficients, and its reconstruction from these coefficients in a numerically stable way

Frame theory analyzes the completeness, stability, and redundancy of linear discrete signal representations [18]. A frame is not necessarily a basis, but it shares many properties with bases. The most important distinction between a frame and a basis is that the vectors that comprise a basis are linearly independent, while those comprising frame could be linearly dependent. Frames are also called *overcomplete* dictionaries. The redundancy in the representation of a signal using frames could be used to obtain sparse signal representations.

### 4.2. Sparse signal representation as a regularized least squares estimation problem

If designed to concentrate the energy of a signal in a small number of dimensions, an orthogonal basis would be the minimum-size dictionary that could yield a sparse representation of this signal [15]. However, finding an orthogonal basis that yields a highly sparse representation for a given signal is usually difficult or impractical. To allow more flexibility, the orthogonality constraint is usually dropped, and *overcomplete* dictionaries (frames) are usually used. This idea is well explained in the following quote by Stephane Mallat:

*“In natural languages, a richer dictionary helps to build shorter and more precise sentences. Similarly, dictionaries of vectors that are larger than bases are needed to build sparse representations of complex signals. Sparse representations in redundant dictionaries can improve pattern recognition, compression, and noise reduction but also the resolution of new inverse problems. This includes super resolution, source separation, and compressed sensing” [15].*

Thus representing a signal using a particular *overcomplete* dictionary has the following goals [16]

Sparsity—this representation should be more sparse than other representations.

Super resolution—the resolution of the signal when represented using this dictionary should be higher than when represented in any other dictionary.

Speed—this representation should be computed in

*O*(*n*) or*O*(*n log*(*n*)) time.

A simple way to obtain an *overcomplete* dictionary **A** is to use a union of basis **A**_{i} that would result in the following representation of a signal **y**,

where **A** is a *n* × *m* matrix representing the dictionary and **x** are the coefficients representing **y** in the domain defined by **A**. Since **A** represents an *overcomplete* dictionary, the number of its rows will be less than the number of its columns. Eq. (23) is a formulation of the signal representation problem as an underdetermined inverse problem.

To obtain a sparse solution for Eq. (23) one needs to find an *m* × 1 coefficient vector **x̂**, such that,

where ‖**x**‖_{0} is the cardinality of vector **x**, i.e., its number of nonzero elements, and *λ* > 0 is a regularization parameter that quantifies the tradeoff between the signal representation error,**x**‖_{0} [19]. The cardinality of vector **x** is sometimes referred to as the *L*_{0} norm of **x**, even though ‖**x**‖_{0} is actually a *pseudo norm* that does not satisfy the requirements of a norm in *underdetermined* in the case of an *overcomplete* dictionary. Because of the correspondence between regularized least squares estimation and Bayesian estimation, the problem of finding a sparse representation of a signal could be formulated as a Bayesian estimation problem.

## 5. Compressed sensing

Compressed sensing involves the estimation of a signal using a number of measurements that are significantly less than its dimension [20]. By assuming that the unknown signal is sparse in the domain where the measurements were acquired, one could use this sparsity constraint as prior information to obtain an accurate estimate of the signal from relatively few measurements.

Compressed sensing is closely related to *signal compression* that is routinely used for efficient storage or transmission of signals. Compressed sensing was inspired by this question: instead of the typical signal acquisition followed by signal compression, is there a way to acquire (sense) the compressed signal in the first place? If possible, it would significantly reduce the number of measurements and the computation cost [20]. In addition, this possibility would allow acquisition of signals that require extremely high, hence impractical, sampling rates [21]. As an affirmative answer to this question, compressed sensing was developed to combine signal compression with signal acquisition [20]. This is achieved by designing the measurement setup to acquire signals in the domain where the unknown signal is assumed to be sparse.

In compressed sensing, we consider the estimation of an input signal **x∈Rn** from *m* linear measurements, where *m* ≪ *n*. As discussed above, this problem could be written as an underdetermined linear system,

where **y∈Rm** and **A∈Rm×n** represent the measurements and measurement (sensing) matrix, respectively.

Assuming that the unknown signal **x** is *s*-sparse, i.e., **x** ∈ ∑_{s} has only *s* nonzero elements, in the domain specified by the measurement (sensing) matrix **A**, and assuming that **A** satisfies the restricted isometry property (RIP) of order 2*s*, i.e., there exists a constant *δ*_{2s} ∈ (0, 1) such that,

for all **z** ∈ ∑_{2s}, then **x** could be reconstructed from *m* ≥ *s* measurements by different optimization algorithms [20]. When the measurements **y** are noiseless, **x** could be exactly estimated from,

However, when the measurements **y** are contaminated by noise,**x** could be obtained as the regularized least squares estimate,

This minimization problem could also be mathematically reformulated and solved as a Bayesian estimation problem.

## 6. Obtaining sparse solutions for signal representation and signal estimation problems

From Sections 4 and 5 we note that the problem of obtaining a sparse signal representation, Eq. (24) and the problem of sparse signal estimation in compressed sensing, Eq. (28), both have the same mathematical form [11, 22],

In this section, we describe different approaches to solving this minimization problem. From Eq. (29), we note that the first term of its RHS, **x**‖_{0}, represents the cardinality (number of nonzero coefficients) of the unknown signal. The regularization parameter *λ* specifies the tradeoff between these two terms in the objective function. The selection of an appropriate value of *λ* to balance the reconstruction, or fitting error, and signal sparsity is very important. Regularization theory and Bayesian approaches could provide ways to determine optimal values of *λ* [23–26].

Convex optimization problems is a class of optimization problems that are significantly easier to solve compared to nonconvex problems [34]. Another advantage of convex optimization problems is that any local solution, e.g., a local minimum, is guaranteed to be a global solution. We note that obtaining an exact solution for the minimization problem in Eq. (29) is difficult because it is nonconvex. Therefore, one could either seek an approximate solution to this nonconvex problem or approximate this problem by a convex optimization whose exact solution could be obtained easily.

Considering the general regularized least squares estimation problem,

we note that it is a nonconvex optimization problem for 0 ≤ *p* < 1 and a convex optimization problem for *p* ≥ 1. One alternative to approximate Eq. (29) by a convex optimization problem, one could relax the strict condition of minimizing the cardinality of the signal, ‖**x**‖_{0}, by replacing by it by the sparsity-promoting condition of minimizing the *L*_{1} norm of the signal, ‖**x**‖_{1}. Another alternative to approximate Eq. (29) by another nonconvex optimization problem that is easier to solve than the original problem using a Bayesian formulation, is to replace ‖**x**‖_{0} by ‖**x**‖_{p}, 0 < *p* < 1. The minimization of Eq. (30) using ‖**x**‖_{p}, 0 < *p* < 1 would result in a higher degree of signal sparsity compared to when ‖**x**‖_{1} is used. This could be understood visually by examining Figure 4, that shows the shapes of two-dimensional unit balls using (pseudo)norms with different values of *p*.

We explain further details in the following subsections.

### 6.1. Obtaining a sparse signal solution using *L*_{0} minimization

The sparsest solution of the regularized least squares estimation problem, Eq. (29) would be obtained when *p* = 0 in ‖**x**‖_{p}. As shown in Figure 5, the solution of the regularized least squares problem, **x̂**, is given by the intersection of the circles, possibly ellipses, representing the solution of the unconstrained least squares estimation problem and the unit ball using *L*_{0} representing the constraint of minimizing *L*_{0}. In this case of minimizing *L*_{0}, the unconstrained least squares solution will always intersect the unit ball at an axis, this yielding the most possible sparse solution. However, as mentioned earlier, this *L*_{0} minimization problem is difficult to solve because it is nonconvex. Approximate solutions for this problem could be obtained using greedy optimization algorithms, e.g., Matching Pursuits [27] and Least Angle Regression (LARS) [28].

### 6.2. Obtaining a sparse signal solution using *L*_{1} minimization

On relaxing the nonconvex regularized least squares using *L*_{0} minimization problem, by setting *p* = 1, we obtain the convex *L*_{1} minimization problem. As shown in Figure 4(c), the unit ball using the *L*_{1} norm covers a larger area than the unit ball using the *L*_{0} pseudo norm, shown in Figure 4(a). Therefore, as shown in Figure 6, the solution for the regularized least squares problem using the *L*_{1} minimization would be sparse, but it should not be expected to be as sparse as the *L*_{0} minimization problem.

This *L*_{1} minimization problem could be solved easily using various algorithms, e.g., Basis Pursuits [16], Method of frames (MOF) [29], Lasso [30, 31], and Best Basis Selection [32, 33]. A Bayesian formulation of this *L*_{1} minimization problem is also possible by assuming that the *a priori* probability distribution of **x** is Laplacian, **x** ~ *e*^{− |x|}.

### 6.3. Obtaining a sparse signal solution using *L*_{0 − 1} minimization

As discussed above, solving the regularized least squares problem with *L*_{0} minimization should yield the sparsest signal solution. However, only approximate solutions are available for this difficult nonconvex problem. Alternatively, solving the regularized least squares problem with *L*_{1} minimization should yield an exact sparse solution that would be less sparse than in the *L*_{0} case, but it is considerably easier to obtain.

The regularized least squares problem could also be formulated as an *L*_{0 − 1} minimization problem. As ‖**x**‖_{p}, 0 < *p* < 1,that we abbreviate as *L*_{0 − 1}, is not an actual norm, this optimization problem would be nonconvex [34]. The advantage of using *L*_{0 − 1} minimization is that, as shown in Figure 4(b), compared to unit ball using the *L*_{1} norm, the unit ball using the *L*_{0 − 1} pseudo norm has a narrower area that is concentrated around the axes. Therefore, as shown in Figure 7, the *L*_{0 − 1} minimization problem should yield a sparser solution compared to the *L*_{1} minimization problem.

Another advantage of using *L*_{0 − 1} minimization is that this nonconvex optimization problem could be easily formulated as a Bayesian estimation problem that could be solved using Markov Chain Monte Carlo (MCMC) methods. As shown in Figure 8, the product of *student-t* probability distributions has a shape similar to the unit ball using the *L*_{0 − 1} pseudo norm, so *student-t* distributions could be used as *a priori* distributions to approximate the *L*_{0 − 1} pseudo norm.

### 6.4. Bayesian method to obtain a sparse signal solution using *L*_{0 − 1} minimization

As mentioned in Section 3.5, the first step to obtaining one of the many possible Bayesian estimates of **x** is to use Bayes rule to obtain the *a posteriori* pdf,

Using this *a posteriori* distribution, one could obtain a sparse signal solution using *L* _{0 − 1} minimization, as the *maximum a posteriori* (MAP) estimate given by Eq. (15). Compared to other Bayesian estimates, the MAP estimate could be easier to obtain because the calculation of the normalizing constant, ∫*f*(**y**|**x**)*f*(**x**),would not be needed. The maximization of the product of conditional probability distribution of **y** given **x** and the *a priori* distribution of **x** is equivalent to the minimizing of the sum of their negative logarithms,

In the case of white Gaussian measurement noise, *p*(**y**|**x**) ~ *N*_{x} (**Ax**, *σ*^{2}**I**) where *a priori* probability *p*(**x**) corresponding to *L* _{0 − 1} minimization could be represented as a product of univariate *student-t* probability distribution functions [14],

where Γ is the Gamma function, and *ϑ* is the number of degrees of freedom of the *student-t* distribution. Since this *a priori* distribution function is not an exponential function, we would use Eq. (15) instead of Eq. (32) to obtain the MAP estimate.

Because the prior is not a Gaussian distribution, there is no simple closed form expression for the posterior, *p*(**x**|**y**) with a *student-t a priori* probability distribution. However, we could express each *student-t* distribution as an infinite weighted sum of Gaussian distributions, where the hidden variables *h*_{i} determine their variances [14].

where the matrix **H** contains the hidden variables *ϑ*/2, *ϑ*/2). Using this approximation, the *a posteriori* pdf could be written as

The product of two Gaussian distributions is also a Gaussian distribution [35],

where the mean and covariance (**μ**, **Σ**) of the new Gaussian distribution in Eq. (36) is given by,

and *k* is a constant. Therefore, we could simplify the product of two the Gaussian distributions given in Eq. (35) as,

From Eqs. (35) and (38) we could write *p*(**x**|**y**) as,

We still could not compute the integral in Eq. (39) in closed form. However, we could maximize the RHS of Eq. (39) over the hidden variables **H** to obtain an approximation for the *a posteriori* probability distribution function

Eq. (40) would be a good approximation of *p*(**x**|**y**), if the actual distribution over the hidden variables is concentrated tightly around its mode [14]. When *h*_{i} has a large value, its corresponding *i*th component of the *a priori* probability distribution function *p*(**x**) would have a small variance, *i*th component of *p*(**x**) could be set to zero. Therefore, this *i*th dimension of the prior *p*(**x**) would not contribute to the solution of Eq. (30), thus increasing its sparsity.

Since both Gaussian and gamma pdfs in Eq. (40) are members of the exponential family of probability distributions, we could obtain **x̂MAP** by maximizing the sum of their logarithms. Section 3.5 in [11] and Section 8.6 in [14] describe an iterative optimization method to obtain **x̂MAP** from the approximate *a posteriori* probability distribution function given by Eq. (40).

## 7. Conclusion

In this chapter, we described different methods to estimate an unknown signal from its linear measurements. We focused on the underdetermined case where the number of measurements is less than the dimension of the unknown signal. We introduced the concept of signal sparsity and described how it could be used as prior information for either regularized least squares or Bayesian signal estimation. We discussed compressed sensing and sparse signal representation as examples where these sparse signal estimation methods could be applied.