Bayesian Inference and Compressed Sensing

This chapter provides the use of Bayesian inference in compressive sensing (CS), a method in signal processing. Among the recovery methods used in CS literature, the convex relaxation methods are reformulated again using the Bayesian framework and this method is applied in different CS applications such as magnetic resonance imaging (MRI), remote sensing, and wireless communication systems, specifically on multipleinput multiple-output (MIMO) systems. The robustness of Bayesian method in incorporating prior information like sparse and structure among the sparse entries is shown in this chapter.


Introduction
In order to estimate parameters in a signal, one can apply wisdoms of the two schools of thoughts in statistics called the classical (also called the frequentist) and the Bayesian. These methods of computing are competitive with each other at times. The definition of probability is where the basic difference arises from. The frequentist define P(A) as a long-run relative frequency with which A occurs in identical repeats of an experiment, whereas Bayesian defines P(A|B) as a real number measure of the probability of a proposition A, given the truth of the information represented by proposition B. So under Bayesian theory, probability is considered as an extension of logic [1,2]. Probabilities represent the investigator degree of belief-hence it is subjective. But this is not acceptable under classical theory, making it to be not flexible. To add on the differences, under the classical inference, parameters are not random, they are fixed and prior information is absent. But under the Bayesian, parameters are random variables, and prior information is an integral part, and the Bayesian has no excuse for that. Since one is free to invent new estimators or confidence intervals or hypothesis test, adhockery exists and hence frequentist approach lack consistency whereas Bayesian theory is flexible and consistent [1][2][3][4][5][6][7][8][9]. Therefore, Bayesian inference is our main focus applied to a special paradigm in signal processing in this chapter.
After presenting the theoretical frameworks, Bayesian theory, CS, and convex relaxation methods in Section 2, the use of Bayesian inference in CS problem by considering two priors modeling the sparsity and clusteredness is shown in Section 3. In Section 4, we present three examples of applications that show the connection of the two theories, Bayesian and compressive sensing. In Section 5, the conclusion is given briefly.

Theoretical
Using the same framework, consider model M j and a vector of parameter θ. We infer what the model's parameter θ might be, given the data, D, and a prior information I. Using Bayes' theorem, the probability of the parameters θ given model M j , data D, and information I is given by P θjD, M j ,I ÀÁ ¼ PD jθ,M j ,I ÀÁ P θjM j ,I ÀÁ PD jM j ,I ÀÁ; where P(θ|D, M j , I), is posterior probability, P(θ|M j , I) is the non data information about θ, called prior probability distribution function, while P(D|θ, M j , I) is the density of the data conditional on the parameters of the model, called likelihood. P(D|M j , I) is called the evidence of model M j , or the normalizing constant, given by: P(θ|D, M j , I) is the fundamental interest for the first level of inference called model fitting. It is the task of inferring what the model parameters might be given the model and the data. Further, we can do inference on higher level, which is comparing models M j . In the light of prior information I and data D, a given set of models {M 1 , M 2 ,⋯, M n } is most likely to be the correct one. Now focusing on the first level of inference, we can ignore the normalizing constant in (3) since it has no relevance at this level of inference about the parameters θ. Hence we get: The posterior probability is proportional to the Prior probability times the Likelihood. Eq. (5) is called Updating Rule [1,3], in which the data allow us to update our prior views about θ, and as a result, we get the posterior which combines both the data and non-data information of θ.
As an example for a binomial trial, let us have beta distribution as a prior and as a result we get posterior distribution which is beta distribution. Figure 1 shows that the posterior density is taller and narrower than the prior density. It therefore favors strongly a smaller range of θ values, reflecting the fact that we now have more information. That is why inference based on the posterior distribution is superior to the one only based on the likelihood. Now, we first find the maximum of the posterior distribution called maximum a posteriori (MAP). It defines the most probable value for the parameters denoted b θ MP . MAP is related to Fisher's methods of maximum likelihood estimation (MLE), b θ ML .Iff is the sampling distribution of D, then the likelihood function of D:θ ↦ fD jθ ðÞ and the maximum likelihood estimation of θ: But under Bayesian inference, let g be a prior distribution of θ, then the posterior distribution of θ becomes θ ML ↦ fD jθ ðÞ g θ ðÞ fD ðÞ (7) and the maximum a posteriori estimation of θ: Inference based on the posterior is not an easy task since it involves multiple integral, which are cumbersome to solve at times. However, it can be computed in several ways: Numerical optimization (like Conjugate gradient method, Newton method,…), modification of an expectation-maximization algorithm and others. As we can see it from (22) and (8), the difference between MLE and MAP is the prior distribution. The latter can be considered as a regularization of the former. Here we can summarize the posterior distribution by the value of the best fit parameters θ MP and error bars (confidence intervals) on the best fit parameters. Error bars can be found from the curvatures of the posterior. To proceed further, we replace the random variable D and θ by vectors y and x and we assume prior distributions on x in the next section.

Compressive sensing
Compressive sensing (CS) is a paradigm to capture information at lower rate than the Nyquist-Shannon sampling rate when signals are sparse in some domain [10][11][12][13]. CS has recently gained a lot of attention due to its exploitation of signal sparsity. Sparsity, an inherent characteristic of many natural signals, enables the signal to be stored in a few samples and subsequently be recovered accurately.
As a signal processing scheme, CS follows a similar fashion: encoding, transmission/storing, and decoding. Focusing on the encoding and decoding of such a system with noisy measurement, the block diagram is given in Figure 2. At the encoding side, CS combines the sampling and compression stages of a traditional signal processing into one step by measuring few samples that contain maximum information about the signal. This measurement/sampling is done by linear projections using random sensing transformations as shown in the landmark papers by the authors mentioned above. Having said this, let us define the CS problem formally as follows: Find the k-sparse signal vector x ∈ R N provided the measurement vector y ∈ R M , the measurement matrix A ∈ R MÂN and the under-determined set of linear equations as y ¼ Ax; (9) where k ≪ M ≪ N.
One can ask again two of the questions here, in relation to the standard CS problem. First, how should we design the matrix A to ensure that it preserves the information in the signal x? Second, how can we recover the original signal x from measurements y [14]? To address the first question, the solution for the CS problem presented here is dependent on the design of A. This matrix can be considered as a transformation of the signal from the signal space to the measurement space, Figure 3 [15]. There have been different criteria that matrix A should satisfy to have meaningful reconstruction. One of the main criteria is given in [11]. The authors defined the sufficient condition that matrix A should satisfy for the reconstruction of the signal x.
It is called the Restricted Isometric Property (RIP) and it is defined below.
is satisfied, then A fulfills RIP of order k with radius δ k .
An equivalent description of the RIP is to say that all subsets of k columns taken from A are nearly orthogonal (the columns of A cannot be exactly orthogonal since we have more columns than rows) [16]. For example, if a matrix A satisfies the RIP of order 2k, then we can interpret (10) saying that A approximately preserves the distance between any pair of k-sparse vectors. For random matrix A, the following theorem is one of the results in relation to RIP for the noiseless CS problem, provided that the entries of the random matrix A are drawn from some distributions which are given later.

Theorem 1. (Perfect Recovery Condition, Candes and Tao [13])
If A satisfies the RIP of order 2k with radius δ 2k , then for any k-sparse signal x sensed by y=Ax, x is with high probability perfectly recovered by the ideal program This means, if A satisfies the RIP of order k with radius δ k , then for any k 0 < k, A satisfies the RIP of order k 0 with constant δ k 0 < δ k [?]. Note that, this theorem is stated for the noiseless CS problem and it is possible to extend it for the noisy CS system. The proof of these theorems is deferred to the literature mentioned, [13], in for the sake of space.
Under conventional sensing paradigm, the dimension of the original signal and the measurement should be at least equal. But in CS, the measurement vector can be far less than the original. While at the decoding side, reconstruction is done using nonlinear schemes. Eventually, the reconstruction is more cumbersome than the encoding which was only projections from a large space to a smaller space. On the other hand, finding a unique solution that satisfies the constraint that the signal itself is sparse or sparse in some domain is complex in nature. Fortunately, there are many algorithms to solve the CS problem, such as iterative methods such as greedy iterative algorithms [17] and iterative thresholding algorithms [18]. This chapter focuses merely on the convex relaxations methods [12,13]. The regularizing terms in these methods can be reinterpreted as prior information under Bayesian inference. We consider a noisy measurement and apply convex relaxation algorithms for robust reconstruction.

Convex relaxation methods for CS
Various methods for estimating x may be used. We have the least square (LS) estimator in which no prior information is applied: which performs very badly for the CS estimation problem we are considering. In order to incorporate the methods called convex relaxation, let us define an important concept first.

Definition 3. (Unit Ball)
A unit ball in l p -space of dimension N can be defined as Unit balls corresponding to p = 0, p = 1/2, p =1,p =2,p = ∞, and N = 2, the balls are shown in Figure 4.
The exact solution for the noiseless CS problem is given by However, minimizing l 0 -norm is a non-convex optimization problem which is NP-hard [19]. By relaxing the objective function to convexity, it is possible to get good approximation. That is, replacing the l 0 -norm by the l 1 -norm, one can find a problem which is tractable. Note that it is also possible to use other l p -norms to relax the condition given by l 0 . However, keeping our focus on l 1 -norm, consider the minimization problem instead of (14).
The solution of the relaxed problem (15) gives the same as that of (14) and this equivalence was provided by Donoho and Huo in [20].  If A satisfies the RIP of order 2k with radius δ 2k < ffiffi ffi is equivalent to (11) and will find the same unique b x.
Justified by this theorem, (15) is an optimization problem which can be solved in polynomial time and the fact that it gives the exact solution for the problem (14) under some circumstance has been one of the main reasons for the recent developments in CS. There is a simple geometric intuition on why such an approach gives good approximations. Among the l p -norms that can be used in the construction of CS related optimization problems, only those which are convex give rise to a convex optimization problem which is more feasible than the non-convex counter parts, which means l p -norms with only p ≥ 1 satisfy such a condition. On the other hand, l p -norms with p > 1 do not favor sparsity, for example, l 2 -norm minimization tends to spread reconstruction across all coordinates even if the true solution is sparse. But l 1 -norm is able to enforce sparsity. The intuition is that l 1 -minimization solution is most likely to occur at corners or edges, not faces [21,45]. That is why l 1 -norm became famous for CS. Further, in CS literature, convex relaxation is presented as either l 2 -penalized l 1 -minimization called Basis Pursuit Denoising (BPDN) [22] or l 1 -penalized l 2 -minimization called least absolute shrinkage and selection operator (LASSO) [45], which are equivalent and effective in estimating a high-dimensional data.
Usually real world systems are contaminated with noise, w, and in this chapter, the focus is on such problems. The noisy recovery problem becomes a simple extension of (15), where E is a bound on | |w| | 2 . The real problem for (17) is stability. Introducing small changes in the observations should result in small changes in the recovery. We can visualize this using the balls shown in Figure 5.
Both the l 0 and l 1 -norms give exact solutions for the noise-free CS problem while giving a close solution for the noisy problem. However, the l 2 -norm gives worst approximation in both cases compared to the other l p -norms with p < 2 (see Figure 5). Moreover, (17) is equivalent to an unconstrained quadratic programming problem as as it will be shown later as LASSO, where γ is a tuning parameter. The equivalency of (17) and (18) is shown in [23,24]. In this chapter, the generalized form of the minimization problem in (18) with different l p -norm regularization is considered, that is, Further, this chapter provides the use of Bayesian framework in compressive sensing by incorporating two different priors modeling the sparsity and the possible structure among the sparse entries in a signal. Basically, it is the summary of the recent works [2,[25][26][27].

Bayesian inference used in CS problem
Under Bayesian inference, consider two random variables x and y with probability density function (pdf) p(x)andp(y), respectively. The product rule gives us p(x,y)=p(x|y)p(y)=p(y|x)p(x)and Bayes' theorem provides p xjy ðÞ ¼ p yjx ðÞ p x ðÞ p y ðÞ : Further, the maximum a posteriori (MAP), b x MP , is defined as MAP is related to Fisher's methods of maximum likelihood estimation (MLE), b x ML : Figure 5. l p -norm approximations: the constraints for the noise-free CS problem is given by the bold line while the shaded region is for the noisy one.
Bayesian Inference and Compressed Sensing http://dx.doi.org/10.5772/intechopen.70308 As we can see it from (21) and (22), the difference between MAP and MLE is the prior distribution. The former can be considered as a regularized form of the latter. Since we apply Bayesian inference we assume further different prior distributions on x.

Sparse prior
The estimators of x resulting from (19) for the sparse problem we consider in this chapter, can be presented as a maximum a posteriori (MAP) estimator under the Bayesian framework as in [28]. We show this by defining a prior probability distribution for x on the form where the regularizing function f: χ!R is some scalar-valued, non negative function with χ ⊆ R which can be expanded to a vector argument by such that for sufficiently large u, ð x ∈ R N exp Àuf x ðÞ ðÞ dx is finite. Furthermore, let the assumed variance of the noise be given by σ 2 ¼ λ u ; where λ is the system parameter which can be taken as λ = σ 2 u.
Since the pdf of the noise w is gaussian, the likelihood function of y given x is given by Together with (20) and (23) which is the LMMSE estimator. But we ignore this estimator in our analysis since the results are not sparse. However, the following two estimators are more interesting for CS problems since they enforce sparsity into the vector x.
2. LASSO estimator: when f(x)=| |x| | 1 we get the LASSO estimator and (26) becomes, 3. Zero-norm regularization estimator: when fx=∥x∥ 0 , we get the zero-norm regularization estimator and (26) becomeŝ As mentioned earlier, (29) is the best solution for estimation of the sparse vector x, but is NPcomplete. The worst approximation for the sparse problem considered is the L2-regularization solution given by (27). However, the best approximation is given by Eq. (28) and its equivalent forms. We have used some of the algorithms in literature in our simulation which are considered as equivalent to this approximations such as Bayesian compressive sensing (BCS) [29] and L1-norm regularized least-squares (L1-LS) [11][12][13].

Clustering prior
The entries of the sparse vector x may have some special structure (clusteredness) among themselves. This can be modeled by modifying the previous prior distribution. 1 We use another penalizing parameter γ to represent clusteredness in the data. For that we define the clustering using the distance between the entries of the sparse vector x by and we use a regularizing parameter γ. Hence, we define the clustering prior to be The new posterior involving this prior under the Bayesian framework is proportional to the product of the three pdfs: p xjy ðÞ ∝ p yjx ðÞ p x ðÞ q x ðÞ : By similar argument s as used in 3.1, we arrive at the clustered LASSO estimator x Clu-Lasso ¼ arg min Here λ, γ are our tuning parameters for the sparsity in x and the way the entries are clustered, respectively.

Bayesian inference in CS applications
Compressed sensing paradigm has been applied to many signal processing areas [31][32][33][34][35][36][37][38][39][40][41]. However, at this time, building the hardware that can translate the CS theory into practical use is very limited. Nonetheless, the demand for cheaper, faster, and efficient devices will motivate the use of CS paradigm in real-time systems in the near future.
So far, in image processing, one can mention the single-pixel imaging via compressive sampling [31], in magnetic resonance imaging (MRI) for reducing scan time and improved image quality [32], in seismic images [33], and in radar systems for simplifying hardware design and to obtain high resolution [34,35]. In communication and networks, CS theory has been studied for sparse channel estimation [36], for under water acoustic channels which are inherently sparse [37], spectrum sensing in cognitive radio networks [38], for large wireless sensor networks (WSNs) [39], as a channel coding scheme [40], localization [41] and so on. A good CS application literature review is provided in [21], which basically is the summary of the bulk of literatures given at http://dsp.rice.edu/cs.
In this chapter, there are examples of CS theory applications using Bayesian inference in imaging, like magnetic resonance imaging (MRI) and, in communication, i.e., multiple-input multiple-output (MIMO) systems, and in remote sensing. First, let us see the impact of the estimators derived above, LASSO and clustered LASSO, in MRI.

Magnetic resonance imaging (MRI)
MRI images are usually very weak due to the presence of noise and due to the weak nature of the signal itself. Compressed sensing (CS) paradigm can be applied in order to boost such signal recoveries. We applied CS paradigm via Bayesian framework, that is, incorporating the different prior information such as sparsity and the special structure that can be found in such sparse signal recovery method is applied on different MRI images.

Angiogram image
Angiogram images are already sparse in the pixel representation. An angiogram image taken from the University Hospital Rechts der Isar, Munich, Germany [42] is used for our analysis.
The image we took is sparse and clustered even in the pixel domain. The original signal after vectorization is x of length N = 960. By taking 746 measurements, and maximum number of non-zero elements k = 373, we applied different reconstruction schemes and the results are shown in Figure 6.

Phantom image
Another MRI image considered is the Shepp-Logan phantom which is not sparse in spatial domain. However, we sparsified it in K-space by zeroing out small coefficients. We then measured the sparsified image and added noise. The original signal after vectorization is x of length N = 200. By taking 94 measurements, that is, y is of length M = 94, and maximum number of non-zero elements k = 47, we applied different reconstruction algorithms used above. The result shows that clustered LASSO does well compared to the others as can be seen in Figure 7.

fMRI image
Another example to apply the clustered LASSO based image reconstruction using Bayesian framework to medical images is a functional magnetic resonance imaging (fMRI), a noninvasive technique of brain mapping, which is crucial in the study of brain activity. Taking many slices in fMRI data, we saw how these data sets are sparse in the Fourier domain. This is

Bayesian Inference 270
shown in Figure 8. We observed the whole data in this domain for the whole brain image. They all share the characteristics we have based our analysis, i.e., sparsity and clusteredness. Then we took some slices which are consecutive in the slice order and took different N, k, and M=2k, on these slices. We can see the numbers at the top of Figure 9, in which the two numbers represent k and N, respectively.
In fMRI, results are compared using image intensity which gives a good ground for a health practitioner to observe and decide in accordance to the available information. The more one have prior knowledge on how the brain regions work in human beings or pets the better priors that one incorporate to analyze the data. So this is an interesting tool for researchers in the future.

MIMO systems
Multiple-input multiple-output (MIMO) systems are integrated in modern wireless communications due to their advantage in improving performance with respect to many performance metrics. One such advantage is the ability to transmit multiple streams using spatial multiplexing, but channel state information (CSI) at the transmitter is needed to get optimal system performance.
Consider a frequency division duplex (FDD) MIMO system consisting of N t transmit and N r receive antennas. Assume that the channel is a flat-fading, temporally correlated channel denoted by a matrix H n ½ ∈ C N r ÂN t where n indicates a channel feedback time index with block fading assumed during the feedback interval. The singular value decomposition (SVD) of H[n]gives where U ∈ C N r Âr and V ∈ C N t Âr are unitary matrices and Σ ∈ C rÂr is a diagonal matrix consisting of r = min(N t , N r ) singular values. In the presence of perfect channel state information (CSI), a MIMO system model can be given by the equatioñ wherex ∈ C rÂ1 is transmitted vector, V[n] is used as precoder at the transmitter, U H [n] is used as decoder at the receiver, n ∈ C N r Â1 denotes a noise vector whose entries are i.i.d. and distributed according to CN 0; 1 ðÞ andỹ ∈ C N r Â1 is the received vector.
Channel adaptive transmission requires knowledge of channel state information at the transmitter. In temporally correlated MIMO channels, the correlation can be utilized to reduce feedback overhead and improve performance. CS methods and rotative quantization are used to compress and feedback the CSI for MIMO systems [43]. This was done as an extension work of [44]. It is shown that the CS-based method reduces feedback overhead while delivering the same performance as the direct quantization scheme, using simulation.
Three methods are compared in the simulations, perfect CSI, without CS and with CS using matched filter (MF) and minimum mean square error estimator (MMSE) receivers for different total feedback bits B = 10 and B =5 .I nFigure 10, sum rates are compared against signalto-noise-ratio (SNR). Using CS, half of the number of bits can be saved. In Figure 11, where the bit-error-rate is plotted against SNR, the CS method has a better bit error rate performance using same number of bits for the CS and without CS cases. These two figures demonstrate the clear advantage of using CS in feedback of singular vectors in rotative based method. The detail is deferred to [43].

Remote sensing
Remote sensing satellites provide a repetitive and consistent view of the Earth and they offer a wide range of spatial, spectral, radiometric, and temporal resolutions. Image fusion is applied to extract all the important features from various input images. These images are integrated to form a fused image which is more informative and suitable for human visual perception or computer processing. Sparse representation has been applied to fuse image to improve the quality of fused image [45].  Figure 10. Sum rate vs. SNR for a 2Â2 MIMO system with and without CS with two streams. We can observe that the performance of the CS method is almost equal to that of the method without using CS while saving half the number of bits.  To improve the quality of the fused image, a remote sensing image fusion method based on sparse representation is proposed in [46]. In these methods, the source images were represented with sparse coefficients first. Then, the larger values of sparse coefficients of panchromatic (Pan) image are set to 0. Thereafter, the coefficients of panchromatic (Pan) and multispectral (MS) image are combined with the linear weighted averaging fusion rule. Finally, the fused image is reconstructed from the combined sparse coefficients and the dictionary. The proposed method is compared with intensity-hue-saturation (IHS), Brovey transform (Brovey), discrete wavelet transform (DWT), principal component analysis (PCA) and fast discrete curvelet transform (FDCT) methods on several pairs of multifocus images. The proposed method using sparse representation outperforms, see Figure 12, better than the usual methods listed here. We believe that our method of clustered compressed sensing can also further improve this result.

Conclusions
In this chapter, a Bayesian way of analyzing data on CS paradigm is presented. The method assumes prior information like the sparsity and clusteredness of signals in the analysis of the data. Among the different reconstruction methods, the convex relaxation methods are redefined using Bayesian inference. Further, three CS applications are presented: MRI imaging, MIMO systems, and remote sensing. For MRI imaging, the two different priors are incorporated, while for MIMO systems and remote sensing, only the sparse prior is applied in Figure 12. Comparison of image fusion methods for remote sensing applications using Brovey, DWT, PCA, FDCT, and the sparse representation methods [46]. the analysis. We suggest that including the special structure among the sparse elements of the data can be included in the analysis to further improve the results.