Supervised Sparse Components Analysis with Application to Brain Imaging Data

We propose a dimension-reduction method using supervised (multi-block) sparse (principal) component analysis. The method is first implemented through a basis expansion of the spatial brain images, and the scores are then reduced through regularized matrix decomposition to produce simultaneous data-driven selections of related brain regions, supervised by univariate composite scores representing linear combinations of covariates. Two advantages of the proposed method are that it identifies the associations between brain regions at the voxel level and that the supervision aids interpretation. The proposed method was applied to a study on Alzheimer's disease (AD) involving multimodal whole-brain magnetic resonance imaging (MRI) and positron emission tomography (PET). For illustrative purposes, we demonstrate cases of both single- and multimodal brain imaging and longitudinal measurements.


Introduction
Recently, multiple neuroimaging data sets per subject have become obtainable due to the remarkable development of imaging techniques such as magnetic resonance imaging (MRI) and positron emission tomography (PET), as well as of computer resources and technologies. Vandenberghe and Marsden [1] provide a review of PET and MRI integration technology, such as integrated scanning devices, rather than of data analysis. Other modalities such as diffusion MRI (dMRI) and functional MRI (fMRI) are also useful for collecting brain-related information. These multimodal imaging data sets have the potential to provide rich information about human health and behavior, such as brain function and structure, from different perspectives. From multiple measurements with a single-modal (or multimodal) technique, longitudinal changes in the status and combination of neurobiomarkers† can be observed to support the prediction and early diagnosis of disease and the classification of disease subtypes. Multimodal brain imaging analysis is important in brain-related disease studies, and Arbabshirani et al. [2] review many studies on the subject. Imaging data analysis makes a substantial contribution to the study of mental disorders. Most single-modal or multimodal imaging studies concern dementia leading to Alzheimer's disease (AD) [3] (around 300 of the AD imaging studies surveyed in Ref. [2]). The modalities considered there are structural MRI (sMRI), fMRI, dMRI, fluorodeoxyglucose PET, and amyloid/tau PET.

† Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
In a recent study, Ref. [4] examined sMRI and cerebrospinal fluid (CSF) markers. Magnetoencephalography (MEG) is also useful as an AD biomarker, and its localization using sMRI has high accuracy [5]. Schizophrenia is the second most studied disorder after dementia. Shah et al. [6] provide an example of a multimodal meta-analysis. For Huntington's disease, white matter has been evaluated using dMRI [7]. For mood disorders (depressive disorder and bipolar disorder), Refs. [8,9] review machine learning methods. Moeller and Paulus [10] studied the longitudinal prediction of relapse in substance-related disorders using MRI, fMRI, EEG, and PET. Moser et al. [11] studied schizophrenia and bipolar disorder using multimodal imaging data analysis; dMRI is also effective for analyzing these conditions [12]. For developmental disabilities, Ref. [13] investigated volume reductions in attention-deficit hyperactivity disorder (ADHD) with 1713 participants. Aoki et al. [14] reviewed dMRI studies and conducted a meta-analysis for ADHD. Li et al. [15] provide a review of imaging studies in autism spectrum disorder. For anxiety disorder, Ref. [16] applied a support vector machine (SVM) to multimodal data, using clinical questionnaires, measured cortisol release, and gray and white matter volumes in subjects with generalized anxiety disorder and major depression and in healthy subjects. Steiger et al. [17] investigated cortical volume, diffusion tensor imaging, and network-based statistics using multimodal analysis for social anxiety disorder. For borderline personality disorder, Ref. [18] conducted an imaging-based meta-analysis of 10 studies. In cancer research, especially that on glioblastoma multiforme, multimodal imaging analysis is useful for identifying some types of tumors and evaluating patient prognosis (for more details, see [19]). Genome-related data can also be regarded as a modality; their analysis in combination with imaging data is called imaging genetics [20].
One important technique for single- and multimodal imaging analysis is prediction, which is useful for supporting disease diagnosis and selecting treatments [21]. The SVM is the most widely used method for high-dimensional data analysis, not only in neuroimaging but also in the life sciences in general. The random forest method is also useful due to its capability to capture complex interactions through tree models [22,23]. For multimodal analysis, multiple kernel learning [24] and (multimodal) deep learning [25,26] have been developed. Janssen et al. [27] reviewed machine learning methods for psychiatric prognosis. Related statistical methodology has appeared as multi-omics in bioinformatics, and Ref. [28] reviewed these methods while introducing an R package, mixOmics.
Analysis for such discovery and evaluation is based on the detection of signal buried in noise (irrelevant information). Statistical analysis is useful for this purpose; however, it suffers from the ultrahigh-dimensional and complex structure of these data, and appropriate dimension reduction is therefore required. Even if a machine learning method is used, appropriate input (features) should be specified to obtain interpretable results, because such methods are feasible for high-dimensional problems but not ultrahigh-dimensional ones. Region-of-interest-based analysis has been the leading approach. In contrast, whole-brain analysis is more informative, and if it is combined with a data-driven approach, it can potentially uncover undiscovered knowledge. In [29], by using ReliefF [30], features such as the fractional amplitude of low-frequency fluctuations from resting-state fMRI, segmented gray matter from sMRI, and fractional anisotropy from dMRI were extracted. Component analysis based on low-rank approximation is a successful data-driven approach not only in neuroimaging but also in other biological and medical big data analyses; examples include principal component analysis, partial least squares (PLS), canonical correlation analysis (CCA), independent component analysis (ICA), and nonnegative matrix factorization. These methods are organized into a matrix decomposition framework consisting of score and loading (weight) matrices. The score matrix, whose number of rows equals the number of subjects, is regarded as dimension-reduced data and is suitable for use in statistical models. The weight matrix, whose number of rows equals the number of features in the imaging data, is regarded as a set of basis images. All these methods, except for ICA, have sparse variants derived via regularized matrix decomposition that set small weights to zero, which aids estimation by discarding irrelevant information.
In addition, the resulting weights can be interpreted to mean that the features with nonzero weights contribute to the basis image; specifically, this produces data-driven selections of the brain regions related to each component.
These methods can also be extended in another direction for application to multimodal imaging data. Supplementary information from another data set can be useful for interpreting the output. For this purpose, appropriate data fusion or integration techniques are required, and these are also useful for multisite studies. In neuroimaging data analysis, multimodal CCA (mCCA) [31] and mCCA + joint ICA [32] have been developed in schizophrenia studies. Multivariate data fusion approaches were categorized by [33] into asymmetric and symmetric approaches, with the symmetric approach further divided into blind and semi-blind methods. The asymmetric approach is a regression-type approach and includes specific modalities such as dMRI and electroencephalography. The symmetric approach is a correlation-type approach and allows relationships in both directions. Kawaguchi [19] constructed a risk score for glioblastomas based on MRI data and proposed a two-step dimension-reduction method using a radial basis function and a supervised multi-block sparse principal component analysis (SMS-PCA) method. Kawaguchi and Yamashita [34] proposed a more general framework including PLS and CCA and applied it to MRI, PET, and SNP data sets. Yoshida et al. [4] analyzed imaging and non-imaging data with a network structure by using PLS.
In this chapter, we apply SMS-PCA to MRI and PET data sets and to a longitudinal MRI data set. One of the key features of the analysis is a multi-block technique, which achieves structured dimension reduction with interpretable parameters (weights for each data set and the possibility of combining them). Although it is not the focus of this chapter, the dimension reduction prior to SMS-PCA is conducted using 3D basis functions. Specifically, our dimension reduction takes place in two steps, and, as described in [35], which applied these techniques to a longitudinal study, this two-step approach yields a composite basis-function expression with a flexible shape. The organization of this chapter is as follows. Section 2 describes the methodology of the SMS-PCA, which is applied to real data in Section 3. The characteristics of the method, found through its application, are discussed in Section 4.

Methods
We describe the proposed method in this section. The contents are similar to those in Ref. [19].

Prior dimension reduction
Let S be the n × N matrix whose rows correspond to the vectorized original image data (one image per subject). As the dimensions of the images are the same for each modality m, we use the same basis functions to reduce the dimension from N to q. X = SB is then the n × q matrix, where B is the N × q matrix whose jth column corresponds to the vectorized basis function centered at the jth knot. Note that the knots are pre-specified to span the space equally, as shown in Figure 1. In this example, knots with four-pixel equal spacing are applied.
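As a concrete illustration of this step, the following Python sketch (our own toy example, not the chapter's implementation, which uses R) builds a basis matrix B of Gaussian basis functions centered on an equally spaced knot grid and reduces vectorized images via X = SB. The image size, knot spacing, and Gaussian width here are illustrative assumptions.

```python
# Illustrative sketch of basis-function dimension reduction, X = S B.
# The 12x12x12 volume, 4-voxel knot spacing, and Gaussian width h are
# toy assumptions (the chapter uses full-size brain volumes).
import numpy as np

def gaussian_basis_matrix(shape, spacing=4, h=np.sqrt(3 * 4**2)):
    """Build the N x q matrix B whose jth column is a vectorized
    Gaussian basis function centered at the jth knot."""
    voxels = np.stack(np.meshgrid(*[np.arange(s) for s in shape],
                                  indexing="ij"), axis=-1).reshape(-1, 3)
    knots = np.stack(np.meshgrid(*[np.arange(0, s, spacing) for s in shape],
                                 indexing="ij"), axis=-1).reshape(-1, 3)
    # squared distances between every voxel and every knot
    d2 = ((voxels[:, None, :] - knots[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * h**2))          # N x q

rng = np.random.default_rng(0)
shape = (12, 12, 12)
S = rng.normal(size=(10, np.prod(shape)))    # n = 10 subjects, N = 1728 voxels
B = gaussian_basis_matrix(shape)             # N x q basis matrix
X = S @ B                                    # n x q reduced data
print(X.shape)                               # (10, 27)
```

With this spacing there are 3 knots per axis, so the 1728-voxel image is reduced to q = 27 basis scores per subject.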

Objective function
Dimension reduction using the basis functions is then followed by the SMS-PCA method, which considers (sample) correlations based on the data values. We consider the score t for the n × q matrices X_m, m = 1, 2, …, M, with the following multi-block structure:

t = Σ_{m=1}^{M} b_m X_m w_m = Σ_{m=1}^{M} b_m t_m,  (1)

where w_m is the weight vector for the mth sub-block X_m and b_m is the weight for the super-block. Here, it should be noted that the score t in Eq. (1) is referred to as the super score, whereas t_m = X_m w_m is referred to as the block score. Figure 2 schematically describes the score structure for the case of M = 2.
Thus, the super score has a hierarchical structure for each individual and can be used in an application such as the construction of a diagnosis score.
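The block/super score structure in Eq. (1) can be sketched numerically as follows; this is a minimal Python illustration with made-up data and weights, not the msma implementation.

```python
# Minimal sketch of the multi-block score in Eq. (1): block scores
# t_m = X_m w_m are combined into the super score t = sum_m b_m t_m.
# Shapes and values are toy assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, q, M = 8, 5, 2
X = [rng.normal(size=(n, q)) for _ in range(M)]     # sub-blocks X_m
w = [rng.normal(size=q) for _ in range(M)]          # block weights w_m
w = [wm / np.linalg.norm(wm) for wm in w]           # enforce ||w_m||_2 = 1
b = np.array([0.6, 0.8])                            # super weights, ||b||_2 = 1

t_block = [X[m] @ w[m] for m in range(M)]           # block scores t_m
t_super = sum(b[m] * t_block[m] for m in range(M))  # super score t
print(t_super.shape)                                # (8,)
```

Each subject thus gets one super score per component, built hierarchically from its per-modality block scores.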
When the matrix X_m is normalized by its columns, the weights are estimated, subject to ||w_m||² = 1 and ||b||² = 1 with ||·|| the L2 norm, by maximizing an objective function of the form

L = (1 − μ) t^⊤ t + μ (Z^⊤ t)² − Σ_{m=1}^{M} P_{λ_m}(w_m),  (2)

where Z is the supervising outcome, 0 ≤ μ ≤ 1 is the proportion of the supervision, P_λ(x) is the penalty function applied elementwise and summed [P_λ(x) = 2λ|x| is used in this study], and λ > 0 is the regularization parameter that controls the sparsity. A larger value of the regularization parameter λ_m yields more zero elements in w_m.

Optimization
The algorithm given in Table 1 is used to estimate the weights in Eq. (1) by maximizing L in Eq. (2). The rationale behind this approach is provided in [19].
Note that the deflation step yields multiple components and has several alternatives; that is, by iterating steps 1 to 3 of the algorithm K times, we can obtain K components.
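To make the optimization concrete, here is a deliberately simplified single-block sketch in Python: a power-iteration-style update of the weight vector with soft-thresholding for the L1 penalty, followed by a deflation step to extract further components. This is our own illustrative approximation; the actual multi-block algorithm and update rules are those given in Table 1 and Ref. [19].

```python
# Simplified single-block sparse-PCA-style estimation (an illustrative
# stand-in, not the exact msma algorithm): alternate a power-iteration
# update of w with soft-thresholding, then deflate X for the next component.
import numpy as np

def soft_threshold(v, lam):
    # elementwise L1 proximal step: shrink toward zero by lam
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pca(X, n_comp=2, lam=0.5, n_iter=100):
    X = X.copy()
    weights, scores = [], []
    for _ in range(n_comp):
        w = X.T @ X[:, 0]                      # crude initialization
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            w_new = soft_threshold(X.T @ (X @ w), lam)
            if np.linalg.norm(w_new) == 0:     # penalty removed everything
                break
            w = w_new / np.linalg.norm(w_new)  # keep ||w||_2 = 1
        t = X @ w
        X = X - np.outer(t, w)                 # deflation: X (I - w w^T)
        weights.append(w)
        scores.append(t)
    return np.array(weights), np.array(scores)

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 10))                  # toy data: n = 20, q = 10
W, T = sparse_pca(X, n_comp=2, lam=0.5)
print(W.shape, T.shape)                        # (2, 10) (2, 20)
```

The deflation line projects each extracted direction out of X, so subsequent components capture the remaining structure.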

Parameter selection
The optimal value of λ = (λ_1, …, λ_M)^⊤ is selected by minimizing the Bayesian information criterion (BIC), which is computed from the fit after r deflation steps (the projection of X_m onto the r-dimensional subspace). There are several search strategies for this optimization, and these are introduced in the software options below.
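As a rough illustration of this selection step, the following Python sketch grid-searches λ and picks the minimizer of a BIC-style criterion. The criterion used here (a residual term plus a log(n) penalty on the number of nonzero weights) is a generic stand-in of our own, not the chapter's exact BIC formula.

```python
# Illustrative grid search over the regularization parameter lambda,
# choosing the minimizer of a BIC-style criterion (generic stand-in).
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def fit_component(X, lam, n_iter=100):
    # same simplified sparse-component update sketched above
    w = X.T @ X[:, 0]
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        w_new = soft_threshold(X.T @ (X @ w), lam)
        if np.linalg.norm(w_new) == 0:
            return np.zeros_like(w)
        w = w_new / np.linalg.norm(w_new)
    return w

def bic(X, w):
    n = X.shape[0]
    t = X @ w
    rss = np.sum((X - np.outer(t, w)) ** 2)    # residual after rank-1 fit
    df = np.count_nonzero(w)                   # effective model size
    return n * np.log(rss / n) + np.log(n) * df

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 8))
grid = [0.1, 0.5, 1.0, 2.0]                    # candidate lambda values
best = min(grid, key=lambda lam: bic(X, fit_component(X, lam)))
print(best in grid)                            # True
```

Larger λ zeroes more weights (smaller df) at the cost of a worse residual fit; the BIC-style criterion trades the two off.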

Software
The statistical software R package msma implements the method described in Ref. [34]; the SMS-PCA method is part of the package, and the PLS type can also be implemented. The package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=msma. Four parameter-search methods are available; here, the parameters are λ_m and the number of components. The "simultaneous" method identifies the number of components by searching the regularized parameters in each component. The "regpara1st" method identifies the regularized parameters by fixing the number of components and then searches for the number of components with the selected regularized parameters. The "ncomp1st" method identifies the number of components with a regularized parameter of 0 and then searches for the regularized parameters with the selected number of components. The "regparaonly" method searches for the regularized parameters with a fixed number of components.
In this chapter, the "ncomp1st" method was applied with nonzero sparsity when the number of components was selected because, in our experience, the BIC value suffered from the high dimensionality of the data. The basic R code for this method is as follows:

    tuneparams = optparasearch(X=X, Z=Z, search.method="ncomp1st", maxpct4ncomp=0.5, muX=0.5)

where the argument maxpct4ncomp = 0.5 means that 0.5λ_max is used as the regularized parameter when the number of components is searched, and where λ_max is the maximum of the regularized parameters among the possible candidates. In order to obtain the final fit with the optimized parameters, the following code should be implemented:

    fit1 = msma(X=X, Z=Z, comp=tuneparams$optncomp, lambdaX=tuneparams$optlambdaX, lambdaY=tuneparams$optlambdaY, muX=0.5)

For more details, please see the package manual.

Application
In this section, we apply the SMS-PCA described in the previous section to real data. The data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). We use two types of data sets: baseline measurements with multimodal MRI and PET images and repeatedly measured MRI images.

Data
Baseline imaging data were collected from 106 subjects, with mean ages of 75.2 years for the 54 cognitively normal subjects and 72.9 years for the 27 patients with dementia. This data set was somewhat larger than that of [34] because single-nucleotide polymorphism (SNP) data were not considered in this study and subjects with missing SNP data were included. Table 2 summarizes the characteristics of these patients.
We consider imaging data from two modalities, MRI X_1 and PET X_2, namely, M = 2. The preprocessing method is the same as that used in [34]. For the basis functions, we used knots with four-voxel equal spacing (therefore, h = √(3 × 4²) ≈ 6.93) based on the results of our simulation study. The clinical outcome used for supervision is given by Z = 3.17 × CDR + 0.11 × ADAS13 − 0.57 × MMSE, where CDR is the clinical dementia rating score, ADAS13 is the Alzheimer's disease assessment scale-cognitive subscale, and MMSE is the mini-mental state examination score. These coefficients were the same as in [34]. The SMS-PCA method was applied to the data (X_1, X_2, Z) with parameters μ = 0, 0.25, 0.5, 0.75, and 1.
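For concreteness, the composite outcome Z can be computed as in the following Python sketch; the coefficients are those quoted above, while the subject values are hypothetical.

```python
# Sketch of the supervising composite score Z for the baseline study,
# using the coefficients quoted in the chapter (from [34]); the two
# subjects' CDR/ADAS13/MMSE values are made up for illustration.
import numpy as np

def composite_score(cdr, adas13, mmse):
    return 3.17 * cdr + 0.11 * adas13 - 0.57 * mmse

# hypothetical subjects: columns are (CDR, ADAS13, MMSE)
subjects = np.array([[0.0,  8.0, 29.0],    # cognitively-normal-like profile
                     [1.0, 30.0, 21.0]])   # dementia-like profile
Z = composite_score(*subjects.T)
print(Z)   # higher (less negative) for the dementia-like subject
```

The positive weights on CDR and ADAS13 and the negative weight on MMSE mean that greater impairment yields a larger Z, which is what the supervision targets.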

Results
The original data, with dimensions of 2,122,945 (= 121 × 145 × 121), were reduced to 7162 using the basis functions for each imaging data set. The number of components was selected as 8 for all μ = 0, 0.25, 0.5, 0.75, 1. Figure 3 shows the correlation matrix from the data set with the binary outcome, AD or normal, and the resulting super scores for each μ.
The correlations between the super scores were small except for μ = 1, and for μ = 0, the second component had a high correlation with the outcome. In contrast, for μ > 0, the first component had the highest correlation with the outcome. Table 3 shows the results of the multiple logistic regression model with AD or normal as the outcome and the super scores as predictors for each μ. The numbers of components statistically significant at the 5% level were 3, 4, 3, 3, and 0 for μ = 0, 0.25, 0.5, 0.75, and 1, respectively. The minimum numbers of nonzero subweights were 552, 581, 574, 523, and 1075, respectively. Figure 4 shows the reconstructed subweights Bw_1 and Bw_2 for the MRI and PET data, respectively, overlying a structural brain image, shown for the component most correlated with the binary outcome for each of μ = 0, 0.5, 0.75, and 1. The images for μ = 0.25 were similar to those for μ = 0.5 and 0.75 and are not shown here. Figure 5 shows the reconstructed subweights Bw_1 and Bw_2 overlying a structural brain image and bar plots of the super-weights (bottom right) in the case of μ = 0.5 for all components.
In each component, the negative and positive sides are represented. These can be interpreted by looking at the sign of the super-weight. Most cases remain on one side of 0 (positive or negative), except for components 5 to 8. The super-weights are similar between MRI and PET.
A 10-fold cross-validated ROC analysis (Figure 6A) was conducted to evaluate the diagnostic probabilities estimated from the multivariable logistic regression model whose coefficients and p-values are shown in Table 3. For comparison, the single modalities, MRI (Figure 6B) and PET (Figure 6C), were also analyzed.
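The evaluation step can be sketched in Python as follows: the AUC is computed via the rank (Mann-Whitney) formulation from out-of-fold scores. The fold assignment and the placeholder "model" producing the scores are toy stand-ins of our own, not the chapter's logistic regression fit.

```python
# Sketch of cross-validated ROC evaluation: AUC from out-of-fold scores
# via the Mann-Whitney formulation. Labels, folds, and the placeholder
# scoring rule are toy stand-ins for the chapter's fitted model.
import numpy as np

def auc(y, score):
    """AUC = P(score of a positive > score of a negative); ties count 1/2."""
    pos, neg = score[y == 1], score[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(4)
n = 100
y = (rng.random(n) < 0.5).astype(int)          # toy binary outcome
folds = np.arange(n) % 10                      # 10-fold assignment
score = np.empty(n)
for k in range(10):
    test = folds == k
    # placeholder "model": out-of-fold score loosely tracks the label
    score[test] = y[test] + rng.normal(scale=0.8, size=test.sum())
print(0.0 <= auc(y, score) <= 1.0)             # True
```

Because every score is produced on a held-out fold, the resulting AUC estimates out-of-sample discrimination rather than in-sample fit.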
In the case of the multimodal MRI and PET (Figure 6A), μ = 1 had the highest AUC value (0.984), followed by μ = 0.75 (AUC = 0.880). In the case of the single-modal MRI (Figure 6B), all values were below the AUC values of the multimodal case. In the case of the single-modal PET (Figure 6C), μ = 1 and 0.75 outperformed the multimodal case, and the other values (μ = 0, 0.25, and 0.5) did not.

Data
The second data set was a collection of repeatedly measured imaging data from 68 patients with mild cognitive impairment (MCI). There were two groups: the MCI patients who converted to dementia (cMCI) and the stable MCI patients (not converted to dementia, sMCI). MRI data measured at four time points were used. For the cMCI group, the four time points were those just before the diagnosis of conversion. For the sMCI group, the four time points were taken from the baseline over the entire period of the study. Groups were matched for age, gender, and intracranial volume. Table 4 summarizes the characteristics of these patients at baseline (at the first image observation).

Table 3. Results for multivariable logistic regression analysis.
For the imaging data processing, we used the VBM8 toolbox. For the basis functions, we used knots with four-voxel equal spacing, as in the first study in the previous section. The clinical outcome is given by Z = 0.44 × CDR + 0.12 × ADAS13 − 0.11 × MMSE. The coefficients differ from those in the first study because the target population is different.

Results
The original data, with dimensions of 2,122,945 (= 121 × 145 × 121), were reduced to 7162 using the basis functions for each imaging data set. The number of components was selected as in the first study. Table 5 shows the results of the multiple logistic regression model with cMCI or sMCI as the outcome and the super scores as the predictors for each μ. The numbers of components statistically significant at the 5% level were 2, 3, 3, 3, and 2 for μ = 0, 0.25, 0.5, 0.75, and 1, respectively. The minimum numbers of nonzero subweights were 724, 736, 749, 753, and 852, respectively.
A 10-fold cross-validated ROC analysis (Figure 7) was conducted to evaluate the diagnostic probabilities estimated from the multivariable logistic regression model whose coefficients and p-values are shown in Table 5.
For comparison, a single-modal analysis for each time point was conducted. The fourth time point (MRI4), which is closest to the time of the MCI conversion diagnosis, had the highest AUC values, and these were higher than the multimodal values (Figure 8). Figure 9 shows the first-component subweights Bw_m (m = 1, 2, 3, 4) for the four time points for μ = 0 and 0.5. In the case of μ = 0.5, the hippocampus area was related to the components, and in the case of μ = 0, the parietal lobe was.

Table 5. Results for multivariable logistic regression analysis.

Figure 10 shows the corresponding super-weights. This result should be interpreted carefully. For time 4, the sparsest block weights were obtained, and thus the weight values were larger than those for times 1 to 3; this was balanced by a small super-weight. As a result, the super score for this component has approximately the mean value of the block scores.

Discussion
In this chapter, the SMS-PCA method was introduced and applied to multimodal and repeatedly measured neuroimaging data sets. The first data set consisted of two different types of images, MRI and PET. The second data set consisted of repeated MRI measurements (the same type of image). These imaging data have many voxels per person, which were reduced using the basis functions prior to conducting the SMS-PCA. The multi-block feature of the SMS-PCA provided a further reduction within each block, and a summary was obtained at the super level, where the weights expressed the relationships among the blocks and the scores were used in the prediction model.
One of the key features of the SMS-PCA is that it is supervised, and the proportion of supervision relative to the (self-)variance is parameterized by μ. In each study, the impact of μ was examined. The case of μ = 1 resulted in supervision only, that is, only the correlation between the score and the outcome, without the variance of the score. As in the original PCA, maximizing the variance of the score corresponds to μ = 0, for which correlated variables (voxels) have relatively high weights in each component. Thus, the messy maps of the block weights overlying the brain in the case of μ = 1 were reasonable. In both applications, because μ = 0.25, 0.5, and 0.75 yielded similar results, a relatively large value with μ < 1, or the median value μ = 0.5 as a trade-off, can be selected as optimal.
The analysis of repeatedly measured imaging data was studied in [35], which reduced the imaging dimensions using basis functions but did so independently for each image. In contrast, in this study, the correlation between measurements at different time points is considered; that is, temporal and spatial correlations are considered simultaneously. This approach is limited by the requirement that the number of images be the same for each individual, and this will be improved in future work. In addition, the method introduced in this chapter could incorporate modalities such as network models, which would require summarizing their information into the components. This research is in progress.

Conclusion
Although there is room for improvement in this method, this study showed reasonable results when the method was applied to the dementia studies. In conclusion, this data-driven approach would be helpful for exploratory neuroimaging data analysis.

Author details
Atsushi Kawaguchi†
Faculty of Medicine, Saga University, Saga, Japan
*Address all correspondence to: akawa@cc.saga-u.ac.jp
† For the Alzheimer's Disease Neuroimaging Initiative
© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.