## 1. Introduction

Neuroimaging has meant a major breakthrough for the diagnosis and treatment of neurodegenerative diseases. Not so long ago, biomedical signal processing was limited to filtering, modelling or spectral analysis, prior to visual inspection. In the past decades, a number of powerful mathematical and statistical tools have been developed and evolved together with an increasing development and use of neuroimaging. Structural modalities such as computed tomography (CT) or the widely known magnetic resonance imaging (MRI), and later functional imaging techniques such as positron emission tomography (PET) and single photon emission computed tomography (SPECT) provide unprecedented insight in the internals of the brain, allowing the study of the structural and functional changes that can be linked to neurodegenerative diseases. This means a huge amount of data where automatic tools can help to identify patterns, reduce noise and enhance our knowledge of the brain functioning.

Computer-aided diagnosis (CAD) systems in neuroimaging include a variety of methods that range from preprocessing of the images (just after acquisition) to advanced machine-learning algorithms to identify disease-related patterns. Algorithms used in the reconstruction of medical imaging, such as the tomographic reconstruction (TR) or the filtered back-projection (FBP) lay outside the scope of this chapter, focused on the application of CAD systems to neuroimaging.

This chapter starts with an exposition of the preprocessing methods used in different neuroimaging modalities, including registration, normalization and segmentation. We provide references on the algorithms behind well-known pieces of software such as statistical parametric mapping (SPM) [1], FreeSurfer [2] or the FMRIB Software Library (FSL) [3]. Later, the most used computer-aided diagnosis systems in psychiatry, psychology and neurology are described. These include SPM [1] and voxel-based morphometry (VBM) [4], voxels as features (VAFs) [5] and how the computation of regions of interest (ROIs) work in semiquantitative analysis. In the next section, new advances in neuroimaging analysis are presented, starting with the basis of machine learning and classification, including support vector machines (SVMs) [5–8], but also logistic regression [9, 10] or classifier ensembles [11, 12]. Given the characteristics of neuroimaging data, where we study large, possibly correlated, data, the extraction of higher-level features is essential. Therefore, in the last section, we provide an introduction to commonly used image decomposition algorithms such as principal component analysis (PCA) [8, 13–18] and independent component analysis (ICA) [19–22]. Finally, other recent feature extraction algorithms including spatial and statistical methods such as texture analysis [23–31], morphological tools [31–33] or artificial neural networks [34–40] are presented.

## 2. Preprocessing

Preprocessing of the neurological images is a fundamental step in CAD systems as it ensures that all the images, either structural or functional, are comparable. We consider a preprocessing step, an algorithm that, applied after the acquisition and reconstruction of the images–usually a machine-dependent procedure–is intended to produce directly comparable images that represent a certain magnitude. The number and type of procedures to follow in preprocessing differs from one modality to another, although normalization and smoothing are used throughout all of them (see **Figure 1**).

### 2.1. Spatial normalization or registration

The anatomy of every subject’s brain is slightly different in shape and size. In order to compare images of different subjects, we need to eliminate these particularities and transform the images so that the subsequent group analysis or comparison can be performed. To do so, the individual images are mapped from their individual subject space (current anatomy) to a reference space, a common anatomical reference that allows the comparison. This procedure is known as *spatial normalization* or *registration*.

There are a number of algorithms used in image registration, but the procedure usually involves the computation of a series of parameters to map the source images to a template that works as a common anatomical reference (see Section 2.1.2 for an overview on registration algorithms). The most widely used template is the Montreal Neurological Institute (MNI) template.

#### 2.1.1. The MNI space and template

The MNI space is the most widely used space for brain registration and was recently adopted by the International Consortium for Brain Mapping (ICBM) as its standard template. It defines a standard three-dimensional (3D) coordinate system (also known as ‘atlas’), which is used to map the location of brain structures independently of the size and shape of each subject’s brain.

The MNI space was intended to replace the Talairach space, a system based on a dissected and photographed brain for the Talairach and Tournoux atlas. In contrast to this, the MNI created a new template that was approximately matched to the Talairach brain but using a set of normal MRI scans. The current standard MNI template is the ICBM152 [41], which is the average of 152 normal MRI scans that have been matched to an older version of the MNI template using a nine-parameter affine transform.

#### 2.1.2. Registration algorithms

Algorithms used in registration can be categorized in *linear* transformations (being the *affine* transform the most complex) and *non-rigid* or elastic transformations. Affine transformations are applied as a matrix multiplication and include terms for translation, scale, rotation, shear, squeeze and so on. These lineal transformations are applied globally to the image and therefore do not account for local geometric differences. Most neuroimaging software includes some kind of affine registration, including FreeSurfer [2], FSL (via package FLIRT) [3] or Elastix.

The estimation of the parameters is performed via the optimization of a given cost function, the minimum-squared difference between the source image and the template being the most basic. Modern software include more refined functions, for example, Tukey’s biweight function (in mri_robust_template of FreeSurfer) [11], or the mutual information (in FLIRT) [42], that operate under a high-complexity schema involving local and global multiresolution optimization. When working with images of the same modality, the preferred cost function is the minimum-squared difference between the source image and the template, whereas in the case of multimodal registration, the maximization of the mutual information is preferred.

Non-rigid transformations can apply local transformation to align the target image to the template. Many of these non-rigid transformations are applied as a local fine-tuning after a previous affine transformation, although some of them use higher-complexity models that do not need this previous step. Some non-rigid transformations include radial basis functions (RBFs) (thin plate or surface splines, multiquadrics and compactly supported transformations), physical continuum models and large deformation models (diffeomorphisms). Of these, the most popular are diffeomorphic transformations, which feature the estimation and application of a warp field, and they are used in SPM (default) [1], FreeSurfer [2], FSL (FNIRT) [3], Elastix or ANTs.

#### 2.1.3. Co-registration

Sometimes, we use different modalities, usually a functional and a structural image, for the same subject. It would be therefore very useful that these two (or even more) different images spatially match each other, so that any processing can be fit to any of them and applied to all. Since functional images have low resolution, the procedure for performing spatial normalization frequently involves co-registration to the structural image, which has a higher level of detail. Then the spatial normalization (or registration) parameters are estimated on the structural image and applied to all the co-registered maps. In the context of low-resolution functional imaging, affine co-registration to the structural image is preferred.

### 2.2. Smoothing

Despite the spatial normalization applied to the images, differences between subjects are still a problem that can reduce the signal-to-noise ratio (SNR) of the neuroimaging data. This problem increases as the number of subjects increases. To increase the SNR, it is recommended to filter out the highest frequencies, that is, applying a *smoothing*. Smoothing removes the smallest scale changes between voxels, making the detection of large-scale changes easier.

On the other hand, smoothing the images lowers its resolution. Smoothing is usually applied using a 3D Gaussian kernel to the image, determined by its full-width at half maximum (FWHM) parameter, which is the diameter of the smoothing kernel at half of its height. The choice of the size of the kernel is therefore of fundamental importance, and it depends on the signal to be detected. A small kernel will make our further processing confound noise with activation signal, but a larger kernel can parse out significant signal. As a general rule, the size of the kernel should be smaller than the activation to be detected.

### 2.3. Functional MRI-specific steps

The acquisition of functional MRI (fMRI) involves a more complex preprocessing of the images, given its dynamic features. fMRI studies acquire a long sequence of low-resolution MRI images that contain a magnitude known as blood-oxygen-level-dependent (BOLD) contrast. This sequence of three-dimensional volumes is combined into a four-dimensional (4D) volume that conceptually works similar to a video. In this context, the outcome can be much more affected by the subject motion. Therefore, procedures such as slice-timing correction and motion correction are mandatory in fMRI.

#### 2.3.1. Slice-timing correction

In fMRI, the scanner acquires each slice within a single brain volume at different times. Different methods are used to acquire the slices: descending order (top to down), ascending order (bottom to up) and interleaved (the slices are acquired in a certain sequence). The time interval between one slice and another is usually small, but after acquiring a whole brain volume, there might be a difference of several seconds between the first and the last acquisition.

To compensate for the time differences between slice acquisition within a single volume, a *slice-timing correction* is performed by temporally interpolating the slices so that the resulting volume is approximately equivalent to acquiring the whole brain at the same time point. These time differences are especially important in event-driven experimental design, where timing is very relevant. Linear, quadratic or spline functions are used to interpolate the slices.

#### 2.3.2. Motion correction

*Motion correction*, also known as *realignment*, accounts for the undesirable effect of head movement during the acquisition of the data. It is almost impossible for a subject to lie perfectly still during the acquisition of an fMRI sequence, and even the small movements can lead to variation in the BOLD values that, if uncorrected, can lead to activation of false positives.

The correction of this problem usually uses a rigid-body transformation similar to those used in registration. In this case, a model characterized by six parameters that account for translation and rotation is frequently used. The parameter estimation is performed by minimizing a cost function, such as correlation or mutual information, between the volumes and a reference time volume, usually the mean image of all time points.

Sometimes, the movement of the head is so fast that motion correction cannot correct its effects. In that case, the most used approach is to eliminate the images acquired during that fast movement, using an artefact detection algorithm that identifies large variations between images at adjacent time points.

### 2.4. Intensity normalization

Most functional neuroimaging modalities, in contrast to unitless structural MRI images, are the representation of the distribution of a certain contrast over the brain. There exist a larger number of sources of variability that can affect the final values: contrast uptake, radiotracer decay time, metabolism, and so on. In order to establish comparisons between subjects, an *intensity normalization* procedure is required.

Intensity normalization methods are to be linear in nature, since it is essential to maintain the intensity ratio between brain regions, acting on the whole brain. In its simplest form, it consists of a division by a constant. This parameter is often estimated [6, 7] as the average value of the 95th bin of the histogram of the image, that is, the average of the 5% higher-intensity values, in what is known as the *normalization to the maximum*. Another approach, called *integral normalization*, estimates this parameter as the sum of all values of the image. A more complex approach requires *a priori* knowledge of the distribution of intensity in a normal subject. This is designed so that the whole image is divided by the binding potential (BP) [43], a specific-to-non-specific ratio between the intensities in areas where the tracer should be concentred and the non-specific areas.

Then, we have general linear transformations, defined as *Y* = *aX* + *b*. These procedures use estimates of the probability density function (PDF) of the source images and then estimate the parameters *a* and *b*, which transform their original PDF to an expected range. Methods to estimate the PDF parameters range from the simplest, non-parametric histogram [44] to more advanced estimates such as an analysis of covariance (ANCOVA) approach used in SPM [44, 45], or parametric estimates involving the Gaussian or the more general alpha-stable distribution [46], which has been recently tested with great success.

Structural modalities also suffer from some sources of intensity variability, for example, magnetic field inhomogeneity, noise, evolution of the scanners, and so on. Field inhomogeneity causes distortions in both geometry and intensity of the MR images [47], usually addressed via increasing the strength of the gradient magnetic field or preprocessing. Intensity variability is especially noticeable in multicentre-imaging studies, where images should share certain characteristics. To improve the homogeneity of a set of structural images acquired at different locations, the use of quantitative MRI images has been recently proposed [48]. In contrast to typical unitless *T*1-weighted images, quantitative imaging can provide neuroimaging biomarkers for myelination, water and iron levels that are absolute measures comparable across imaging sites and time points.

### 2.5. Segmentation

Segmentation, mostly of structural MRI images, involves a series of algorithms aimed at constructing maps of the distribution of different tissues. The general approach is to separate the image in three different maps containing grey matter (GM), white matter (WM) and cerebrospinal fluid (CSF), although some software can also output data for bone, soft tissue or very detailed functional regions and subregions [49–51]. The procedure is applied after all aforementioned steps, including field inhomogeneity correction, which is essential for a correct segmentation. Here, we provide insight on the most used algorithms for segmentation

#### 2.5.1. Statistical parametric mapping

In SPM, the segmentation procedure uses an expectation-maximization (EM) algorithm to obtain the parameters corresponding to a mixture of Gaussians that represents the tissue classes. Afterwards, an affine transformation is applied using tissue probability maps that are in the ICBM/MNI space. It currently takes normalized MRI images and extracts up to six maps: GM, WM, CSF, bone, soft tissue and air/background.

#### 2.5.2. FMRIB software library

Two algorithms are used in FSL to perform segmentation: the FMRIB Automated Segmentation Tool (FAST) and the FMRIB Integrated Registration and Segmentation Tool (FIRST).

FAST is based on a hidden Markov random field model optimized through the EM algorithm. It firstly registers the brain volume to the MNI space and then segments the volume into three tissue classes (GM, WM and CSF). Skull-stripped versions of the anatomical image are needed.

On the other hand, FIRST is intended to extract subcortical structures of the brain characterized by parameters of surface meshes and point distribution models located in a database, built using manually segmented data. The source images are matched to the database, and the most probable structure is extracted based on the shape of the image.

#### 2.5.3. FreeSurfer

FreeSurfer uses surfaces to perform posterior analysis, such as cortical thickness estimation. Therefore, the main aim of the command reckon_all, which performs most preprocessing steps, is not to obtain new image maps containing tissues, but surfaces that identify the different areas in the brain.

After registration to the MNI305 space, voxels are classified into white mater and other based on their location, intensity and local neighbourhood intensities. Afterwards, an estimation of the bias field is performed using these selected voxels, to correct the image. Next, the skull is stripped using a deformable template model [49]. The hemispheres are separated using cutting planes based on the expected MNI305 location of the corpus callosum and pons, and several algorithms that detect these shapes in the source images. The surfaces for each hemisphere are estimated first using the outside of the white matter mass, and then refined to follow the intensity gradients between WM and GM, and GM and CSF, called the pial surface, which will allow the estimation of cortical thickness [50].

FreeSurfer also implements a volume-based stream designed to obtain cortical and subcortical tissue volumes and label them. This stream performs a different, pathology-insensitive affine registration to the MNI305 space, followed by an initial volumetric labelling. Then, the intensity bias is corrected, and a new non-linear alignment to a MNI305 atlas is performed. The labelling process is a very complex one and is more thoroughly explained in Ref. [51].

## 3. Basic analyses

After performing a preprocessing of the source images, many procedures can be applied to extract the information required for clinical practice. In this section, we focus on those analyses that are more extended in clinical practice. These are currently preferred by medical staff, since they are easily interpretable and require little knowledge of computer science. Nevertheless, they are computer-aided systems that yield significant information to assist in the procedure of diagnosis. In later sections, we develop the application of more advanced systems that make use of machine learning to help in the same procedure.

### 3.1. Analysis of regions of interest (ROIs)

The analysis of *regions of interest (ROIs)* is the most basic computationally aided analysis of neuroimaging. It involves either purely manual or computer-assisted delimitation of regions in both structural and functional imaging and a posterior analysis of the delimited volumes.

A number of analyses can be performed on these regions, depending on the image modality. In the case of structural MRI, a frequent approach is the estimation of the volume of cortical and subcortical structures, called *morphometry*. The delimitation of ROIs is often performed automatically and then manually refined and allows the quantification of diseases that alter the normal distribution and size of GM and WM. This is the case of brain atrophy, a common issue in dementia, or brain tumours.

This ROI analysis is hardly used in fMRI, but very extended in nuclear imaging (PET or SPECT). Since the maps obtained with these techniques quantify the uptake of certain drugs, the total uptake can be obtained as the sum of intensities inside the drawn volume. However, the most used measure in these modalities is a ratio between the intensities in specific and non-specific areas, especially with drugs that bind to specific targets such as dopamine, amyloid plaques, and so on.

#### 3.1.1. Cortical thickness

A specific case of a fully computer-assisted ROI analysis is the *cortical thickness* measure performed in FreeSurfer [50]. As aforementioned, once the GM-WM and GM-CSF surfaces have been estimated, the amount of GM in a direction perpendicular to the surface can be estimated. Combined with the subcortical segmentation algorithms, this allows to a very powerful estimation of the average cortical thickness by anatomical region. Thus, per region GM differences such as atrophy or hypertrophy can be characterized, making it perhaps the most widely used method in neuroscience.

### 3.2. Voxel-wise analyses

To overcome the time-consuming procedure of the traditional analysis of ROIs, several algorithms that act at the voxel level have been proposed. These include the statistical parametric mapping (SPM), a voxel-based morphometry (VBM) or the first machine-learning approach in this chapter, called voxel as features (VAF) (see **Figure 2**).

#### 3.2.1. Statistical parametric mapping (SPM)

*Statistical parametric mapping (SPM)* was proposed by Friston et al. [45] to automatically examine differences in brain activity during functional-imaging studies involving fMRI or PET. The technique can be applied to both single images (PET) and time series (fMRI), and the idea behind it is construct a general linear model (GLM) to describe the variability in the data in terms of experimental and confounding effects.

The level of activation is assessed at a voxel level using univariate statistics, and featuring a correction for type I errors (false positives), due to the problem of multiple comparisons. In the case of time series analysis, a linear convolution of the voxel signal with an estimation of the hemodynamic response is performed, and then the activation is tested against the analysed task.

The representation of the activation is frequently presented as an overlay of the Z-scores obtained for each voxel after the multiple comparisons correction on a structural image. The *Z*-score–or standard score–is the signed number of standard deviations an observation is above the mean. The resulting maps allow a visual inspection of the active brain areas, which can later be related to a certain disease or task.

#### 3.2.2. Voxel-based morphometry (VBM)

*Voxel-based morphometry (VBM)* is the application of SPM to structural MRI images [4]. The principle behind VBM is the registration to a template, and then a smoothing of the structural image so that the smaller anatomical differences between subjects are reduced. Finally, a GLM is applied voxel-wise to all the images, in order to obtain a *Z*-score map that highlights the areas where the differences are greater.

As commented in Section 2.2, the size of the smoothing kernel is an important parameter. A small kernel will lead to artefacts in the Z-maps, including misalignment of brain structures, differences in folding patterns or misclassification of tissue types. On the other hand, a larger kernel will not be able to detect smaller regions.

Newer algorithms expand the idea behind VBM using multivariate approaches, to reveal different patterns. These algorithms include an independent component analysis (ICA) decomposition of the dataset and conversion to *Z*-scores, called source-based morphometry [52] and a multidimensional tensor-based morphometry [53].

#### 3.2.3. Voxels as features (VAF)

*Voxel-as-features (VAFs)* is another voxel-wise approach proposed in Ref. [5] for the diagnosis of Alzheimer’s disease using functional imaging. It can be considered the first machine-learning approach in this chapter, since it features a linear support vector machine (SVM) classifier whose input features are the intensities of all voxels in the images. It has been used in many works [6, 13, 25, 32, 33] as a baseline and an estimation of the performance achieved by expert physicians by means of visual analysis.

Additionally, some improvements can be done over the raw VAF, for example, using statistical hypothesis testing to obtain the most significant voxels, thus reducing the computational load and increasing the accuracy. The weight vector of the linear SVM can be inversely transformed to the dimension of the original images, and therefore provide a visual map that reflects the most influential voxels, in a similar way to the Z-maps of SPM and VBM.

## 4. Advances in brain image analysis

The application of new machine-learning techniques in CAD systems is a current trend. Works on this topic have increased exponentially in the past 10 years, and it is expected to grow even more. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data, and therefore it is very useful in neuroimaging.

Two approaches exist in machine learning. *Supervised learning* explores the patterns that lead to a certain outcome, for example, the brain activation patterns that are related to a certain disease. On the other hand, *unsupervised learning* explores the underlying structure of the data. Machine learning in CAD is mostly based on supervised learning, since it is focused on the prediction and analysis of patterns related to a certain disease.

In its simplest form, a machine-learning pipeline for neuroimaging consists of a single *classifier*, just as the VAF approach that we mentioned. However, classifiers can improve their detection power if higher-level features are extracted from the data, for example, features that represent the distribution of the voxel intensities, the texture of the images or the sources of variance of the maps. This is known as *feature extraction*, and the most common technique is *image decomposition*.

### 4.1. Classification in neuroimaging

We mentioned that the simplest approach to machine learning is classification. Classification basically is induction, that is, using a set of samples extracted from the real world from which we know their class (also known as label or category), and build a model that can identify the class of new samples that were never seen.

Statistical classification is a fertile field, and many classification algorithms are being developed at the moment. A wide range of strategies exist, among them are the following: *instance-based methods*, where the new samples are classified by a measure of similarity with known samples; *hyperplane-based methods*, where a multidimensional function that weights the input features produces a score that identifies the class; *decision trees*, where the information is encoded in hierarchical nodes that test one feature at a time; *classifier ensembles, artificial neural networks* and many others.

Hyperplane-based methods are perhaps the most extended in neuroimaging and among them, by far, *support vector machines (SVM)* [6–8]. They offer high accuracy in high-dimensional spaces and can be expanded with the use of kernels to tackle non-linearly separable data. Furthermore, they are very robust to overfitting (unlike decision trees) and therefore more generalizable. Other frequent methods include the hyperplane-based *logistic regression* [9, 10]—and its multiclass version, the *Softmax classifier*–decision trees and *random forests* [54–56], and ensembles of neural networks [34] or ensembles of SVM [11, 12].

#### 4.1.1. Support vector machines (SVM)

*Support vector machines (SVM)* have been widely used in neuroimaging [5–8] and other high-dimensional problems due to their overfitting robustness. In its linear version, training the classifier is equivalent to finding the coefficient vector that defines the separation hyperplane:

so that the distance between the hyperplane and the nearest point is maximized. That way, the problem reduces to minimize the loss function (based on the Hinge loss):

where *y*_{i} is the class of the i-th training sample, *b* is the bias so that *b*/ is the offset of the hyperplane from the origin in the direction of , and *λ* is the regularization term, used to keep the coefficients in as small as possible. Many methods exist to perform this optimization, for example, gradient descent or primal and dual loss (using quadratic programming).

Despite the fact that SVMs are mostly used in its linear form, an extension to non-linearly separable problems can be made using the kernel trick. By using kernels, we can implicitly transform data to a higher-dimensional space, but without needing to work directly in that transformed space, simply replacing all dot products in the SVM computation with the kernel. A kernel is a function _{M} is the dot product in

#### 4.1.2. Ensembles of classifiers

Ensembles of classifiers are a recent approach to machine learning in which, instead of choosing the best classification algorithm, we train and test different classifiers with different properties, and then combine their outputs to obtain the prediction. In the simplest technique, called *bagging*, we combined the results of the classifiers by voting. This is the strategy used in most neuroimaging works, for example in [11] where the brain was divided into subvolumes and a classifier was assigned to each one. In *boosting* weights are assigned to training samples, and then varied so that new classifiers focus on the samples where previous classifiers failed. Finally, in *stacking,* the outputs of individual classifiers are used as features in a new classifier that combines them. This was the approach used in *spatial component analysis (SCA)* [12], where an ensemble of *M* SVM was trained on *M* anatomically defined regions, and their outputs were combined to define a new decision function based on Bayesian networks.

### 4.2. Image decomposition

Signal decomposition algorithms are the first feature extraction algorithms that we will deal with. They are aimed at modelling a set of samples as a linear combination of latent variables. These latent variables or components can be thought of as the basis of an *n*-dimensional space where each sample is projected and represented as an *n*-dimensional vector. In a general case applied to our neuroimaging data, the source images **X** can be modelled as

where *s*_{i} is the coordinate (or component score) of the current image in the *i*th dimension of the new space defined by all the base vectors (component loadings), and *ϵ* is the error of the estimation. **Figure 3** shows an illustration of the process.

Signal decomposition techniques are widely used in many applications, ranging from one-dimensional signals such as audio or electroencephalography (EEG) to multidimensional arrays, and are frequently applied as feature reduction to overcome the *small sample-size problem*, that is, the loss of statistical power due to a larger number of features compared to the number of samples.

#### 4.2.1. Principal component analysis (PCA)

*Principal component analysis (PCA)* is a decomposition method where each component (here known as principal components) is intended to model a portion of the variance. The order of the components is set so that the greatest variance in the component space comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on. The set of principal component loadings, **W**, is the set of eigenvectors obtained when applying the eigenvalue decomposition of the covariance matrix of the signal, that is, **X**^{T}**X**. It is not uncommon for neuroimaging specialist to refer to these eigenvectors as *eigenbrains*.

The most frequent approach to compute PCA on a given dataset is by using the singular value decomposition (SVD). This performs the decomposition of **X** in eigenvalues and eigenvectors in the form

Here, **U** and **W** are square matrices, respectively, containing *n* and *p* orthogonal unit vectors of length *n* and *p*, known as left and right singular vectors of **X**; and **Σ**is an *n*-by-*p* rectangular diagonal matrix that contains the singular values of **X**. When compared to the eigenvector factorization of **X**^{T}**X**, it is patent that the right singular vectors **X** of are equivalent to the eigenvectors of **X**^{T}**X**, while **Σ**, the singular values of **X**, are equal to the square roots of the eigenvalues of **X**^{T}**X**, therefore, the set of principal component scores **S** of **X** in the principal component space can be performed as

As can be seen here, PCA does not account for the noise independently, but it integrates it in the model as another source of variance. The component-loading matrix can be truncated (i.e., only the first *m* components are used) to reduce the dimension of the principal component space, which is its most used application.

PCA has been used in many neuroimaging works, mainly used for feature reduction in a classification pipeline of nuclear imaging [13–15], but also in structural MRI [16, 17], functional MRI [18] or EEG signals [8].

#### 4.2.2. Independent component analysis (ICA)

Independent component analysis (ICA) performs a decomposition of the source images, but, unlike PCA, it assumes that the components are non-Gaussian and statistically independent from each other. Independence of the components is assessed via either the minimization of the mutual information or checking the non-Gaussianity of the components (motivated by the central limit theorem).

There exist a number of algorithms that implement ICA, from which the most extended are FastICA [57] and InfoMax [58], although other alternatives, such as CuBICA, JADE or TDSEP, are available. Most algorithms use an initial data preprocessing involving centring (create a zero mean signal by subtracting the mean), whitening and dimensionality reduction (with eigenvalue decomposition, SVD or PCA). Afterwards, the algorithm performs a decomposition of each of the *m* components by minimizing the cost function based on the independence criteria, usually the InfoMax, the Kullback-Leibler divergence or maximum entropy in the case of mutual information algorithms and kurtosis or negentropy for the non-Gaussianity-based algorithms. Most ICA algorithms require the number of sources *m* as an input.

The terms used in ICA vary from the ones used in PCA decomposition, although they roughly represent the same concepts. It is all based on a mixing model, where the original data **X**, known as mixed signal, is considered a mixture of several unmixed sources **S**, and therefore the matrix **W** that projects **X** to the ICA space **S** is known as the unmixing matrix.

A basic version of the FastICA algorithm is intended to find a unit vector for the *p*th component so that the projection maximizes non-Gaussianity, measured using an approximation of negentropy . To do so, first, we initialize the weight vector to random. Then, using the derivative of the negentropy estimates, we perform the update:

This is performed for all components and iterated until convergence. The outputs must be decorrelated after each iteration to prevent different vectors from converging to the same maxima. This is achieved, after pooling all to the unmixing matrix **w** in a matricial form, using the alternative found in [57]

and repeating

until convergence.

Many neuroimaging works have also applied ICA for feature extraction and reduction in classification pipelines [19, 20], but its main application today is in the processing of EEG signals [21, 22].

### 4.3. Other feature extraction techniques

In this section, we address feature extraction techniques other than the aforementioned decomposition algorithms. The philosophy behind them is still to provide higher-level features that allow a feature space reduction to overcome the small sample-size problem, but they are intended to extract and quantify information that otherwise would not be available.

#### 4.3.1. Texture analysis

Texture analysis is any procedure intended to classify and quantify the spatial variation of voxel intensity throughout the image. In neuroimaging, they are more commonly used to classify images or to segment them (which can be also considered a form of classification). Depending on the number of variable studies, we can divide the methodology into first-, second- and higher-order statistical texture analysis. In *first-order statistics*, only voxel intensity values are considered, and values such as average, variance, kurtosis or frequency (histogram) of intensity values are computed.

*Second-order statistical texture analysis* is by far the most popular form. It is based on the probability of finding a pair of grey level at a certain distance and orientation on a certain image. Most algorithms are based in the grey level co-occurrence matrix (GLCM), in which is known as *Haralick texture analysis* [59]. The GLCM can be thought of as a representation of the frequency of the adjacent pixel variations in a certain spatial direction and distance. For a three-dimensional **I** and two different grey levels *i* and *j*, at a given offset *Δ* the co-occurrence value is defined as

where *p* = (*x*, *y*, *z*) is the spatial position where *x* = 1 … *n*, *y* = 1, … *m*, *z* = 1, … *k*, and

Higher-order analyses include the grey-level run length method (GLRLM) [26] that computes the number of grey-level runs of various run lengths. Each grey-level run is a set of consecutive and collinear voxels having the same grey-level value. Other texture-based feature extraction methods that have been applied in neuroimaging are the wavelet transform [27], the Fourier transform, used for segmentation [28] and characterization of MRI images [29], or local binary patterns (LBPs) [30, 31].

#### 4.3.2. Spherical brain mapping (SBM)

The *spherical brain mapping (SBM)* is a framework that performs feature reduction of neuroimaging from three-dimensional volumes to two-dimensional maps representing a certain feature [32]. This is done by establishing a spherical coordinate system centred at the anterior commissure (AC) and defining a mapping vector at an elevation (*θ*) and azimuth (*φ*) angles. The sets of voxels **Figure 4**.

An extension using volumetric LBP was recently proposed [31], and another utility that uses hidden Markov models to compute the path in the direction (*θ*,*φ*) using similar-intensity measures was later proposed [33].

#### 4.3.3. Artificial neural networks (ANN)

*Artificial neural networks (ANN)* were inspired by the analysis of biological neural networks in the second half of the twentieth century, but then diverged from their original goal and now are a diverse area of research in machine learning. ANNs are composed by neurons, which model the behaviour of biological neurons by adding up the input signals * x_{i}*, using different weightings

*w*and producing an output by means of an activation function

_{i}*f*(

*x*). Over the time, activation functions such as sigmoid or tan

*h*have been used, although nowadays the most common is rectified linear unit (ReLU), that computes the function

**Figure 5**).

The architecture of an ANN is a series of interconnected layers containing neurons. The perceptron, one of the most basic approaches, comprises an input layer (no neurons), a hidden layer (composed by *n* neurons connected with all inputs and all neurons in the output layers) and one output layer. ANNs have been proven to be universal approximators. This means that given any continuous function *f*(*x*), and some error *g*(*x*) with at least one hidden layer such that

The field is extremely vast and there exist countless architectures, but the most common when applying them to neuroimaging are the *multilayer perceptron* [34] and *self-organizing maps* (SOMs) for segmentation [35] and pattern recognition in functional imaging [36], and more recently *convolutional networks*, which are the base of deep learning, used in segmentation [37], feature extraction [39] and classification [38, 40].

## 5. Conclusions

Computer-aided diagnosis systems are currently a thriving area of research. The bases are already established and contained in widely used software such as SPM or FreeSurfer. The neuroimaging community already uses these in their daily work, both at research and at clinical practice, with great benefit for the patients. These pieces of software usually include preprocessing (registration, intensity normalization, segmentation, etc.) and posterior automated procedures, such as ROI analysis, VBM or SPM, just like we saw in Sections 2 and 3.

In addition to this, state-of-the-art CAD systems involve the use of advanced techniques to characterize neuroimaging data. The field is still being developed and relevant breakthroughs are still to be made. Advances are being made in a daily basis, with the development of new image modalities involving highly specific radiotracers, advanced registration, correction of inhomogeneities or application of existing machine learning and large data algorithms.

In this review, we have revealed a tendency towards fully automated tools capable of processing neuroimaging data, extract information and even predict the likelihood of having a specific condition. It is very likely that neuroimaging techniques will continue to increase its resolution and usage, and in this scenario the amount of data available will grow exponentially. CAD systems involving most of the topics that we covered in this chapter will be therefore crucial in clinical practice to provide understanding of all available information, otherwise intractable. Only this way can we address a major challenge: to discover meaningful patterns related to behaviour or diseases that ultimately help us to understand how the brain works.