Reliability Maps in Event Related Functional MRI Experiments

In functional magnetic resonance imaging (fMRI) studies, the blood oxygen level-dependent (BOLD) signal change, in contrast to noise, is typically small (< 5%; e.g., Chen & Small, 2007). Although the quality of acquired image data may be improved by pre-processing images with lowor high-pass filters, classification of voxels into the active/inactive status could vary from one study to the next even when the same experimental paradigm is implemented (Maitra, 2009). Reliability assessment would contribute significantly to the knowledge on noise structures in image data, as a function of stimulus sequences, ethnic groups, imaging techniques and scanner differences (Biswal et al., 1996; Genovese et al., 1997; Maitra et al., 2002).


Introduction
In functional magnetic resonance imaging (fMRI) studies, the blood oxygen level-dependent (BOLD) signal change, in contrast to noise, is typically small (< 5%; e.g., Chen & Small, 2007). Although the quality of acquired image data may be improved by pre-processing images with low-or high-pass filters, classification of voxels into the active/inactive status could vary from one study to the next even when the same experimental paradigm is implemented (Maitra, 2009). Reliability assessment would contribute significantly to the knowledge on noise structures in image data, as a function of stimulus sequences, ethnic groups, imaging techniques and scanner differences (Biswal et al., 1996;Genovese et al., 1997;Maitra et al., 2002).
In the literature, there have been two main approaches to quantifying reliability of activation. The first involves the analysis of fMRI data acquired in a group of subjects (or more than one group) performing the same task in different days under multiple experimental conditions. The noise structure can be assessed by the intra-class correlation (ICC) analysis (Brennan, 2001;McGraw & Wong, 1996), which provides individual sources of noise associated with experiment-specific conditions (Aron et al., 2006;Fernandez et al., 2003;Franco et al., 2009;Friedman et al., 2008;Manoach et al., 2001;Miezin et al., 2000;Raemaekers et al., 2007;Specht et al., 2003;Zuo et al., 2010). The second approach considers the same group of subjects in multiple experimental replications, and evaluates test-retest reliability by modeling the number of times out of all replications, that a voxel is consistently classified as active given a decision threshold, as a mixture of binomial random variables Noll et al., 1997). This statistical approach has been extended by incorporating more accurate mixtures distributions and optimization procedure for estimating test-retest reliability (Gullapalli et al., 2005;Maitra et al., 2002).
Other than studying noise structures, reliability analysis would also provide information on invariant brain activity during the experimental session as a useful addition to the conventional measurement of response amplitudes. In a study using the forward-backward viewing movies paradigm, for instance, Hasson et al. (2010) have shown that brain responses in the visual cortex are highly reliable between subjects for both forward and backward presentations; responses in other cortical regions such as the precuneus, lateral sulcus, temporal-parietal junction tend to be less reliable in the backward presentation. However, disrupting the viewing order has no effect on response amplitudes in major cortical regions; markedly though, reliability magnitude varies in these regions. This type of studies has introduced dissociation between persistency and amplitude in brain activity.
The event-related paradigm was originally proposed for detecting transient BOLD responses to brief stimuli or tasks, but its potential use is not limited to short-term stimulation (Josephs et al., 1997). In statistical analysis of event-related fMRI data, a few design contrasts must be specified to estimate stimulus and task effects (Friston et al., 2002;Strother et al., 2004;Worsley et al., 2002). Conventionally, the ICC or test-retest reliability indices have been computed across experimental replicates using the t-or F-values, which are standardized parameter estimates in a linear model with BOLD responses as the dependent variable and a few design contrasts as regressors. The design contrast constitutes a hypothesis on temporal behavior in the brain following the stimulus or task onset.
In this chapter, we outline a reliability analysis procedure applicable directly to BOLD responses (i.e., image intensity) without a prior specification of design contrasts. In a sense, the procedure assesses the persistency in BOLD responses during the experimental session. Nonpersistency implies that a brain region is either heavily contaminated by noise or possibly contains a transient response, the onset of which is not reproducible between replicates. In applications, the procedure would suggest a collection of stable brain responses and their spatial distributions that may or may not be easily modeled or detected by using a weighted linear sum of a few basis functions (Lindquist et al., 2009). For instance, it might not be immediately clear how to specify design contrasts for a relatively longer duration of stimuli (> 40 sec per event) or for analysis of spontaneous brain activity under the eyes-closed and -open states. Stable BOLD responses can be further classified into distinct types featuring the time to response peak, amplitude, duration and sign (increased or decreased responses).
In the method section, we will elaborate the step-by-step procedure for assessing reliability of BOLD responses, testing reliability indices for statistical significance, and constructing reliability maps. For illustration, the method will be applied to an empirical dataset collected in a change detection task using the event-related paradigm (Huettel et al., 2001). Empirical results will show that the criterion of persistency is more sensitive to activity in the grey matter in contrast to that in the white matter. Finally, we will discuss the neurophysiological basis and clinical usage of reliability maps.

Hemodynamic Response Functions
In statistical analysis of fMRI data, it is important to model the BOLD responses as a function of the external stimulus (Buxton et al., 1997;Friston et al., 2000;Obata et al., 2004). By convention, a canonical hemodynamic response function (HRF) can be convolved with the external stimulus function to estimate the responses. The HRF can be formulated using one or two gamma functions to model a slight intensity dip after the response has fallen back to zero (Friston et al., 1998;Lange & Zeger, 1997). The estimated response at a particular scan is then subsampled from the response function specified at the scan acquisition time (Bandettini et al., 1993;Worsley et al., 2002). Canonical HRF assumes an instantaneous short stimulus with a few parameters determined by empirically observing activity in the primary visual cortex (Boynton et al., 1996;Glover, 1999); the function has been well fitted to experimental data in many fMRI studies involving healthy subjects, and is suitable for testing hypotheses on the strength and location of brain activation. However, the function may be ineffectual with experiments involving younger children or clinical patients.
There is an increasing amount of literature showing significant variability in HRFs between brain regions and subjects (Aguirre et al., 1998;Handwerker et al., 2004). The HRF variability also appears between experimental sessions recorded in different days on the same subject (Neumann et al., 2003). If variability in HRFs is known to be quite large, an empirically derived HRF can be used instead of the theoretical one in the generalized linear model (GLM) analysis of fMRI data (Handwerker et al., 2004). However, the variability problem cannot be precisely resolved by inserting an empirical HRF into the GLM because the HRF onset time and latency also varies seriously between brain regions especially in event-related experiments with long-term stimuli. In addition to microvasculature disturbances to variability in HRFs, the temporal behavior of BOLD signal has been found stable in repeated trials recorded in a single session (Aguirre et al., 1998;Miezin et al., 2000;de Zwart et al., 2005). Conditional on a fixed brain region, the HRF shape and amplitude can be highly reproducible within a subject.
Numerous studies in recent years have reported the relative efficiency of different HRF models, including finite impulse response models using basis functions and extension of the canonical HRF to more complicated situations with possible temporal and dispersion derivatives (Lindquist et al., 2009;Stephan et al., 2007). On the other hand, localization of brain activity can be done by data driven methods such as independent component analysis (ICA) or group ICA methods (Gu & Pagnoni, 2008;Varoquaux et al., 2010), which can extract reproducible components between experimental trials or between subjects. Most data driven methods assume non-Gaussian distributions for unknown temporal behaviors, which would make thresholding more difficult in constructing activation maps based on voxel-wise component scores. As was mentioned, BOLD responses can be highly reproducible in a fixed brain region within each subject. In this chapter we introduce a simple procedure for research into stable temporal behaviors in the brain, which can be further classified into different response patterns for selecting regions of interest (ROIs) or for GLM analysis.

Measures of Reliability
Reliability analysis requires assessment data to be structured in similar events or replicates. Event-related fMRI experiments are normally conducted over a period of time which is split into smaller segments or experimental runs to allow subjects to rest. Different runs can be considered as experimental replicates implemented under the same condition for evaluation of between-run reliability. Although the notation used in this section has been designed for between-run reliability analysis, a generalization of the method to other types of situations can be easily made by analogy (e.g., between trials or between subjects). The ICC index is a prominent statistic for measuring reliability of image data between runs. Here we specify the assumption used for computing the index, and its potential competitors. Let S denote the variance-covariance matrix of image data between M runs in a single voxel. The ICC index can be expressed as where 1 is the summing vector of order M, and tr(S) is the trace of S. The index has additionally assumed that the assessment data between replicates can be expressed as an additive equation except for random measurement errors.
The index is particularly sensitive to the variance within each run; if variances of image intensity vary from one run to the next, the size of ICC becomes smaller. However, the index is unaffected by adding a constant to image intensity within each run. As an alternative to ICC, the agreement index is sensitive to all aspects of between-run variation including the mean image intensity. An interested reader may refer to McGraw & Wong (1996) for a detailed comparison between the ICC and agreement index. In general, the two indices give comparable results when the number of scan volumes increases in each run. In some experimental paradigms, the scale of image intensity is allowed to vary between runs, and the additive assumption could be too restrictive for general applications. For example, tasks with high and low working memory loads are implemented in different runs. There are also reliability indices robust to scale changes; that is, image data in one run can be expressed as a linear combination of those in another run.
In our empirical studies, reliability indices with lesser restrictive assumptions always yield greater index values, but the ordering of voxels according to index values remains unchanged especially with fMRI data. Interested readers may refer to Liou (1989) for a review on robust reliability indices. An alternative approach is to specify a common factor model underlying image data and to compute the reliability index based on factor loadings (McDonald, 1999). If the common factor model is misspecified, the reliability estimate using factor loadings could be seriously biased (Yang & Green, 2010). An empirical comparison between the ICC and factor analysis models for estimating reliability can be found in Luke (2005).
We now assume that the maximum autocorrelation coefficient in a fMRI time series decreases toward zero as the time-lag between the correlated observations increases toward infinity. It follows from a standard result for a weakly dependent sequence of random variables (Peligrad, 1996, Theorem 2.1) that the asymptotic distribution of elements in S can be assumed to be multivariate Gaussian as the number of scan volumes n → ∞, where is the Kronecker product and is the population counterpart of S; vechS denotes the vector of those non-duplicated elements in S, and the operational matrix H M satisfies the identity vechS =H M vecS (Henderson & Searle, 1979). Because ICC is a differentiable function of a multivariate Gaussian vector, the asymptotic variance of ICC can be derived as follows Where d' is the derivative of ICC with respect to vech'S as follows: With a moderate size of n (> 100), it is reasonable to assume that is distributed as a standard Gaussian distribution with mean 0 and variance 1, where µ ICC is the mean value in the population and Std(ICC) denotes the square root of Var(ICC) in (2). In applications, one may hypothesize that µ ICC 0 and test an observed ICC for statistical significance against this hypothesis.

Multiple testing
The ICC value is computed by using temporal information within each individual voxel, and the resulting Z value in (3) can be tested against a nominal Type-I error rate α. In this chapter, we assume that only positive ICC values are acceptable. With α = 0.05 for a onetailed test, for example, voxels with Z 1.64 can be selected for constructing reliability maps. As the number of voxels to be evaluated increases, the likelihood of having at least one Type-I error out of all tests also increases in the experiment. By the Bonferroni inequality, /V, where V is the total number of voxels with positive Z values, the familywise error (FWE) rate can be controlled at α by selecting error rate at B for each test. Because of the spatial dependence among image voxels, the Bonferroni procedure tends to be too conservative in general (Nichols & Hayasaka, 2003).
Alternatively, the multistep test uses a sequence of ordered p-values which are compared against different thresholds. The false discovery rate (FDR) is a step-up test widely applied in neuroimaging studies (Chumbley et al., 2010;Genovese et al., 2002;Langers et al., 2007). The FDR procedure considers FDR i/V α/C V as the critical value for the i-th test in the ordered sequence to control the false positive rate at α, where C V is a predetermined constant. The choice of the constant depends on the joint distribution of p-values in the sequence. For instance, C V 1, in case of the Gaussian noise with nonnegative correlation across voxels, and C V ∑ , in case of no assumption on dependence (Genovese et al., 2002). The FDR procedure is easily implemented even for large data sets, and more powerful than the Bonferroni procedure (Langers et al., 2007;McNamee & Lazar, 2004;Nichols & Hayasaka, 2003).
The random field theory (RFT) methods account for spatial dependence in the data, as captured by the maximum of a random field (Nichols & Hayasaka, 2003;Worsley, 1996). It has been shown that the probability of observing a cluster of voxels exceeding a threshold in a smooth Gaussian RF can be approximated by the expected Euler characteristic (EC). The EC counts the number of clusters above a sufficiently high threshold in a smoothed Gaussian RF. Methods based on the RFT comprise a flexible framework for neuroimaging inference, but RFT relies on the assumptions of stationarity and smoothness (Nichols et al., 2003). Without spatial smoothing on ICC values, the RFT methods yield similar results to the Bonferroni procedure in our applications. In the empirical example, we will only present results based on the FDR procedure with C(V) = 1. An interested reader may refer to Nichols and Hayasaka (2003) for a detailed comparison between different approaches to controlling the FWE.

Reliability Maps
In fMRI experiments, it is reasonable to assume that environmental, physiological, and psychological factors fluctuate randomly throughout the experimental period on a momentby-moment basis, and these random effects occur equally likely in all runs. Imaging techniques, such as pulse sequences, imaging parameters and scanner performance, also affect the quality of observations. Those artifacts, however, may be systematic and nonrandom as long as the same scanners, sequences, and parameters are implemented in the experiments. In addition to regular prewhitening procedures (i.e., slice timing, motion correction, and adjustment for autocorrelation), the fMRI time series in reliability analysis must be corrected for major trend effects in order to account for magnetic field drifts.
There are qualitative descriptions of reliability indices, for instance, poor (<0.00), slight (0.00-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), and almost perfect (0.81-1.00) (Landis & Koch, 1977). The size of ICC also depends on the number of runs and number of scan volumes. However, the standardized ICC in (3) can take into account the sample size effects. In order to construct the reliability maps, the Z value is computed by substituting sample estimate S for population in (3) for each individual voxel using the preprocessed fMRI time series. Voxels with Z values significantly greater than zero can be selected to construct the reliability maps (e.g., Z ≥ 1.64 with Type I error controlled at α = 0.05). In order to control the FWE, the Bonforroni correction, random field theory, and FDR control methods can be applied (Hochberg & Tamhane, 1987;Worsley et al., 1996). In the empirical example, we only present results based on the FDR method which yields most reasonable findings as compared with the other two methods.
The reliability analysis can be applied to each individual subject as well as to a group of subjects. After normalization of each subject's fMRI scans to a standard brain atlas (e.g., the MNI brain), the group standardized ICC for K subjects can be computed as follows: where ICCj denotes the ICC index corresponding to the j-th subject.
The G values can be tested for significance against a standard Gaussian distribution with FDR control of FWE in the normalized space.

Empirical Example
We illustrate the use of the reliability analysis procedure in an example using the long-term stimulus in the experiment (42 sec per event). The dataset was collected in an event-related fMRI experiment involving 10 subjects for investigating brain functions in a changedetection task (Huettel et al., 2001;fMRIDC Accession No: 2-2001-111T9). There were 10 stimulus trials in each run, and each subject completed 10-12 runs in his/her experimental session. In each trial, a pair of images was presented with difference in either the presence/absence of a single object or color of the object. The subjects made the behavioral response by pressing a button when they felt that there was something changing on the trial. Each trial began with 2 sec fixation cross at the center of the screen as a warning signal, followed by the first 30 sec of the trial, during which two images were presented for 300 ms, separated by a 100-ms mask. The mask was removed during the last 10 sec of the trial, and the stimuli alternated every 400ms. During the experiment, the subjects were instructed to keep their eyes on the display at all times. On the average, the between-subject variation is small by inspecting the behavioral data. Figure 2 plots the frequency distribution of voxel-wise Z values in (3) for each of the 10 subjects. Theoretically, the ICC values lie within the range of (−∞, 1], and the plots suggest that all subjects have similar ICC distributions except for Subjects 7 and 8, whose ICC values are mainly distributed in the negative direction (smaller proportions of positive ICC values). The two subjects have average shorter reaction times as compared with other subjects. During the experimental sessions, instantaneous BOLD responses could occur in the two subjects, which may or may not be related to the task. The image data of these two subjects were later eliminated from the group reliability analysis. In applications, visual inspection on the ICC distributions would suggest removing irregular subjects from the group analysis.
Reliability analysis was applied to each individual subject's data without normalization of image to the MNI template. Figure 3 shows the reliability maps for four subjects whose behavioral data and ICC values have comparable distributions in Figures 1 and 2, respectively. The maps for each subject were constructed by computing Z values in (3) for every voxel, and then testing these values for significance against a standard Gaussian distribution using the FDR control of FWE at α = 0.05. The colored overlays in Figure 3 are those voxels exceeding the FDR threshold, and were shown by using each subject's own anatomy in the background. The time series data in the colored regions were further clustered into different patterns using the k-mean method (MacQueen, 1967). Figure 3 also plots the response patterns that are reproducible between the four subjects derived from the k-mean method. The response functions in Figure 3 suggest that there are at least two types of increased BOLD responses (blue and yellow) time locked to the stimulus onset, and one type of decreased responses (green) also time locked to the stimulus onset. The onset time of one increased response (red) is earlier than the stimulus onset. The time to the response peak varies between the three increased BOLD responses and between the four subjects. The time to the dip in the decreased response also varies between the four subjects. The response function in yellow in the figure shows a minor peak in the last 10 sec of the trial during which the mask was removed between images in the change detection task. In applications, these response functions can be smoothed and inserted into the design matrix of GLM in SPM or FSL for advanced statistical analysis, such as comparing stimulus or task effects in different groups.
In order to illustrate group reliability maps, functional and anatomical images of eight subjects (1, 2, 3, 4, 5, 6, 9 and 10) with more reliable data were normalized to the MNI template. The ICC values corresponding to the same voxel in the normalized brain were computed for each of the eight subjects. The average ICCs across the eight subjects and values were computed for all voxels in the normalized brain. Figure 4 shows the group reliability maps in different brain regions. Brain regions with higher values imply that there were stable temporal behaviors during the experimental session in these regions. The reliability maps might suggest potential ROIs for probing high-level brain functions. The average time series across the eight subjects in different voxels with significant values were clustered using the k-mean method. Figure 6 shows the BOLD responses based on the average time series. According to the figure, there are at least two types of decreased responses (green and brown) in the group reliability maps. Figure 7 shows the brain regions corresponding to different BOLD response patterns in Figure 6. The color overlays in Figure  5 and 7 have the same spatial locations with the normalized anatomy in the background. It is interesting to note that the onset time of responses in the parahippocampal gyrus, posterior cingulate and precuneus (the red plot) is slightly earlier than the stimulus onset. This suggest that a mechanism could be carried over from the previous trial to a new trial. The parahippocampal gyrus participates in novelty perception, and the posterior cingulate and precuneus, in attentional shift. The early-onset response could be induced by a preparatory mechanism in anticipation of an expected task (Sirotin & Das, 2009).
Responses in the parahippocampal gyrus, middle frontal gyrus, precuneus and superior/inferior parietal lobule (the yellow plot) show a second peak in the last 10 sec of each trial during which the mask was removed between two images. The second peak could be induced by the task change (with/without the mask). The early decreased responses (in green) in the insula, supramarginal, angular gyrus and precuneus could be coupled with the early increased responses (in red) carried over from the previous trial. The late decreased responses (in brown) in the cingulate gyrus, superior temporal gyrus and medial frontal gyrus could be coupled with increased responses (blue and yellow) for performing the change detection task. Figure 4: The reliability maps of four subjects participating in the change detection task. The colored voxels have ICC values that are significantly greater than zero by a standard Gaussian test with the FDR control of FWE at α = 0.05. The selected slices are all in the axial sections; the slices in columns (a), (b), (c) and (d) locate approximately at z = -9, -4, +5 and +33, respectively. The BOLD response plots in column (e) are those stable response patterns associated with each subject from the k-mean method. The spatial distribution of each response pattern in the brain is highlighted using the same color. The reliability maps are reported by showing in a subject's own anatomy in the background with colored overlays indicating those voxels with Z values exceeding the FDR thresholds.   The group reliability maps for eight subjects using the same colors corresponding to the response plots in Figure 6. The increased responses in blue are mainly distributed in the fusiform gyrus, lingual gyrus, middle occipital gyrus and cuneus; those in yellow are distributed in the parahippocampal gyrus, middle frontal gyrus, precuneus, and superior/inferior parietal lobule; those in red are distributed in the parahippocampal gyrus, posterior cingulate and precuneus. The decreased responses in green are distributed in the insula, precuneus, and inferior parietal lobule; those in brown are distributed in the cingulate gyrus, superior temporal gyrus and medial frontal gyrus.

Discussion
Comparing with other existing methods, reliability maps are simple in construction, and offer rich information on temporal behaviors of different brain regions. The method is applicable to each individual subject as well as to a group of subjects, and has potential use for investigating the HRF variability between subjects and between brain regions. Without reliability assessment, some functional mechanisms could be overlooked in the analysis such as novelty perception in the last few seconds of each trial in the change detection task. One restriction of reliability assessment is that the fMRI time series must be partitioned into replicates (e.g., experimental runs). This might not pose difficulty in applications. For instance, we applied the same procedure to the resting state fMRI with alternate eyes-closed and -open periods (3 min per state with two replications). It was found that the thalamus showed most reliable results, but BOLD responses in this region were quite different from those in other regions. Regions located in the default mode network (e.g., precuneus) also had moderate sizes of reliability. By the conventional approaches to analyzing resting state data (e.g., independent component analysis, and the seed correlation approach), one would mainly find those regions located in the default mode network. We conclude that persistency in BOLD responses as measured by the ICC index is an important indicator of temporal behaviors in fMRI experiments.
The asymptotic theory supporting reliability maps need not require a large number of scan volumes or any stationarity assumption. In our experience, the procedure works well in datasets involving two runs with 160 scan volumes each. In clinical studies involving patients, the number of experimental replicates might be small (a few runs), and the reliability maps would support information on functional distortions in task execution, especially for those decreased responses in Figure 6. Although the k-mean methods can be applied directly to image data without using reliability maps, those clusters of larger size may give ambiguous response patterns for each subject. As was mentioned, nonpersistency might suggest random noise or transient BOLD responses. In event-related experiments with long-term stimuli, there might be several kinds of transient responses occurring in each trial with different onset times. For example, novelty in perceived images may depend on the order of stimulus presentation. The proposed reliability analysis procedure can only assess those temporal behaviors that regularly occur across trials and runs, which could be the major limitation of the procedure.

Conclusion
There have been an increasing number of event-related fMRI studies in cognitive, psychological, and medical research. The procedure for constructing reliability maps is proposed mainly for experiments using event-related designs involving a relatively longer stimulus trial. Its potential use is not limited to event-related designs, however. The method had successfully identified reliable regions in a block-design experiment with two runs involving alternate blocks of high-and low-load working memory tasks. Reliability maps suggest stable temporal behaviors in the experimental session that vary between brain regions among individual subjects. Reliability maps would assist researchers to select ROIs for further analysis, or to insert the obtained response functions into GLM for testing stimulus and task effects. In clinical studies with a small number of replicates, reliability maps can assist for detecting functional disorders in the brain for each individual patient. We conclude that the proposed procedure would support a stream of research probing more complicated BOLD responses in fMRI studies such as the early-onset mechanism in the change detection task.