On the Relationship Between Residue Solvent Exposure and Thermal Fluctuations in Proteins

The dynamic properties such as temperature factors or B-factors of a crystallographic protein structure result from a complex network of various interactions. To compute temperature factors, the straightforward way is to integrate the long time trajectory of the protein structure using molecular dynamics (Levitt & Warshel, 1975; Warshel, 1976; McCammon et al., 1977) and to compute the root mean square fluctuations of the trajectory, which takes into account all possible interactions in terms of empirical force field. Recently, the elastic network model (Tirion, 1996; Bahar et al., 1997; Ming et al., 2002) has been shown to be quite successful in computing temperature factors. Compared with the usual molecular mechanical force field, the elastic network model is a much simpler model. It is based on a mechanical model that each C atom is connected through a single-parameter harmonic potential function to its surrounding atoms that are within a cut-off distance usually in the range of 7 to 10 A. The mathematical operations in the elastic network model are simply the matrix inversion and diagonalization, no trajectory computation required. It is somewhat surprising that a mechanical model as simple as the elastic network model, which uses only single parameter, i.e., the cut-off distance, can compute dynamics properties such as B-factors with accuracy comparable to the more complex molecular dynamics method.


Introduction
The dynamic properties such as temperature factors or B-factors of a crystallographic protein structure result from a complex network of various interactions. To compute temperature factors, the straightforward way is to integrate the long time trajectory of the protein structure using molecular dynamics (Levitt & Warshel, 1975;Warshel, 1976;McCammon et al., 1977) and to compute the root mean square fluctuations of the trajectory, which takes into account all possible interactions in terms of empirical force field. Recently, the elastic network model (Tirion, 1996;Bahar et al., 1997;Ming et al., 2002) has been shown to be quite successful in computing temperature factors. Compared with the usual molecular mechanical force field, the elastic network model is a much simpler model. It is based on a mechanical model that each C atom is connected through a single-parameter harmonic potential function to its surrounding atoms that are within a cut-off distance usually in the range of 7 to 10 Å. The mathematical operations in the elastic network model are simply the matrix inversion and diagonalization, no trajectory computation required. It is somewhat surprising that a mechanical model as simple as the elastic network model, which uses only single parameter, i.e., the cut-off distance, can compute dynamics properties such as B-factors with accuracy comparable to the more complex molecular dynamics method.
The study of solvent accessible surface area (ASA) has been one of the most important topics in computation biology due to the fact that the residues interacting with other biological molecules are located on protein surface (Connolly, 1983). ASA has been used in the studies of protein stability (Gromiha et al., 1999), protein folding (Eisenberg & McLachlan, 1986), and fold recognition (Liu et al., 2007). The relationships between thermal fluctuation and some concepts related to ASA have been studied. B-factor has shown to correlate to local packing densities (Halle, 2002), atomic distance to protein center-of-mass (Shih et al., 2007;Lu et al., 2008) and residue depth (Zhang et al., 2009). Residue flexibility was known to be correlated to its ASA in single protein (Sheriff et al., 1985) or small dataset (Carugo & Argos, 1997). In a related study, the impact of the ASA of immediate and further neighbors of investigated residues was also noted (Zhang et al., 2009). In this study, we further discuss the relationship between ASA and theoretical thermal fluctuations which are not reported in previous studies.
Since the elastic network model is mainly based on information of local packing of each C atom, one may expect that relative atomic solvent accessibility may qualitatively reflect atomic fluctuation. In this work, we showed that this relation goes beyond just a qualitative one -the profiles of the temperature factors of crystallographic structures are very similar to those of the smoothed amino acid solvent accessibility. Our results show that protein dynamical properties can be inferred directly from the static structural properties without assuming an additional mechanical model. Another interesting corollary from our results is that one may predict temperatures factors from protein sequences with a prediction program of solvent accessibility.

Comparison between B-factor and actual RSA value
In this section, we compare B-factor and relative solvent accessibility (RSA) derived from protein structure. First, we discuss B-RSA relationships when different window sizes are used in smoothing RSA. Second, we show in detail how smoothed RSA (sRSA) are better correlated to B-factor than unsmoothed RSA (uRSA) in several example proteins and the phenomenon is generally true in a large dataset. Third, B-RSA correlations are calculated separately for residues located on protein surface, in the core, and in between.

The B-factor and the relative solvent accessibility profiles
The amino acid solvent accessible areas of the proteins are obtained from DSSP (Kabsch & Sander, 1983). For each protein, we can compute its relative solvent accessibility (RSA) profile. The RSA of the i th amino acid of type x, x i a , is computed from where A i is the solvent accessible area of the i th amino acid and x A 0 is the maximal solvent accessible area of amino acid of type x. The RSA profile of a protein of N residues is denoted as The smoothed RSA profile is computed by averaging the RSA of each amino acid together with those of its n flanking amino acids. In the case of the terminal amino acid, its one-sided flanking amino acids are used twice in averaging. The smoothed RSA profile of a protein of N residues is denoted as , where i a is the smoothed relative solvent accessibility. For convenience, we will refer to the original RSA as uRSA and the smoothed RSA as sRSA. The predicted RSA is denoted as pRSA and the smoothed pRSA as spRSA. The spRSA is computed from pRSA by the smoothing method mentioned above. The Bfactors are extracted from the PDB files of x-ray protein structures. The B-factor profile is given as , they are perfectly anti-correlated.

Data set
The data set (PR972) is selected from PDB-REPRDB (Noguchi & Akiyama, 2003), including 972 protein chains with sequence length larger than 60 residues and pair wise sequence identity smaller than 25%. All structures are solved by x-ray crystallography with resolution better than 2.0 Å and R-factors smaller than 0.2. Detail list of the data set can be found in our previous study on protein structure-dynamics relationship .

Selection of smoothing window size of sRSA profiles
The thermal fluctuation of a single residue is affected by its flanking amino acids because of various interactions between their atoms, for example, hydrogen bonds, van der Waals interaction, electrostatic force, etc. Here, we show that the correlation between B-factor and RSA increases obviously with proper smoothing window size.  Figure 1 shows the average correlation coefficients between B-factor and RSA profiles smoothed by different window sizes, ranging from 1 (unsmoothed) to 25. The calculations are done based on the PR972 dataset. The average correlation coefficient between B-factor and RSA increases greatly from 0.51 (unsmoothed) to 0.60 (window size = 3). It is interesting that the results are similar when the window sizes are between 3 to 9 and become worse when the window size is larger than 9. It suggests that the thermal fluctuation of a residue is mostly affected by the neighbouring residues in this range (window size from 3 to 9). When the window size is too large, it seems that the information from distant residues has little correlation with the thermal fluctuation.
Based on the calculations shown in Figure 1, the window size is set to 3 (with the highest average correlation coefficient of 0.60) in the following sections. Figure 2 shows the B-factor, unsmoothed RSA (uRSA) and smoothed RSA (sRSA) profiles of a typical example: 5-carboxymethyl-2-hydroxymuconate isomerase (1OTG:C). The uRSA profile has a more rugged shape when compared with the B-value profile. If the rugged fine structures of uRSA profile are smoothed out, the global shapes of the B-factor and the sRSA profiles are seen to be quite similar ( Figure 1B). The ruggedness of the uRSA profile is due to that the solvent accessibility of an amino acid may be quite different from its immediate flanking amino acids, but the B values of an amino acid and its immediate flanking amino acids appear to be more correlated.

www.intechopen.com
We computed the correlation coefficient between the B-factor, uRSA and sRSA profiles for the PR972 data set. Figure 3A shows the distribution of correlation coefficient between the B-factor and the uRSA profiles. The median of the correlation coefficients is 0.52 and 62% of the proteins have a correlation coefficient ≥ 0.5. Figure 3B shows the distribution of correlation coefficient between the B-factor and the sRSA profiles. The median of the correlation coefficients is now improved to 0.65 and 86% of the proteins have a correlation coefficient ≥ 0.5. Fig. 3. The distribution of correlation coefficients of (A) the uRSA and (B) the sRSA profiles for the nonhomologous data set of 972 chains (PR972). The shaded areas indicate the sequences that have correlation coefficients ≥ 0.5.

Comparison of B-RSA relationship for residues in different structural contexts
Proteins are strongly interacting with their environment and protein dynamics have been shown to be affected or "slaved" to the solvent around them (Fenimore et al., 2004). Since we have observed close relationship between B-factor and RSA, it is interesting to examine the relationships separately for the surface residues, the partially buried residues and the residues deeply buried in protein core.
The residues in the PR972 dataset are grouped based on the local rigidity of each residue. The rigidity score is computed using the WCN model  which is a cutoff-free contact number model. The rigidity scores range from 0 to 1 and we use two bin sizes, bin size=0.1 and bin size=0.2, to separate the residues into 10 and 5 groups respectively. Figure 5 shows the correlation coefficients between B values and sRSA for each group (bin size=0.1 shown in filled circle, bin size=0.2 shown in rhomb). The results clearly show that the correlations between B values and sRSA decrease as the rigidity scores increase (correlation coefficient=-0.77). In other words, B values are better correlated to sRSA for residues located in less crowded environment or on the surface of the protein.

Comparison between theoretical thermal fluctuation and actual RSA value
Thermal fluctuations in protein can be computed theoretically from its structure. Here, we further compare RSA profiles with theoretical thermal fluctuations computed by two methods, the Protein-fixed-point (PFP) model (Shih et al., 2007;Lu et al., 2008) and Weighted-contact number (WCN) model . These two models are shown to be able to reproduce B-factor from protein structure. The WCN model was further applied to the prediction of NMR order parameters  and the identification of enzyme active sites (Huang et al., 2011). The basic idea of the PFP model is assume that thermal fluctuation of a residue is related to the squared distance between the residue and the center-of-mass of the protein: where i B is the theoretical thermal fluctuation of residue i and i r is the distance between residue i and protein center-of-mass. The WCN model is a cutoff-free contact number model. Unlike traditional contact number calculation, the WCN model assumes that all www.intechopen.com residues interact with each other in a protein. The contact number of a residue is contributed by all other residues but each contact term is weighted inversely by the squared distance between them: where i w is the weighted contact number score of residue i, N is the total residue number of the protein, and ij r is the distance between residue i and j.
Experimental thermal fluctuations are affected by different experimental conditions. For example, two structures of rubredoxin, 1CAA and 1IRO, have almost identical structure (RMSD = 0.44Å) but their B-factor profiles are dissimilar. Figure 6 compares their B-factor and WCN profiles, clearly showing that similar X-ray structures may have very different Bfactor profiles.

Prediction of B-factors from sequences based on predicted RSA profiles
Since the smoothed solvent accessible surface profiles are quite similar to the B-factor profiles, we can indirectly predict B-factors from sequence using the smoothed RSA profile predicted from sequence.

Prediction of relative solvent accessibility
The real value prediction of RSA (pRSA) is computed by the support vector machine (SVM) regression model. The inputs to the SVM are position-specific substitution matrix (PSSM) obtained from PSI-BLAST (Altschul et al., 1997), secondary structure profile predicted by PSIPRED (McGuffin et al., 2000) and hydropathy index. The PSSM profile was obtained after three iterations (E-value = 0.003) against the non-redundant protein sequence database. A 24  N scoring matrix is used as an input to the SVM, which including the PSSM profiles, secondary structure profiles and hydropathy index. Here, the size of the sliding window of the sequence, N, is set to 15. We train the SVM regression model by the commonly used RS126 dataset (Rost & Sander, 1993), a non-homologous data set with pair wise sequence identity less than 25% over a length of more than 80 residues.

The support vector machine
The support vector machine method (SVM) (Vapnik, 1995) has been successfully applied to secondary structure prediction Kim & Park, 2003;Ward et al., 2003), subcellular localization prediction Yu et al., 2004), protein fold assignment (Ding & Dubchak, 2001;Yuan et al., 2003) and other biological pattern classification problems (Dobson & Doig, 2003;Chen et al., 2004;Kim & Park, 2004). The original idea of the SVM is to find the separating hyperplane with the largest distance between two classes.

www.intechopen.com
However, because the data to be classified may not always be linearly separable, the SVM overcomes this difficulty by using the kernel function to nonlinearly transform the original input space into a higher dimensional feature space, so that the data may be effectively separated in the higher dimensional space. SVMs perform well compared with other machine-learning methods because of convenient classifier's capacity control and effective avoidance of overfitting. In this work, the software package LIBSVM (Chang & Lin, 2001) was used.

Prediction results and comparison with other methods
The 2-state model is used to evaluate the performance of RSA prediction. Each residue is assigned buried or exposed by comparing its RSA value with a threshold and the prediction accuracy is defined by the percentage of correctly predicted residues. The prediction accuracies using different thresholds are listed in Table 1. The accuracies of two related prediction methods, SVMpsi and Fuzzy K-NN, are also listed for comparison with our results. The SVMpsi (Kim & Park, 2004) used SVM and the position-specific scoring matrix (PSSM) generated from PSI-BLAST and the Fuzzy K-NN (Sim et al., 2005)  The correlation coefficients between the B-factor profiles and the predicted RSA (pRSA) and smoothed pRSA (spRSA) profiles are computed for the PR972 data set. The correlation coefficient between the B-factor and pRSA profiles is 0.44 and 31% of the proteins have correlation coefficient ≥ 0.5. After smoothing the pRSA profiles, the correlation coefficient between the B-factor and spRSA profiles increases to 0.53 and 55% of the proteins have correlation coefficient ≥ 0.5. Since spRSA and B-factor are well correlated, we assume that the spRSA of each residue is its predicted B-factor value. We found that the results of utilizing spRSA are comparable to one of the current best B-factor prediction results (Yuan et al., 2005). Yuan used support vector regression approach and PSSM information to predict the B-factor distribution from sequence and reported a correlation coefficient of 0.53 on a dataset of 766 high-resolution proteins. For fair comparison, we tested our method on the same dataset. The average correlation coefficient between spRSA and B-factor is 0.52 which is comparable to theirs (0.53). Figure 8 shows the B-factor, pRSA and spRSA profiles of human uracil-DNA glycosylase inhibitor (1UGH:I). The correlation coefficient between the B-factor and pRSA profiles of the protein is 0.68 ( Figure 8A). After smoothing, the correlation coefficient between the B-factor and spRSA profiles is 0.83. The B-factor profiles are better correlated to the spRSA profiles than the unsmoothed profiles, especially in the N-terminal and the P26-S39 residue regions ( Figure 8B). Several methods have been developed to predict B-factor from sequence using different methodologies and testing datasets. The results of these methods have been compared using a small dataset (290 protein chains) (Radivojac et al., 2004). The average correlation coefficients are 0.32 (Vihinen et al., 1994) by a sliding window averaging technique and Bfactor propensities, 0.43 by using neural network and multiple sequence alignment information (Radivojac et al., 2004). There were also structure-based prediction methods, for example, it was reported that the elastic network model has an average correlation coefficient of 0.59 between B-factor and their residue flexibility score (Kundu et al., 2002).

Conclusions
Though the dynamic properties of a protein result from a complex network of various interactions, we found that the B-factors of crystallographic structures are closely related to solvent accessibility directly derived from the protein structure. Our results indicate that dynamic properties such as B-factors can be computed directly from the protein's geometrical shape without assuming any mechanical models. Furthermore, we found that the smoothed solvent accessibility profiles are very similar to the B-factor profiles, and in some proteins, these profiles can overlap with each other almost perfectly. In a dataset comprising 972 non-homologous protein sequences, 86% of the proteins have a correlation coefficient between z B and z sRSA larger than 0.5. The results are consistent with the research by previous work (Zhang et al., 2009) showing the linear relationship between RSA and B-value profiles. In addition, we show that the relationship is not equal for residues in the environments of different rigidity. The correlations are higher for the residues located in the loose regions than those in the rigid environments.
Our results suggest that protein structure and protein dynamics are so closely related that the relative solvent accessibility profile allow one to directly infer the complete B-factor profiles. It will be interesting to investigate further whether there exists similar relationship between B-factors and other structural properties other than solvent accessibility. Recent studies showed a close relationship between the sites of low B-factor values and the active or the binding sites (Yuan et al., 2003;Yang & Bahar, 2005). Our results suggest a potential way to identify active sites from sequence without structure information. Figure 9 shows the spRSA and B-value profiles of type 2 rhinovirus 3C protease (1CQQ). The catalytic residues, H40, E71 and C147 (shown in open circle), are located at the low-mobility regions in the spRSA profile. Since the prediction of solvent accessibility (Ahmad et al., 2003;Kim & Park, 2004) as well as the prediction of secondary structure elements (Qian & Sejnowski, 1988;Rost & Sander, 1993;Jones, 1999) from protein sequences are among the earliest and the best developed methodologies in computational biology, our findings suggest an indirect way of predicting B-factors from sequences -approaches based on various machine-learning techniques such as the support vector machines or the neural networks have been quite successful in the prediction of real value solvent accessibility, and we can borrow these methods and directly apply them to the prediction of B-factors by the use of the smoothed relatively solvent accessibility. We predict the solvent accessibility from sequence using the support vector machine method. The results are comparable to that of the current best methods, with an average correlation coefficient of 0.53 between the B-factor and spRSA profiles over a data set of 972 proteins.

Acknowledgements
Thanks to Tzu-Hui Yang for special supports. This research was supported by the National Science Council and ATU from the Ministry of Education, Taiwan, R.O.C.