Protein Structure Alphabetic Alignment

This study presents a fast approach for comparing protein 3D structures by protein structure alphabetic alignment. First, the folding shape of every five consecutive residues is represented by a protein folding shape code (PFSC) (Yang, 2008), so that the protein folding conformation is completely described by PFSC. With this complete description of the folding shape along the backbone, any protein with a given 3D structure can be converted into an alphabetic string and aligned for comparison. Consequently, the approach provides a single score to assess global structural similarity and an alignment table for the analysis of local structure. Several sets of proteins with diverse homology or different degrees of complexity are compared. The results demonstrate that the approach is an efficient method for protein structure alignment, which is significant for high-throughput structure searches of protein databases.


Introduction
Comparing protein structures is challenging because the complexity of 3D structure introduces ambiguity into the analysis. First, a protein structure is not a simple geometric object. It is not easy to superimpose two proteins, because emphasizing one portion of the structures may cause other, similar parts to orient in different directions in space. In practice, a single turning point in a protein may overshadow the overall similarity between two structures. Second, it is hard to develop a uniform process for comparing proteins with different degrees of homology. For protein structures with identical amino acid sequences, or with mutations in sequence, the comparison requires enough sensitivity to distinguish conformers with high structural similarity. For proteins with drastically different conformations, however, a good comparison requires a consistent procedure that evaluates similarity across varied cases. Significant variation of protein conformation is primarily determined by sequence differences, which affect the formation of hydrogen bonds, van der Waals interactions and disulfide bridges. Protein conformation may also be changed by other factors, such as solvent effects, protein-protein interactions and ligand docking. In terms of the topological order of secondary structure, if two structures belong to different categories in protein classification, such as different families, superfamilies, folds or classes, the structural comparison becomes more difficult. An ideal method should assess similarity by a consistent process for proteins with various degrees of structural homology.
Many methods for protein structure comparison have been developed and evaluated (Kolodny et al., 2005). The DALI method (Holm & Sander, 1993; Holm & Park, 2000), which is based on the alignment of distance matrices, is frequently used in protein structure comparison.
To find an optimal solution, most methods attempt to maximize the number of equivalent residues while minimizing the root-mean-square deviation (RMSD), through superimposition of protein 3D structures or alignment of structural fragments. Unfortunately, it is difficult to optimize these two parameters simultaneously, because a higher number of equivalent residues tends to raise the RMSD, while a lower RMSD leads to fewer equivalent residues. In protein structural superimposition, two factors may be adjusted arbitrarily: the cutoff distance for RMSD and the initial focusing location. These factors are not the same across methods, and they may be changed case by case even within the same method, which directly affects the outcome of the comparison. It is therefore not surprising that different methods, or even the same method, may produce different RMSD values and different numbers of equivalent residues. Consequently, different methods may rank the similarity of protein structures differently.
Structural alignment is a popular approach to protein comparison, and it has been developed along different strategies. The first strategy is rigid-body alignment, which directly superimposes two proteins with the best possible fit to obtain the lowest RMSD and the highest number of equivalent residues. The second strategy is non-rigid-body alignment, which allows smaller structural fragments to orient or shift with a certain flexibility for a better fit, and then applies various algorithms to measure similarity. However, no matter how the protein structure is partitioned, obtaining an optimal result still involves the lowest RMSD and the highest number of equivalent residues, which are two conflicting objectives. Direct alignment of geometric objects is difficult because no unique solution exists for superimposing geometric objects of more than three points that cannot be superposed exactly. Structural alphabetic alignment is a way to avoid the direct alignment of geometric objects.
The earliest application of structural alphabets was the assignment of secondary structure in proteins, using the letter "A" for α-helix, "E" for β-strand and "C" for coil. Since then, structural alphabetic methods (Brevern et al., 2000; Kolodny et al., 2002; Micheletti et al., 2000; Rooman et al., 1990; Schuchhardt et al., 1996; Unger et al., 1989; Sander et al., 2006; Tung et al., 2007; Ku & Hu, 2008; Karplus et al., 2003; Murphy et al., 2000) have been developed to assign representative folding shapes in more detail. Different structural alphabets define different peptide lengths and adopt different numbers of prototype folding shapes. With a pentapeptide motif, the Protein Blocks (PBs) method determined 16 folding shapes and used letters to represent these primary prototypes (Kolodny et al., 2002). It was then applied to protein structural alignment (Brevern, 2005; Joseph et al., 2011). Based on different designs of structural alphabets, a variety of structural alphabetic alignment methods have been developed (Ku & Hu, 2008; Karplus et al., 2003; Melo & Marti-Renom, 2006; Friedberg et al., 2007; Guyon et al., 2004; Sacan et al., 2008; Wang & Zheng, 2008). Structural alphabetic alignment is significantly faster than methods based on 3D structural comparison, and it avoids the ambiguity of structural superimposition. However, to date the prototype folding shapes in structural alphabetic methods have been obtained by observation of a training database, with the primary motifs of the folding patterns then determined by statistical judgment. With a training database, the observations may collect most of the folding patterns that appear frequently in proteins, but may miss certain folding shapes because they appear rarely. Also, each prototype folding pattern, or letter, is isolated, with no defined relationship to the others.
A recently developed structural alphabet approach, the protein folding shape code (PFSC) (Yang, 2008), overcomes these shortcomings: it comprises a complete set of folding patterns for motifs of five residues, and all folding patterns are meaningfully interrelated.
In this study, a set of 27 PFSC vectors is used to describe the folding shapes of protein structures and is applied to structural alignment. The 27 PFSC vectors are rigorously obtained by mathematical derivation to cover an enclosed space, and they represent all possible folding shapes for any five successive Cα atoms (Yang, 2008). The 27 PFSC vectors are symbolized by the 26 alphabetic letters plus the $ symbol, which together completely describe the change of folding shape along the protein backbone from N-terminus to C-terminus without gaps. With a complete description of the folding shape for any given protein 3D structure, a consistent method for aligning protein structures is developed, which is able to assess structural similarity across various degrees of homology.

Conversion of alphabet description
The protein 3D structure is first converted into an alphabetic description with the protein folding shape code (PFSC) (Yang, 2008). In the PFSC approach, a set of 27 PFSC vectors represents all possible folding shapes for each five successive Cα atoms. The 27 PFSC vectors, the prototypes of the folding shapes and their letters are shown at the top of Fig. 1. The 27 PFSC vectors map all possible folding shapes, including regular secondary structure and irregular coils and loops. The 27 PFSC letters describe the change of folding shape along the backbone, five successive Cα atoms at a time, and so provide a complete alphabetic description of the protein conformation from N-terminus to C-terminus without gaps. Taking the protein structure 8DFR (PDB ID) as an example, the folding shape of each five successive Cα atoms is converted into one of the 27 PFSC letters along the protein backbone. The resulting PFSC alphabetic description of the folding conformation is shown at the bottom of Fig. 1.

Fig. 1. The 27 protein folding shape codes and the conversion of a protein into an alphabetic description. Top: the three blocks represent three regions of pitch distance for a motif of five residues; the nine vectors in each block represent the nine folding shape patterns determined by two torsion angles; each vector is simultaneously represented by a letter, a folding shape pattern and an arrow. The vector characteristic is represented by an arrow line. The "α", "β" or "*" at each end of a vector indicates folding features similar to α-helix, β-strand or random coil respectively. Bottom: 8DFR (PDB ID) illustrates how the protein backbone conformation is converted into the PFSC alphabetic description. The folding shape of each five successive Cα atoms in the protein backbone, from N-terminus to C-terminus, is converted into an alphabetic description. "A" represents a typical α-helix (red) and "B" a typical β-strand (blue). Folding shapes derived from secondary structure are shown in pink, and shapes for loops or coils in black.
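The windowing behind this conversion can be sketched in Python. The sketch below slides a window of five Cα atoms along the backbone, measures the pitch distance and the two torsion angles named in the caption of Fig. 1, and bins each quantity into three ranges to obtain one of 27 indices. The cutoff values, the three-way binning of each torsion angle and all function names are assumptions made for illustration; they are not the PFSC boundaries or the letter assignment of Yang (2008).

```python
import math

# HYPOTHETICAL cutoffs, chosen only for illustration; they are NOT the
# boundaries used by PFSC (Yang, 2008).
PITCH_CUTS = (8.0, 11.0)     # Angstrom, distance between Ca(i) and Ca(i+4)
TORSION_CUTS = (0.0, 120.0)  # degrees

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def torsion(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    length = math.sqrt(dot(b1, b1))
    unit_b1 = (b1[0] / length, b1[1] / length, b1[2] / length)
    return math.degrees(math.atan2(dot(cross(n1, unit_b1), n2), dot(n1, n2)))

def bin3(value, cuts):
    """Map a value to bin 0, 1 or 2 using two cutoffs."""
    return 0 if value < cuts[0] else (1 if value < cuts[1] else 2)

def window_codes(ca_coords):
    """Return one shape index (0..26) per window of five successive Ca atoms."""
    codes = []
    for i in range(len(ca_coords) - 4):
        w = ca_coords[i:i + 5]
        pitch = math.dist(w[0], w[4])
        t1 = torsion(w[0], w[1], w[2], w[3])
        t2 = torsion(w[1], w[2], w[3], w[4])
        codes.append(9 * bin3(pitch, PITCH_CUTS)
                     + 3 * bin3(t1, TORSION_CUTS)
                     + bin3(t2, TORSION_CUTS))
    return codes

# Idealized helix-like Ca trace: radius 2.3 A, rise 1.5 A, 100 degrees per residue.
helix = [(2.3 * math.cos(math.radians(100 * k)),
          2.3 * math.sin(math.radians(100 * k)),
          1.5 * k) for k in range(20)]
codes = window_codes(helix)
print(len(helix), len(codes))  # 20 residues give 16 windows
```

A chain of m residues yields m-4 windows, and a perfectly regular trace such as the idealized helix receives the same index in every window.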

Protein Folding Shape Alignment (PFSA)
With the one-dimensional PFSC alphabetic description, protein conformations can be compared by the protein folding shape alignment (PFSA) approach (Yang, 2011). As in sequence alignment, the PFSC alphabetic strings of the proteins are aligned to find matching regions. PFSA uses the Needleman-Wunsch dynamic programming algorithm (Needleman & Wunsch, 1970) for structural alignment. The structural similarity of two proteins is thus revealed by structural alphabetic alignment.
In the PFSA approach, a substitution matrix for the 27 PFSC vectors is defined according to the similarity relationships among the vectors. Each element S[i, j] of the substitution matrix S is determined by the similarity between PFSC[i] and PFSC[j], which follows from the integrated relationship of the 27 PFSC vectors (Yang, 2008). For identical folding shapes, S[i, i] = 2; for analogous folding shapes, S[i, j] = 1; and for different folding shapes, S[i, j] = 0. The substitution matrix S is displayed in Table 1. In the next step, a similarity matrix for the two proteins is constructed, with all elements of the similarity matrix M determined from the substitution matrix S. Let m and n be the lengths of the amino acid sequences of proteins A and B respectively. Because each PFSC letter covers five residues, the PFSC strings of proteins A and B have lengths m-4 and n-4. With the protein folding shape strings of protein A[3…m-2] and protein B[3...n-2], a similarity matrix M of dimension (m-4) x (n-4) is constructed for the pair of proteins A and B. The third step is to obtain a sum matrix by computing over the elements of the similarity matrix according to the Needleman-Wunsch algorithm. From the sum matrix, an optimized structural alignment is obtained by tracing elements from the largest value to smaller values. When the trace shifts off the diagonal of the sum matrix, it inserts a gap to reduce mismatches in favor of matching identical or analogous folding shapes.

Table 1. The substitution matrix of the 27 PFSC vectors. The top row and the left column list the 27 PFSC letters. The value of an element in the substitution matrix is 2 for identical folding shape codes and 1 for analogous folding shape codes; an empty element means zero, for different folding shape codes.
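The three steps can be illustrated with a minimal Needleman-Wunsch sketch in Python. It is not the PFSA implementation: the set of analogous letter pairs is a hypothetical stand-in for Table 1, the gap character "-" is an assumption, and a single linear gap penalty of -2 replaces the separate opening and extension penalties for brevity.

```python
# HYPOTHETICAL analogous pairs standing in for the entries of value 1 in Table 1.
ANALOGOUS = {frozenset(("A", "V"))}
GAP = "-"  # assumed gap character; it is not one of the 27 PFSC symbols

def substitution(a, b, analogous=ANALOGOUS):
    """2 for identical codes, 1 for analogous codes, 0 otherwise."""
    if a == b:
        return 2
    return 1 if frozenset((a, b)) in analogous else 0

def needleman_wunsch(s, t, gap=-2.0):
    """Global alignment of two code strings with a linear gap penalty."""
    n, m = len(s), len(t)
    # Sum matrix with one extra row and column for leading gaps.
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + substitution(s[i - 1], t[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the final cell to recover one optimal alignment.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + substitution(s[i - 1], t[j - 1]):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append(GAP); i -= 1
        else:
            row_s.append(GAP); row_t.append(t[j - 1]); j -= 1
    return "".join(reversed(row_s)), "".join(reversed(row_t)), F[n][m]

print(needleman_wunsch("AAAVBBB", "AAABBB"))
```

In this toy case the trace leaves the diagonal once, inserting a gap opposite the extra letter so that the remaining identical codes line up.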

Similarity score
With the optimized alignment, the protein structural similarity score is calculated. Each match of identical folding shapes is assigned 2, each match of analogous folding shapes 1 and each pair of different folding shapes 0; the penalty for opening a gap is -2 and the penalty for extending a gap is -0.25. The protein folding structure alignment score (PFSA-S) is determined by the total contribution of identical folding shapes, analogous folding shapes and gaps. The score is normalized with the function below.
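From the weights stated above and the denominator described in the text, the normalization can be written as follows. This is a reconstruction from those definitions rather than a copy of the published formula; IDFS and ANFS denote the numbers of identical and analogous folding shapes, GPO and GPE the numbers of opened and extended gaps, and TSQ the length of the PFSC string.

```latex
\[
\text{PFSA-S} \;=\;
\frac{2\,\mathrm{IDFS} \;+\; \mathrm{ANFS} \;-\; 2\,\mathrm{GPO} \;-\; 0.25\,\mathrm{GPE}}
     {2\,\mathrm{TSQ}}
\]
```

For two identical structures, IDFS = TSQ and the other three counts are zero, so the expression evaluates to one.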
Here IDFS is the number of identical folding shapes, ANFS the number of analogous folding shapes, GPO the number of opened gaps, GPE the number of extended gaps and TSQ the length of the PFSC string of the protein. The denominator in the formula, 2 x TSQ, ensures that PFSA-S equals one when two identical structures are compared. As the similarity between two protein structures decreases, the value of PFSA-S decreases. When two proteins have little similarity, the structural alignment produces a larger number of gaps, which may give a negative value of PFSA-S, signifying that no noteworthy similarity exists. For normalization, PFSA-S is constrained to be greater than or equal to zero, so any negative value is set to zero. The PFSA approach therefore provides a normalized score between zero and one for evaluating protein structural similarity.
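As a rough illustration, the score can be computed from two aligned PFSC rows in a few lines of Python. The sketch assumes that the first column of a run of gaps counts as an opened gap and each further column of the same run as an extended gap; that counting rule, the gap character "-" and the function name are assumptions, not details confirmed by the text.

```python
GAP = "-"  # assumed gap character

def pfsa_score(row_a, row_b, analogous, ref_len):
    """Normalized similarity of two aligned PFSC rows, clamped at zero.

    analogous: set of frozenset pairs treated as analogous codes (hypothetical).
    ref_len:   TSQ, the PFSC string length of the reference protein.
    """
    identical = analog = gap_open = gap_ext = 0
    prev_gap_row = None  # which row carried a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == GAP or b == GAP:
            gap_row = "a" if a == GAP else "b"
            if gap_row == prev_gap_row:
                gap_ext += 1      # same run of gaps continues
            else:
                gap_open += 1     # a new run of gaps starts
            prev_gap_row = gap_row
            continue
        prev_gap_row = None
        if a == b:
            identical += 1
        elif frozenset((a, b)) in analogous:
            analog += 1
    raw = 2 * identical + analog - 2 * gap_open - 0.25 * gap_ext
    return max(0.0, raw / (2 * ref_len))

print(pfsa_score("ABBA", "ABBA", set(), 4))          # identical rows
print(pfsa_score("AAAVBBB", "AAA-BBB", set(), 7))    # one opened gap
```

Clamping with max reproduces the rule that a negative raw value is reported as zero.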

Alignment table
By comparing the one-dimensional alphabetic strings of protein folding conformations, the PFSA alignment table is generated. There are two types of alignment table: the sequence-dependent mode and the sequence-independent mode. For the same protein, or for proteins with mutations, structural alignment for conformation analysis may prefer the sequence-dependent mode, because gap insertion is not necessary. For proteins with different sequences and sizes, structural alignment takes advantage of the sequence-independent mode, which allows gaps to be inserted to obtain the best match in local structural similarity.
The PFSA alignment table has several features. First, it explicitly reveals the similarity and dissimilarity of local structure. Second, it exhibits how all similar fragments are matched or shifted by the insertion of gaps. Third, it intuitively displays how each folding shape is associated with the corresponding five consecutive amino acid residues, which assists the analysis of the relationship among amino acid sequence, structure and function in proteins.

Conformation analysis
Protein structure 1M2F (PDB ID) has 25 conformers obtained by NMR spectroscopy, shown in Fig. 2(A). 1M2E (PDB ID), in Fig. 2(B), is the average structural model of the 25 conformers of 1M2F (Williams et al., 2002). All of these structures have identical sequences and similar 3D conformations. Differentiating structures with such high similarity requires a tool sensitive enough to distinguish each conformer in both global and local structure. With the PFSA approach, each conformer of 1M2F and the structure of 1M2E are converted into one-dimensional PFSC alphabetic descriptions and then aligned for comparison. The PFSA alignment table is displayed in Table 2.

Table 2. 1M2F and 1M2E are listed in the left column. The amino acid sequence and rule for number of residues […] protein folding shape code (PFSC) for each conformer is listed following the structure name. The α-helices are marked in red, the β-strands in blue and the tertiary fragments in black. Also, the analogous […] structure are marked in pink.

The PFSA alignment table has the capability for analysis of […] alignment table does not only align the secondary structure (font in red and blue), but it […] tertiary structure (font in black). Second, the alignment table exhibits the detailed element alignment within each fragment of secondary structure. The font in pink indicates the alteration in […] or the flexible terminal of secondary structure. Third, the alignment table intuitively reveals structural stability or flexibility. For example, in the regions of the fragment of residues (50-54) and the fragment […] conformations show fluctuation across the 25 conformers, which indicates these two regions with more […] protein segments.

www.intechopen.com
The PFSA approach is also capable of evaluating global similarity, for which it provides the PFSA-S score. 1M2E in Fig. 2(B), as the average structure, is compared with each of the 25 conformers of 1M2F in Fig. 2(A). The similarity scores are listed in descending order of PFSA-S in Table 3, together with the numbers of identical and analogous folding shapes. The results are also compared with the LGA method (Zemla, 2003). Both PFSA-S and the PFSA alignment table explicitly display the structural differences in protein conformation analysis. The PFSA approach is thus sensitive enough to differentiate each conformer.

Notes to Table 3. LGA: LGA method (Zemla, 2003); GDT_TS: an estimate of the percentage of residues (largest set) that can fit under the distance cutoffs of 1, 2, 4 and 8 Å; N: number of residues superimposed under a cutoff distance; RMSD: root-mean-square deviation of all corresponding C-alpha atoms.

Domain-domain comparison
Proteins that belong to different categories in the structural classification of proteins (SCOP) (Murzin et al., 1995) are compared. The structure 1M2E in Fig. 2(B) is compared with the N-terminal domain of chain A of 1A2O (1A2O-A) in Fig. 2(C) and then with its C-terminal domain in Fig. 2(D). Although all three structures are classified in the class of alpha and beta proteins (α/β), they belong to two different folds in SCOP. Both 1M2E and the N-terminal domain of 1A2O-A belong to the Flavodoxin-like fold, whereas the C-terminal domain of 1A2O-A belongs to the Methylesterase CheB fold. The structural classification of 1M2E and of the N-terminal and C-terminal domains of 1A2O-A is summarized in Table 4.
First, the alignment table provides detailed information on the alignment of local structural fragments. Table 5 shows the comparison of 1M2E with the N-terminal domain of 1A2O-A, and Table 6 the comparison of 1M2E with the C-terminal domain of 1A2O-A. The alignment tables display how fragments with similar local folding shapes are matched up by the insertion of gaps. In an alignment table, aligned identical protein folding shape codes are marked with "|", analogous codes with "*", different codes with "^" and insertions with "+". The alignment table thus shows the optimized structural alignment, matching all local structural fragments between the two proteins. Second, PFSA-S provides a quantitative assessment of similarity for global structural comparison. The PFSA-S values are listed in Table 4, together with the numbers of identical and analogous folding shapes and the number of inserted gaps. Compared with the C-terminal domain, the N-terminal domain of 1A2O-A has a higher PFSA-S similarity score with 1M2E (0.7214 vs. 0.2109), larger numbers of identical and analogous folding shapes and fewer gaps.
The results reflect the difference in homology between these two pairs of proteins in the structural classification.

Tables 5 and 6. The protein folding shape code (PFSC) for each structure is listed following the structure name. The α-helices are marked in red, the β-strands in blue and the tertiary fragments in black. Also, the analogous […] with secondary structure are marked in pink. The "|" indicates alignment with identical folding shapes; "*" analogous folding shapes; "^" different folding shapes. The "+" represents the insertion of gaps.
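The marker row of such a table can be derived mechanically from two aligned PFSC rows. The Python sketch below uses the four markers defined in the text; the gap character "-" inside the rows, the analogous pair and the function name are hypothetical choices for illustration.

```python
def marker_line(row_a, row_b, analogous, gap="-"):
    """Build the marker row: | identical, * analogous, ^ different, + gap."""
    marks = []
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            marks.append("+")
        elif a == b:
            marks.append("|")
        elif frozenset((a, b)) in analogous:
            marks.append("*")
        else:
            marks.append("^")
    return "".join(marks)

# Toy rows; the analogous pair below is hypothetical, not taken from Table 1.
analogous = {frozenset(("A", "V"))}
row_a = "AAVAB-BB"
row_b = "AAAABJBW"
print(row_a)
print(marker_line(row_a, row_b, analogous))
print(row_b)
```

Printing the marker row between the two code rows gives a compact text view of where the local folding shapes agree, differ or are shifted by gaps.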

Protein comparison
Proteins may comprise a single domain or multiple domains in the chain structure, so aligning whole protein chains involves multiple-domain comparison. For example, the insulin-like growth factor 1 receptor (IGF1R) and the insulin receptor (INSR), transmembrane proteins belonging to the tyrosine kinase superfamily, have multiple domains. Over the past two decades, rich structural data on IGF1R and INSR have accumulated, and sequence alignment has been applied to compare them (Werner et al., 2008; Garrett, 1998; Pautsch, 1997; Hubbard, 1997; Garza-Garcia, 2007). In this study, instead, the folding conformations of IGF1R and INSR are aligned directly for structural comparison. The crystal structures of the first three domains, L1-CR-L2, of IGF1R (PDB ID: 1IGR) (Hubbard, 1997) and INSR (PDB ID: 2HR7) (Murzin, 1995) are available in the PDB. Images of the first three domains of IGF1R (1IGR) and INSR (chain A of 2HR7) are displayed in Fig. 3. Both the L1 and L2 domains consist of a right-handed β-helix conformation. The CR domain is composed of seven modules with eight disulphide-bond connectivities.

The sequence similarity is evaluated by the percentage of identical residues, and the structural similarity is quantitatively assessed by the PFSA score. The similarities of the three domains L1, CR and L2 of IGF1R and INSR are summarized in Table 8. Overall, the two protein structures have 60% sequence identity with a structural similarity score of 0.860. Each pair of domains is also compared. The L1 domains have 67% sequence identity with a structural similarity score of 0.909, the L2 domains 64% sequence identity with a score of 0.929 and the CR domains 49% sequence identity with a score of 0.749. The PFSA scores show that the L1 and L2 domains have higher structural similarity than the CR domains. Also, the L2 domains have a higher degree of structural homology than the L1 domains, even though the L1 domains have higher sequence identity. With the PFSA approach, the quantitative assessment of similarity between IGF1R and INSR agrees with the previous qualitative findings from sequence alignment, while exposing detailed structural features for comparison.

The PFSA approach provides an unambiguous procedure for protein comparison based on structural alphabetic alignment. First, the PFSA approach relies on a complete assignment of protein conformation. PFSC provides such an assignment for any protein with a given 3D structure. Without the use of a training database, all 27 PFSC vectors are obtained by strict mathematical derivation. Each PFSC vector, or alphabetic letter, represents a specific folding shape of five successive Cα atoms in the protein backbone. The folding shape of each five successive Cα atoms in the protein backbone is always assigned one of the 27 PFSC vectors. The protein backbone therefore receives a complete alphabetic assignment of its folding conformation from N-terminus to C-terminus without gaps. Second, the PFSA alignment of alphabetic strings is a consistent process. The PFSA approach avoids the arbitrary choice of geometric parameters in structural comparison, such as the initial focusing location, the cutoff distance for RMSD and the segment length. As with sequence alignment, structural alphabetic alignment provides a fast and stable procedure for protein structure comparison. Third, the PFSA approach is able to handle protein comparison across various degrees of homology, that is, over a wide range of structural difference. This feature is demonstrated by the comparison of conformers in Table 2, the comparison of different proteins in Table 5 and Table 6, and the comparison of proteins with complicated structures in Table 7.
Furthermore, the PFSA approach is able to categorize protein structures in agreement with structural classification by homology. With the structural classification of proteins, SCOP (Murzin, 1995; Andreeva, 2008), as the gold standard, PFSA was used to assess the degree of homology for a set of protein structures, and the distribution of the similarity scores, PFSA-S, agreed overall with the categories in SCOP (Yang, 2011).

Normalized score and unique measurement
With the normalized PFSA-S score, the structural similarity of various proteins is easily assessed. If two sets of structural data describe an identical protein structure, PFSA-S equals one. As the structural similarity decreases, the value of PFSA-S decreases. When the value of PFSA-S is near zero or below zero, the two proteins differ greatly in conformation. The PFSA-S score is normalized by the size of the protein: the length of the protein folding shape string is used as the denominator in the formula when PFSA-S is calculated. If a pair of proteins is compared, either protein may be taken as the reference protein. If a set of proteins is compared with a reference protein, the similarity scores are normalized according to the length of the reference protein. The PFSA approach thus provides a single quantitative measurement for evaluating similarity in protein structural comparison.
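The effect of the choice of reference protein can be seen in a short worked example. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; the weights (2, 1, -2 and -0.25) and the denominator 2 x TSQ are those stated in the Similarity score section.

```latex
\[
\text{raw} \;=\; 2(60) + 1(20) - 2(3) - 0.25(8) \;=\; 132
\]
\[
\text{TSQ}=100:\quad \text{PFSA-S} = \frac{132}{2 \times 100} = 0.66
\qquad\qquad
\text{TSQ}=120:\quad \text{PFSA-S} = \frac{132}{2 \times 120} = 0.55
\]
```

The same alignment therefore gives a lower score when the longer protein is taken as the reference, which is why a set of comparisons should share one reference length.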

Local structural comparison
The PFSA alignment table allows protein structures to be compared in detail. The one-dimensional alphabetic string expresses the change of protein folding conformation along the backbone. Each PFSC letter represents the folding shape of a fragment of five successive amino acids. In the alignment table, the protein folding conformations are aligned by similarity, and the amino acid residues are attached to their associated folding shape codes. Furthermore, the PFSA alignment table includes physicochemical properties of the amino acid residues, which are expressed by alphabetic letters as seen in Table 7. The PFSA alignment table may therefore become a good tool for studying the relationship among sequence, structure and function. The PFSA alignment table is capable of examining structural similarity as well as dissimilarity. If local structures match with identical or analogous folding shapes, the table reveals structural similarity; if local structures align with different folding shapes, it exhibits dissimilarity. Also, some unmatched local structures are shifted by the insertion of gaps, which displays dissimilarity. In general, it is hard to expose both similarity and dissimilarity directly with a protein 3D structural image or a computer modeling animation. Protein modeling provides visualization of the 3D structure, whereas the PFSA alignment table provides a digital description of the conformation. Combining protein 3D modeling with the PFSA alignment table helps to inspect both similarity and dissimilarity in protein structures.

Comparison with other methods
Different methods adopt various strategies and study specific geometric parameters for protein structural comparison. With different parameters and approaches, all methods share the common goal of evaluating the similarity of protein structures. Given this complexity, it is not surprising that there is no unique outcome for protein comparison. In this study, the results from the PFSA approach are compared with those of other methods.

PFSA vs. LGA
The LGA method (Zemla, 2003) is an important approach to protein structure comparison. In particular, it is extensively applied to assess the similarity of predicted protein structures in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) (Kryshtafovych et al., 2007; Moult et al., 2009). The 25 conformers of 1M2F and their average model, 1M2E, are compared by both the PFSA approach and the LGA method. The results are listed in Table 3, where all structures are ranked in order of PFSA-S.
The LGA method and the PFSA approach adopt different strategies to assess structural similarity. The LGA method is designed to evaluate the longest continuous segments (LCS), searching for the largest set of 'equivalent' residues that deviate by no more than a specified distance cutoff. GDT_TS is an estimate of the percentage of the largest set of residues that can fit under selected cutoff distances. A scoring function (LGA_S) was defined as a combination of these values and can be used to evaluate the level of structural similarity of selected regions. PFSA, in contrast, takes a fixed segment length of five successive Cα atoms to determine the folding shape, and then directly aligns the structural alphabets. It is not surprising that the PFSA and LGA methods rank structural comparisons differently. Because of their high similarity, comparing the 25 conformers of 1M2F requires a tool sensitive enough to distinguish structural perturbations. The PFSA approach provides a finer description of the folding conformation. Each PFSC code consistently represents the folding shape of five successive residues, and each PFSC vector can be transformed into another. The 27 PFSC vectors cover all possible folding shapes. Each conformer of 1M2F therefore acquires a complete assignment along the protein backbone, so the alignment is performed over the full length of the structure from N-terminus to C-terminus. Furthermore, with structural alphabets, PFSA adopts an unambiguous alignment process for protein comparison. In addition to the similarity score PFSA-S, the PFSA approach provides an explicit comparison in the alignment table, with a folding shape for each five residues. The PFSA approach therefore offers a complementary tool for the analysis of protein conformation.

PFSA vs. CE
The combinatorial extension (CE) method (Shindyalov & Bourne, 1998) breaks each structure in the query set into a series of fragments that it then attempts to reassemble into a complete alignment. A series of pairwise combinations of fragments are used to define a similarity matrix through which an optimal path is generated to identify the final alignment.
The size of each aligned fragment pairs is usually set to empirically determined values of 8 and 30 respectively. One group of 20 structures, the quaternary complex of cAMP dependent protein kinase, has certain structural similarities and is compared with the s t r u c t u r e o f 1 A T P -E b y C E m e t h o d a n d P F SA approach respectively. The results of comparisons between 1ATP-E and 20 of cAMP dependent protein kinases are listed in Table  9 which is sorted in the order by Z Score of CE. Two conclusions are observed from results. First, the ranks of similarity are overall agreed between CE and PFSA, except the structures with number 5, 8, 9, 10 and 18. With PFSA approach, the assessment of similarity is an aggregate of matched folding shape, structural topological distribution, gap and size of protein. Second, the CE method indicates that 20 protein structures have similar fold as structure of 1ATP-E. However, the PFSA has capability further to distinguish the dissimilarity between 20 structures of cAMP dependent protein kinases. According to CE method, if Z Score is larger than 3.5, the compared proteins have similar fold in structure. The values of Z Score of 20 structures of cAMP are from 3.9 to 7.9, so they all have similar fold structure as 1ATP-E. The values of PFSA-S for 20 structures are distributed in the wide range of 0.9145 -0.0565. According to PFSA approach, the value of PFSA-S is near one when two structures have high similarity, and on the contrary, the value of PFSA-S is near zero when two structures with less similarity. The PFSA-S value 0.0565 is for comparison between 1ATP-E and No. 20 of structure. PFSA-S near zero indicates that the pair of structures is dissimilar. It is noted that the sequence length of structures 20 is 94. To compare with 1ATP-E, two structures have big difference in length and the alignment generates 228 gaps which give the lower value of PFSA-S. 
Therefore, PFSA is able to distinguish structural differences in more detail.
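The contrast between the two criteria can be illustrated with a minimal Python sketch. It uses only the numbers quoted above (the CE threshold of 3.5, the Z Score range of 3.9 to 7.9, and the PFSA-S range of 0.9145 to 0.0565). The function name `ce_similar_fold` is hypothetical, and the sketch is not code from CE or PFSA.

```python
CE_Z_THRESHOLD = 3.5  # CE criterion quoted in the text: Z Score > 3.5 means similar fold


def ce_similar_fold(z_score: float) -> bool:
    """Hypothetical helper: True when the CE Z Score exceeds the threshold."""
    return z_score > CE_Z_THRESHOLD


# Extremes reported for the 20 structures compared with 1ATP-E.
# The two tuples are independent ranges, not matched pairs of values.
z_extremes = (7.9, 3.9)
pfsa_s_extremes = (0.9145, 0.0565)

# Both ends of the Z Score range receive the same binary verdict.
ce_verdicts = [ce_similar_fold(z) for z in z_extremes]

# PFSA-S spreads the same set of structures over most of the interval [0, 1].
pfsa_s_spread = pfsa_s_extremes[0] - pfsa_s_extremes[1]

print(ce_verdicts)
print(round(pfsa_s_spread, 4))
```

Under the threshold alone, every structure in the set falls into the same "similar fold" class, whereas a continuous score leaves room to rank the structures within that class.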

PFSA vs. other methods
A set of 10 pairs of proteins with low structural similarity, recognized as difficult cases for comparison, was evaluated by the VAST (Madej et al., 1995; Gibrat et al., 1996), DALI (Holm & Sander, 1993; Holm & Park, 2000), CE (Shindyalov & Bourne, 1998), ProSup (Lackner et al., 2000) and LGA (Zemla, 2003) methods. The structural similarity was evaluated by two criteria, i.e. a lower RMSD and a larger number of equivalent residues. Table 10 shows that the various methods gave comparable results for each pair of proteins, and that their results provide complementary information for protein structural comparison. Overall, the ProSup and LGA methods provided consistent results under the restriction of an RMSD less than 3.0. PFSA, however, offers new observations for the assessment of similarity between protein structures. First, the similarity can be evaluated by the single value PFSA-S. For comparison with the other methods, the information (sum of the numbers of identical and analogous shapes) / (number of gaps) / (PFSA-S) is listed in Table 10. The similarity score PFSA-S is determined by the number of identical shapes, the number of analogous shapes and the number of gaps. Second, the PFSA-S score can judge similarity across isolated comparisons, i.e. PFSA-S values from unrelated comparisons can be used to assess the degree of protein homology. Each pair of proteins in Table 10 is an isolated comparison without a common reference structure, yet the value of PFSA-S indicates which pair of structures has the higher similarity. In Table 10, the pairs are sorted by PFSA-S in descending order. For example, the pair 1CEW-I and 1MOL-A has PFSA-S = 0.564, and the pair 1CID and 2RHE has PFSA-S = 0.384. It may therefore be concluded that the pair 1CEW-I and 1MOL-A has higher structural similarity than the pair 1CID and 2RHE.
Third, the relative size of the compared proteins contributes to the structural similarity in the PFSA approach. In the other methods, the value of RMSD is often used to judge structural similarity. For example, the pair 1CEW-I and 1MOL-A has the RMSD values VAST: 2.0, DALI: 2.3, CE: 2.3, ProSup: 1.9 and LGA: 2.0; the pair 1TEN and 3HHR-B has the RMSD values VAST: 1.6, DALI: 1.9, CE: 1.9, ProSup: 1.7 and LGA: 1.9. Both pairs have lower RMSD than the remaining pairs, and the various methods agree overall. The PFSA approach, however, distinguishes these two pairs by PFSA-S. With PFSA-S = 0.456, the pair 1TEN and 3HHR-B is ranked below five other pairs, including the pair 1CEW-I and 1MOL-A with PFSA-S = 0.564. The separation is explained by the fact that the pair 1CEW-I and 1MOL-A has comparable sequence lengths (108 : 94), whereas the pair 1TEN and 3HHR-B has a larger difference in length (99 : 195). The relative difference in size is thus taken into account in the PFSA approach. Therefore, with the normalization of PFSA-S, the degree of similarity can still be evaluated for separate comparisons without a common reference protein.
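The ranking argument can be reproduced with a short Python sketch that uses only the values quoted above. The `length_ratio` helper (shorter length divided by longer length) is an illustrative assumption and is not the normalization used inside PFSA-S; it only makes the difference in relative size between the two pairs explicit.

```python
from typing import NamedTuple, Optional, Tuple


class PairResult(NamedTuple):
    pair: str
    pfsa_s: float
    lengths: Optional[Tuple[int, int]]  # sequence lengths, when quoted in the text


# Values quoted in the text for three of the ten pairs in Table 10.
results = [
    PairResult("1CID / 2RHE", 0.384, None),
    PairResult("1TEN / 3HHR-B", 0.456, (99, 195)),
    PairResult("1CEW-I / 1MOL-A", 0.564, (108, 94)),
]


def length_ratio(lengths: Tuple[int, int]) -> float:
    """Hypothetical helper: shorter length over longer length (1.0 = equal size)."""
    return min(lengths) / max(lengths)


# Sort by PFSA-S in descending order, as in Table 10.
ranked = sorted(results, key=lambda r: r.pfsa_s, reverse=True)

for r in ranked:
    ratio = f"{length_ratio(r.lengths):.2f}" if r.lengths else "n/a"
    print(f"{r.pair:16s} PFSA-S = {r.pfsa_s:.3f}  length ratio = {ratio}")
```

With these inputs the length ratio is about 0.87 for 1CEW-I and 1MOL-A and about 0.51 for 1TEN and 3HHR-B, which is consistent with the lower PFSA-S of the second pair despite its lower RMSD.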

Conclusion
The PFSA approach adopts the folding shape vector of five residues as its element, and the geometric features of each folding shape are represented by letters of an alphabet. With this alphabetic representation, the alignment of protein structures is straightforward and stable. This study demonstrates two advantages of the PFSA approach. First, the 27 PFSC vectors are able to cover all possible folding shapes of five successive Cα atoms in a protein. This is fundamentally important because it offers a complete description of the folding conformation of any protein with a given 3D structure. Second, with a consistent procedure, the PFSA approach generates a unique similarity score and detailed information in an alignment table, which provides new observations for protein structure comparison.

Acknowledgments
This work was supported in part by a grant from the Indiana Spinal Cord and Brain Injury Research Fund (2009-2011). The PFSC and PFSA algorithms have been implemented in Java (J2SE v.1.5.0_07). Requests for additional information will be accepted via e-mail to info@proteinshape.com, jiaan@microtechnano.com, jiaanyang@comcast.net or via the website http://www.proteinshape.com.