Protein Interactome and Its Application to Protein Function Prediction



Introduction
Diverse molecules interact with proteins to produce biological functions. Proteins interact with many other molecules, including other proteins, nucleic acids, carbohydrates, lipids, minerals, metabolites, and chemical compounds, and thereby play diverse roles within and between cells. Some proteins are located in subcellular organelles, where they modulate biochemical reactions; others are located in membranes, where they relay various stimuli to signalling pathways. Cellular systems can be represented as complex networks in which the molecules are nodes and the associations among the molecules are edges. The complete set of molecular interactions in such a network is referred to as an interactome. Although all types of interaction in the interactome are important, we focus here on protein-protein interactions (PPIs), since they are fundamental to cellular systems. To function correctly, a protein must interact with other proteins in the context of complex formation, signalling pathways and biochemical reactions. To perform a specific biological function, these interactions must form with the proper partners at the right time and location.
Genome sequencing of model organisms and of human has elucidated a large number of previously unknown molecular structures and interactions within nucleic acids. In the post-genomic era, functional genomics has emerged as an area of research that seeks to annotate every element of the genome structure with its relevant biological function. Still, many proteins (or genes) remain functionally unannotated (Apweiler et al, 2004; Sharan et al, 2007). These missing links between structure and function need to be resolved in order to understand complex biological phenomena, including human diseases, development and aging.
Protein function has been defined in several different ways. It is highly context- and condition-dependent, and proteins participate in most biological processes. There have been various attempts to categorize protein functions. One of them divides protein function into three parts: molecular function, cellular function and phenotypic function. First, the molecular function is defined as the biochemical reactions performed by a protein. Second, the cellular function is defined as the various pathways with which a protein is associated. Lastly, the phenotypic function is defined as the integration of all physiological subsystems in response to environmental stimuli.
Aside from these conceptual definitions, many efforts to annotate protein function have been undertaken (Table 1). One of these, the Gene Ontology (GO) consortium (Ashburner et al, 2000), provides a standardized, multi-label, hierarchical annotation of proteins in three categories: biological process, molecular function and cellular component. The GO consortium regularly accumulates protein annotations for each GO category in open databases. In this chapter, we consider all three kinds of GO terms when annotating protein function.
Many experimental techniques are available for discovering protein function, such as gene knockout and transcript knockdown, but these approaches are low-throughput and time-consuming. In recent decades, novel high-throughput techniques have been developed, and we are now able to analyse genome-wide data, which is broadening our biological insight. Computational methods are necessary for analysing the massive quantity of data, and they complement both the low- and high-throughput experimental methods.
In this chapter, we first introduce the PPI data available through public databases and compare the contents of the major databases. We also describe experimental and computational methods for detecting PPIs. Next, network- and non-network-based computational methods for identifying protein function are described. Finally, computational methods for predicting protein subcellular localization, especially those that exploit PPI data, are presented.

PPI data
PPIs can be considered one kind of protein interactome. Proteins interact with one another in a biological context to carry out specific functions. Given that a single gene can express distinct transcripts and protein isoforms, a protein interacts with other proteins, including itself, to carry out a specific function. PPIs are defined as physical interactions between protein pairs (Bonetta, 2010). There are also non-physical interactions, such as genetic and functional interactions. A genetic interaction is typically said to occur when the simultaneous perturbation of two genes yields a quantitative phenotype that is greater or smaller than expected (Mani et al, 2008). Functional interaction between two proteins is a much broader concept than the other, experiment-derived interactions. It may include any functionally associated gene or protein pair that is integrated and predicted from heterogeneous data. We will explain these computational prediction methods later in this section.
Physical interactions between protein pairs can be either direct or indirect. A binary interaction is an example of a direct interaction, while indirect interactions include those between subunits of a protein complex. To carry out a specific function, proteins often form a large complex that includes both direct and indirect interactions among the participating proteins. Interactions can also be distinguished by their binding lifetime. Some interactions between protein pairs are transient, associating and dissociating under particular physiological conditions. Other proteins form stable complexes in which the participants interact with each other permanently. Various PPI types are defined in standards and annotated across many PPI databases (Cote et al, 2010; Kerrien et al, 2007).

PPI databases
Currently, 132 PPI databases are indexed by Pathguide (Bader et al, 2006; accessed 23 Dec 2011). The number of physical interactions to date is 386,495 across all species when 11 major databases are integrated by iRefWeb (accessed 23 Dec 2011). PPI data derived from both high- and low-throughput experiments are deposited together in primary databases, which manually curate experimental results. These primary databases include not only physical interactions but also genetic interactions, and annotate the standard minimal information about a molecular interaction (MIMIx). Literature curation is inconsistent across different databases. Turinsky et al. reported that the agreement between curated interactions from 15,471 papers shared across nine databases was only 42% for interactions and 62% for proteins. These figures are averages over pairs of databases that curated the same publication. Some of the primary databases have formed a consortium, the International Molecular Exchange (IMEx), to enhance the quality of literature curation.
Since there are many primary databases, their comprehensive integration has become an active research field. The resulting meta-databases minimize the redundancy and inconsistency that limit the primary databases. Moreover, functional interaction databases contain both experimentally detected and computationally predicted data. Predicted and experimental PPIs sometimes need to be distinguished according to their degree of confidence. Both give useful information, but they should be separated according to the relevant evidence codes. There are also species-specific functional interaction databases (Lee et al, 2011; Lee et al, 2010a). We have listed some of the major primary databases, meta-databases, and functional databases in Table 2. Comparisons among the primary databases are shown in Table 3. We compared various features, including interaction types, detection methods, references, and biological and experimental roles. This information should be valuable to researchers who need to select and integrate various PPI databases.

Methods for PPI identification
There are two major ways to determine PPIs: experimental detection and computational prediction. The former is more reliable and well-established at both small and large scales, while the latter is based on the characteristics of accumulated protein interactions. In this section, we briefly describe both approaches.

Experimental detection methods for PPIs
Interactions between protein pairs can be detected experimentally by various methods. Here, we describe only two representative methods: yeast two-hybrid (Y2H) (Suter et al, 2008) and mass spectrometry (MS) (Berggard et al, 2007). Both methods detect physical PPIs, but the type of PPI detected differs. As previously stated, direct, binary PPIs are distinct from the grouping of proteins in a complex, and of these two methods only Y2H detects the binary type. Y2H uses a transcription factor found in yeast that consists of two separable domains. The coding sequence of one protein is fused to one of the domains and that of another protein to the other domain, using plasmids. A PPI can then be assessed from the phenotype produced by the target gene of the transcription factor. The Y2H method can detect PPIs on a large scale and its sensitivity is high, enabling the detection of even weak, transient PPIs. However, since the experiment takes place only in the nucleus, the true location of such PPIs is hard to annotate, which obscures detailed biological interpretation. Moreover, Y2H detects only binary interactions and produces a high rate of false positives, which are noteworthy limitations.
Another method in this category is based on mass spectrometry (MS). MS measures the mass of molecules rapidly and accurately. If the masses of all proteins in question are known, a measured mass can be linked to a specific protein. This method is powerful when protein co-complexes are examined. Although it cannot show which interactions are direct, it can reveal the grouping of proteins in a complex. In this method, one protein (the "bait") and all of its interacting partners in a complex are pulled out and separated by electrophoresis. Finally, all the components separated by electrophoresis are analysed by MS. Depending on the sampling strategy, this method can yield many false positives: the sample might include spurious interactions, resulting in a high rate of false readings. Many strategies have been proposed to address this problem (Bousquet-Dubouch et al, 2011; Gingras et al, 2007). The experimental results obtained with MS-based methods differ from those obtained with binary methods (Y2H). Data derived from co-complex experiments cannot be given a binary interpretation directly. An algorithm is needed to translate group-based observations into pairwise interactions.
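As an illustration of the last point, the sketch below shows two conventions that are commonly used to turn a co-complex purification into pairwise interactions, usually called the spoke model and the matrix model. Neither is described in this chapter, and the bait and prey names are hypothetical; the code is a minimal sketch of the idea rather than the procedure used by any particular database.

```python
from itertools import combinations


def spoke_model(bait, preys):
    """Pair the bait with each co-purified prey (bait-prey edges only)."""
    return {tuple(sorted((bait, prey))) for prey in preys if prey != bait}


def matrix_model(bait, preys):
    """Pair every member of the purified group with every other member."""
    members = sorted({bait, *preys})
    return set(combinations(members, 2))


# Hypothetical purification: bait "A" pulled down together with "B", "C", "D".
bait, preys = "A", ["B", "C", "D"]
spoke_edges = spoke_model(bait, preys)
matrix_edges = matrix_model(bait, preys)

print(sorted(spoke_edges))
print(sorted(matrix_edges))
```

The spoke model is conservative and may miss prey-prey contacts, whereas the matrix model includes every possible pair and is therefore more likely to introduce false positives.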

Computational prediction methods for PPIs
While recent reviews (Pitre et al, 2008; Shoemaker & Panchenko, 2007; Skrabanek et al, 2008; Xia et al, 2010) have discussed computational prediction methods for PPIs in detail, we here briefly introduce some of the approaches that are widely used.
Although experimental PPI resources are growing rapidly, proteome-wide PPI information is still lacking and is mostly limited to a few model organisms. By exploiting a wide variety of indirect but genome-wide resources, we can enhance our understanding of the overall protein interactome. Methods for predicting direct physical PPIs have been investigated less than methods for predicting functional associations between protein pairs. Functional association methods indicate which protein pairs take part in the same biological process and may interact physically.
The first type of data used in these prediction methods is genomic sequence. Co-occurrence-based methods assume that gene pairs that are co-inherited across species during evolution are functionally associated (Barker & Pagel, 2005; Bowers et al, 2004; Pellegrini et al, 1999). These methods have been applied to microorganisms and have successfully discovered novel participants in known pathways (Carlson et al, 2004; Luttgen et al, 2000).
Other, similar methods based on genomic sequence use information about gene fusion events (Reid et al, 2010; Zhang et al, 2006) and gene neighbourhood (Ferrer et al, 2010; Itoh et al, 1999; Koonin et al, 2001). Another type of data is the amino acid (AA) sequence: the interfaces of interacting protein pairs are composed of specific AA residues (Tuncbag et al, 2008; Tuncbag et al, 2009). This is reflected in the coevolution of specific interface residues between interacting proteins, and coevolution detected from multiple sequence alignments is highly correlated with physical PPIs (Pazos et al, 2005).
Commonly occurring domain pairs are also considered in this context (Eddy, 2009; Finn et al, 2010; Stein et al, 2009; Yeats et al, 2011), and simple AA sequence features, such as 3-mers of interacting residues, can be used (Ben-Hur & Noble, 2005). Another well-known source of information is the homology of PPIs across different species. Methods based on this information simply look for PPIs that are conserved across species, called interologs (Matthews et al, 2001). Here, known PPIs are used as queries to find conserved interactions in other species using an ortholog database. Many algorithms follow this approach (Kemmer et al, 2005; Persico et al, 2005). Aside from sequence-level data, structural information, especially protein 3D structure, is also a valuable resource for predicting PPIs (Aloy & Russell, 2003; Ezkurdia et al, 2009; Hosur et al, 2011; Shoemaker et al, 2010; Singh et al, 2010; Zhang et al, 2010). The huge number of genome-wide gene expression profiles is another useful resource: co-expression patterns are computed for gene pairs, and a higher degree of correlation is taken to indicate a higher probability of a PPI (Grigoriev, 2001; Lukk et al, 2010; Stuart et al, 2003). As shown in the earlier section, there are many literature-curated PPI databases. While those databases rely on manual inspection, such PPI information can also be extracted automatically using text-mining algorithms (Blaschke et al, 2001; Szklarczyk et al, 2011; Tikk et al, 2010).
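The co-occurrence idea can be sketched with phylogenetic profiles, that is, presence/absence vectors of a gene across a set of genomes. The example below is a minimal sketch under stated assumptions: the gene names, the profiles and the threshold are hypothetical, and the Jaccard index is just one of several similarity measures that could be used here.

```python
from itertools import combinations


def profile_similarity(p, q):
    """Jaccard index of two presence/absence profiles of equal length."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def co_occurring_pairs(profiles, threshold=0.8):
    """Return gene pairs whose profiles are at least `threshold` similar."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if profile_similarity(profiles[g1], profiles[g2]) >= threshold
    ]


# Hypothetical profiles: 1 = gene present in that genome, 0 = absent.
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 0, 1),
    "geneD": (1, 1, 1, 1, 1),
}

candidate_pairs = co_occurring_pairs(profiles)
print(candidate_pairs)
```

A gene that is present in every genome, such as geneD here, carries little co-inheritance signal, which is one reason why real applications weight or filter profiles before comparing them.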
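The interolog approach described above can be sketched as a simple mapping step. In this minimal sketch, the protein identifiers and the ortholog table are hypothetical, and no confidence filtering is applied; real pipelines typically add sequence-similarity or evidence thresholds.

```python
def transfer_interologs(source_ppis, orthologs):
    """Map known PPIs from a source species onto a target species.

    source_ppis: iterable of (protein_a, protein_b) pairs in the source species.
    orthologs:   dict mapping a source protein to its target-species orthologs.
    """
    predicted = set()
    for a, b in source_ppis:
        for a_target in orthologs.get(a, ()):
            for b_target in orthologs.get(b, ()):
                predicted.add(tuple(sorted((a_target, b_target))))
    return predicted


# Hypothetical source-species interactions and ortholog assignments.
source_ppis = [("Y1", "Y2"), ("Y2", "Y3")]
orthologs = {"Y1": ["H1"], "Y2": ["H2a", "H2b"]}  # Y3 has no known ortholog

predicted_ppis = transfer_interologs(source_ppis, orthologs)
print(sorted(predicted_ppis))
```

An interaction is transferred only when both partners have an ortholog in the target species, so the Y2-Y3 pair yields no prediction in this example.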

Computational prediction methods for protein function
Even before genome sequencing technologies became prevalent, protein function was identified experimentally. Such studies focused on a specific target gene or protein, or on a small set of protein complexes. Gene knockout, knockdown of gene expression, and targeted mutation are some of the methods used to identify protein function (Recillas-Targa, 2006; Skarnes et al, 2011). These low-throughput experiments have been complemented by high-throughput experiments, including genome sequencing and determination of the protein interactome. Computational methods have been developed to analyse the massive amounts of archived data. Based on the assumption that structural similarity correlates with functional similarity, homology-based functional annotation across organisms has now become a routine approach (Aloy et al, 2001; Gaudet et al, 2011).

Non-network based approaches
Classical computational methods predict protein function using features from a single protein only. These approaches use features such as amino acid sequences, genome sequences, protein structures (2D and 3D), phylogenetic data, and gene expression data. PSI-BLAST (Altschul et al, 1997) and FASTA (Mount, 2007) are popular sequence alignment tools used to find known proteins that are homologous to an unknown (query) protein. Proteins with similar sequences are assumed to have similar functions. Moreover, protein folding patterns are sufficiently conserved to identify homologs (Sanchez-Chapado et al, 1997). Comparative genomics across different species is a powerful approach for the functional annotation of proteins. In fact, it has been suggested that the correlation between sequence and structure is much stronger than that between sequence and function (Smith et al, 2000; Whisstock & Lesk, 2003). Many approaches therefore take the route from sequence to structure to function when predicting protein function (Fetrow & Skolnick, 1998).
However, each of these data types shows only a single aspect of the functional features conserved during evolution. Data derived from different sources can be inter-connected and should be integrated so that they can be analysed simultaneously (Kemmeren & Holstege, 2003). We next show that PPI networks can reveal functional relationships between protein pairs that may not be detectable from other genomic data, such as primary sequence or higher-level structure.

Network-based approaches
As we mentioned in the Introduction, biological function is rarely achieved by a single protein acting alone. Rather, proteins interact dynamically with each other, and interacting partners tend to take part in the same specific functions. With the plethora of data being generated by high-throughput proteomic experiments, it has become possible to use proteome-wide PPI patterns to predict protein function. Among the many types of protein interactome, the PPI network is a well-known source of data that is invaluable for predicting protein function. The function of an uncharacterized protein can be annotated according to those of its neighbours that are functionally annotated. This assumption rests on a simple idea called "guilt-by-association". Here, an association is a possible physical interaction under any condition; functional associations are sometimes also included, together with a relevant evidence score.
Here, we review the general network-based approaches to predicting protein function. For clarity, we divide these approaches into two groups. The first infers protein function directly from the topological structure of a PPI network. The second first identifies distinct sub-networks within the whole PPI network. These sub-networks are also referred to as functional modules, since they correspond to units that perform specific biological functions, such as protein complexes and metabolic and signalling pathways. Functional modules are detected by a broad variety of clustering algorithms, after which each module is annotated with the appropriate function. In this section, the basic concepts and pioneering studies of both approaches are introduced.

Neighbourhood approaches
Direct functional annotation exploits the correlation between network distance and functional similarity: the closer two proteins are in the network, the more similar their functions are expected to be. One of the earliest studies considered only the adjacent neighbours of a protein within the entire PPI network. This simple approach used the immediate neighbourhood and assigned up to three of the most common functions among the neighbours. Despite its simplicity, the approach achieved an accuracy of 72% (Schwikowski et al, 2000). However, this method provided no significance value for each association, and the full network topology was not considered in the annotation process. A strategy was proposed to tackle the first problem by assigning statistical significance (Hishigaki et al, 2001). This was done using χ²-like scores and, instead of only the immediate neighbours, the n-neighbourhood of a protein, which consists of the proteins within n links of that protein. Simply put, the neighbours of adjacent neighbours are also taken into account, together with the frequencies of the functions found in this neighbourhood. For an unknown protein, the functional enrichment in its n-neighbourhood is identified with a χ² test, and the top-ranking functions are assigned to the unknown protein. In another approach, the neighbourhood shared by a pair of proteins is considered in addition to the neighbourhood of the protein of interest. Chua et al. investigated the correlation between functional similarity and network distance (Chua et al, 2006). They developed a functional similarity score, called the FS-weight measure, which gives different weights to proteins depending on their network distance from the query protein. This approach showed higher accuracy when indirect interactions and their functional associations were employed.
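The two earliest neighbourhood strategies described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the published implementations: the network and annotations are hypothetical, and the χ²-like score is taken to be (observed - expected)² / expected, where the expected count is the neighbourhood size multiplied by the background frequency of the function among annotated proteins.

```python
from collections import Counter, deque


def neighbourhood(graph, protein, n=1):
    """Proteins within n links of `protein` (excluding itself), found by BFS."""
    seen = {protein}
    frontier = deque([(protein, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == n:
            continue
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    seen.discard(protein)
    return seen


def majority_rule(graph, annotations, protein, top=3):
    """Up to `top` most common functions among the immediate neighbours."""
    counts = Counter(
        f for nb in graph.get(protein, ()) for f in annotations.get(nb, ())
    )
    return [f for f, _ in counts.most_common(top)]


def chi_square_scores(graph, annotations, protein, n=1):
    """Chi-square-like score per function in the n-neighbourhood (assumed form)."""
    hood = neighbourhood(graph, protein, n)
    annotated = [p for p, funcs in annotations.items() if funcs]
    background = Counter(f for p in annotated for f in annotations[p])
    observed = Counter(f for nb in hood for f in annotations.get(nb, ()))
    scores = {}
    for function, count in background.items():
        expected = len(hood) * count / len(annotated)
        scores[function] = (observed[function] - expected) ** 2 / expected
    return scores


# Hypothetical network (symmetric adjacency lists); "P" is unannotated.
graph = {
    "P": ["A", "B", "C"], "A": ["P", "D"], "B": ["P"], "C": ["P"],
    "D": ["A", "E"], "E": ["D", "F"], "F": ["E"],
}
annotations = {
    "A": ["transport"], "B": ["transport"], "C": ["kinase"],
    "D": ["kinase"], "E": ["kinase"], "F": ["kinase"],
}

print(majority_rule(graph, annotations, "P"))
print(chi_square_scores(graph, annotations, "P", n=1))
```

Note that a score of this form is also large when a function is depleted in the neighbourhood, so a practical ranking would only consider functions whose observed count exceeds the expected count.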

Global optimization approaches
Although the neighbourhood approach is attractive and effective because of its simplicity, it falls short when a protein has too few neighbours or too few of them are annotated. To overcome this issue, several approaches that use the entire topology of the network have been proposed. These global approaches attempt to optimize the annotation of proteins of unknown function using the topology of the whole network. One of the first studies to take this approach used the theory of Markov random fields to determine the probability of a protein having a certain function (Deng et al, 2004). The theory is used to express the joint probability of the whole interaction network with respect to a certain function. This formulation is then transformed into the conditional probability of a protein having the function given the annotations of its interaction partners. After that, Gibbs sampling is applied iteratively to determine stable values of this probability for each protein. This approach achieved higher performance than the neighbourhood-based approaches (Chua et al, 2006; Hishigaki et al, 2001; Schwikowski et al, 2000) when applied to yeast PPI data.
Further studies followed this approach. In one of them, the objective function is defined for the whole network as a sum of the following variables (Vazquez et al, 2003).
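The Gibbs sampling step can be illustrated with a deliberately simplified sketch for a single function. It is not the model fitted by Deng et al: the conditional probability is assumed to be a logistic function of the numbers of neighbours with and without the function, and the parameters `alpha`, `w_with` and `w_without`, as well as the network, are hypothetical values chosen only for illustration.

```python
import math
import random


def gibbs_function_probabilities(graph, known, alpha=-1.0, w_with=1.0,
                                 w_without=-0.5, sweeps=500, burn_in=100,
                                 seed=0):
    """Estimate P(protein has the function) for unannotated proteins.

    graph: dict protein -> list of neighbours (symmetric).
    known: dict protein -> 1 (has the function) or 0 (does not); kept fixed.
    """
    rng = random.Random(seed)
    unknown = [p for p in graph if p not in known]
    state = dict(known)
    for p in unknown:
        state[p] = 1 if rng.random() < 0.5 else 0  # random initial labels
    hits = {p: 0 for p in unknown}
    for sweep in range(sweeps):
        for p in unknown:
            n_with = sum(state[nb] for nb in graph[p])
            n_without = len(graph[p]) - n_with
            logit = alpha + w_with * n_with + w_without * n_without
            prob = 1.0 / (1.0 + math.exp(-logit))
            state[p] = 1 if rng.random() < prob else 0  # resample the label
        if sweep >= burn_in:
            for p in unknown:
                hits[p] += state[p]
    return {p: hits[p] / (sweeps - burn_in) for p in unknown}


# Hypothetical network: U1 sits among annotated positives, U2 among negatives.
graph = {
    "U1": ["K1", "K2", "K3", "U3"], "U2": ["N1", "N2", "N3"], "U3": ["U1"],
    "K1": ["U1"], "K2": ["U1"], "K3": ["U1"],
    "N1": ["U2"], "N2": ["U2"], "N3": ["U2"],
}
known = {"K1": 1, "K2": 1, "K3": 1, "N1": 0, "N2": 0, "N3": 0}

probabilities = gibbs_function_probabilities(graph, known)
print(probabilities)
```

Because U3 is itself unannotated, its estimate depends on the sampled label of U1, which is how information propagates beyond immediate neighbours in this kind of model.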
1. The number of neighbours of a protein having the same function as itself.
2. The number of neighbours of a protein having the function under consideration.
Thus, this function estimates the number of pairs of interacting proteins with no common functional annotation. Since a high value of this function is biologically undesirable, it is minimized using a simulated annealing procedure. As expected, this approach outperformed the majority rule-based strategy (Schwikowski et al, 2000) on the Saccharomyces cerevisiae interaction data, since the latter optimizes only the second factor above. An additional advantage of this approach is that multiple annotations for all proteins are obtained in a single run, unlike earlier approaches, which ran independent optimization procedures for different functions.
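The optimization described above can be sketched as follows. This is a simplified illustration and not the published procedure: each protein is given exactly one function, the energy is the number of interacting pairs with different functions, and the cooling schedule, step count and example network are assumptions made for the sketch.

```python
import math
import random


def energy(edges, assignment):
    """Number of interacting pairs whose assigned functions differ."""
    return sum(1 for a, b in edges if assignment[a] != assignment[b])


def anneal(edges, known, unknown, functions, steps=2000,
           t_start=2.0, t_end=0.01, seed=0):
    """Assign functions to `unknown` proteins by simulated annealing."""
    rng = random.Random(seed)
    assignment = dict(known)
    for p in unknown:
        assignment[p] = rng.choice(functions)
    current = energy(edges, assignment)
    best, best_energy = dict(assignment), current
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        protein = rng.choice(unknown)
        previous = assignment[protein]
        assignment[protein] = rng.choice(functions)
        proposed = energy(edges, assignment)
        delta = proposed - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = proposed  # accept the move
            if current < best_energy:
                best, best_energy = dict(assignment), current
        else:
            assignment[protein] = previous  # reject the move
    return best, best_energy


# Hypothetical network: K* proteins are annotated, U* proteins are not.
edges = [("K1", "U1"), ("K2", "U1"), ("U1", "U2"), ("U2", "K3"), ("U2", "K4")]
known = {"K1": "transport", "K2": "transport", "K3": "kinase", "K4": "kinase"}

best, best_energy = anneal(edges, known, ["U1", "U2"], ["transport", "kinase"])
print(best_energy, best["U1"], best["U2"])
```

Recomputing the full energy at every step keeps the sketch short; on a real network one would update only the edges that touch the protein being changed.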
The above discussion shows that a wide variety of approaches based on principles of global optimization have been proposed in the literature and many more are in the pipeline. The most accurate results in the field of function prediction from PPI networks have also been achieved by these approaches, which is intuitively acceptable since they extract the maximum benefit from the knowledge of the structure of the entire network.

Indirect annotation of protein function
This approach does not use the protein interaction network directly for annotation. Instead, it first identifies functional modules and then assigns functions to unknown proteins based on their membership in these modules. It rests on the assumption that most biological networks are organized into distinct sub-networks that carry out specific functions (Hartwell et al, 1999). We assume that proteins in the same module participate in a similar biological process. Modular patterns and dense regions have indeed been found in PPI networks (Gavin et al, 2006).

Distance-based clustering approaches
To find biologically significant modules, clustering algorithms can be applied efficiently. Clustering is a popular unsupervised learning technique that does not use any prior information about class labels. There are two widely used kinds of clustering: topology-based and distance-based. The key step in distance-based clustering is to select the similarity measure between two proteins that is used to detect modules. The distance between two proteins (nodes) in a network is usually defined as the number of interactions (edges) on the shortest path between them. However, a serious problem arises when this distance is used in hierarchical clustering, known as the 'ties in proximity' problem (Arnau et al, 2005): the distances between many protein pairs are identical.
To solve this problem, a network clustering method was developed that identifies modules in a biological network based on the fact that each node has a unique pattern of shortest path lengths to every other node, whereas the members of a specific module share similar patterns of shortest path lengths (Rives & Galitski, 2003). Another study used hierarchical clustering with the shortest path length between proteins as the distance measure and overcame the 'ties in proximity' problem by exploiting the equally valid hierarchical clustering solutions that arise when ties are broken by random selection (Arnau et al, 2005). Although many similarity measures have been proposed, validating such methods with a single criterion is insufficient. Two evaluation schemes have therefore been suggested, based on the depth of the hierarchical tree and the width of the ordered adjacency matrix (Lu et al, 2004). Furthermore, there are various types of cellular network with distinct modular patterns, so network-specific methods should be investigated in the future.
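The shortest-path distance, the 'ties in proximity' problem and the per-node pattern of shortest path lengths discussed above can be made concrete with a small example. The network below is hypothetical, and the code is only a minimal sketch based on breadth-first search.

```python
from collections import Counter, deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: number of edges from `source` to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def distance_profiles(graph):
    """For each node, its shortest path lengths to all nodes (sorted order)."""
    nodes = sorted(graph)
    profiles = {}
    for node in nodes:
        dist = shortest_path_lengths(graph, node)
        profiles[node] = [dist.get(other) for other in nodes]
    return profiles


# Hypothetical network: the path A-B-C-D with an extra protein E bound to B.
graph = {
    "A": ["B"], "B": ["A", "C", "E"], "C": ["B", "D"], "D": ["C"], "E": ["B"],
}

profiles = distance_profiles(graph)
nodes = sorted(graph)

# Count how many protein pairs share each distance value.
tie_counts = Counter(
    profiles[a][j]
    for i, a in enumerate(nodes)
    for j in range(i + 1, len(nodes))
)

print(profiles)
print(tie_counts)
```

Even in this five-protein network only three distinct distance values occur among ten pairs, which is the tie problem in miniature; the full profiles, by contrast, differ between every pair of proteins while remaining similar for proteins in equivalent positions, such as A and E.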

Graph-based clustering approaches
Dissecting functional modules in a large PPI network is equivalent to the problem of graph partitioning and clustering. One of the pioneering methods based on network topology was MCODE (the molecular complex detection algorithm) (Bader & Hogue, 2003). This method predicts complexes in a large PPI network in three stages. First, the nodes of the network are weighted by their core clustering coefficients (the density of the largest k-core of the adjacent neighbourhood), and then densely connected modules are identified in a greedy fashion. The core clustering coefficient was proposed instead of the standard clustering coefficient because it increases the weights of densely interconnected graph regions while giving small weights to less connected nodes. The final stage filters or adds proteins based on connectivity criteria. This method has been applied to large-scale PPI networks and is available as a plug-in for Cytoscape (Kohl et al, 2011).
A similar study for finding complexes and functional modules is based on superparamagnetic clustering. This method uses an analogy with the physical properties of a heterogeneous ferromagnetic model to detect densely connected clusters in a large graph (Spirin & Mirny, 2003). There is also an algorithm called restricted neighbourhood search clustering (RNSC), which starts with an initial random cluster assignment and then proceeds by reassigning nodes to maximize the score of the partition. Here, the score represents the intra-connectivity within a cluster, not the inter-connectivity across clusters. The RNSC algorithm has been reported to perform better than the MCODE algorithm (King et al, 2004). The Markov clustering algorithm (MCL) is another fast and scalable clustering algorithm, based on the simulation of random walks on the underlying graph (Pereira-Leal et al, 2004). This algorithm assumes that a random walker that starts in a natural cluster (i.e. a dense region of the graph) rarely moves to another natural cluster. Such clusters are identified structurally by the MCL algorithm. It starts by building a stochastic "Markov" matrix from the probabilities of random walks through the graph, and then alternates two operations: expansion and inflation. Expansion takes the squared power of the matrix, while inflation takes the Hadamard power of the matrix followed by a re-scaling, so that the resulting matrix remains stochastic. Expansion and inflation are alternated until the graph is partitioned into distinct subsets with no paths between them; these subsets are the clusters. This algorithm can be implemented efficiently for weighted, large and dense graphs. The MCL algorithm has been applied to various PPI networks to find functional modules such as protein complexes (Krogan et al, 2006).
A protein may have multiple functions, and this characteristic leads to overlap between different modules. Partitioning the graph in a strict manner might therefore not be appropriate for a PPI network. However, most current methods are based on hard-partition algorithms, meaning that each protein can belong to only one module. To address this limitation, a clustering algorithm based on information flow was suggested. This algorithm efficiently identified overlapping clusters in a weighted PPI network by integrating the semantic similarity between GO function terms (Cho et al, 2007).
Since the proteins shared by overlapping modules can be interpreted as bridges connecting the different modules, biologically significant and functional sub-networks could be identified. Still, there are few clustering methods that identify such overlapping modules. Novel clustering methods of this kind, with enhanced prediction accuracy, are required.
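The alternation of expansion and inflation in MCL can be sketched in a few lines of plain Python. This is a minimal sketch rather than the reference implementation: self-loops are added before normalization (a common practice that is assumed here), the inflation parameter, the convergence tolerance and the cluster read-out threshold are illustrative choices, and the example graph is hypothetical.

```python
def normalize_columns(matrix):
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    size = len(matrix)
    sums = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    return [[matrix[i][j] / sums[j] for j in range(size)] for i in range(size)]


def expand(matrix):
    """Expansion: squared power of the matrix (two-step random walks)."""
    size = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]


def inflate(matrix, power):
    """Inflation: Hadamard (entry-wise) power followed by re-scaling."""
    return normalize_columns([[value ** power for value in row] for row in matrix])


def markov_clustering(adjacency, inflation=2.0, max_iterations=50,
                      tolerance=1e-9):
    size = len(adjacency)
    # Assumption: add self-loops so that every column has non-zero mass.
    with_loops = [[1.0 if i == j else float(adjacency[i][j])
                   for j in range(size)] for i in range(size)]
    matrix = normalize_columns(with_loops)
    for _ in range(max_iterations):
        updated = inflate(expand(matrix), inflation)
        change = max(abs(updated[i][j] - matrix[i][j])
                     for i in range(size) for j in range(size))
        matrix = updated
        if change < tolerance:
            break
    # Rows that keep non-zero entries act as attractors; the columns with
    # mass in such a row form one cluster.
    clusters = set()
    for row in matrix:
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for a, b in edges:
    adjacency[a][b] = adjacency[b][a] = 1

clusters = markov_clustering(adjacency)
print(clusters)
```

Larger inflation values generally produce finer clusters, so this parameter is the main knob to tune when the method is applied to a real PPI network.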

Prediction of protein subcellular localization

Introduction
After synthesis, proteins must move to specific locations in order to function correctly. Knowing the subcellular localization of a protein is therefore important for understanding its function. For unicellular organisms such as budding and fission yeasts, protein localization has been determined systematically by experimental studies. However, such studies have been difficult to perform in higher eukaryotes such as Caenorhabditis elegans, Drosophila melanogaster, or mammals, because of their large proteomes and the technical difficulties associated with protein tagging.
Efficient bioinformatic methods are therefore required to complement wet-lab experiments. Indeed, many computational classification methods for predicting the subcellular localization of proteins have been proposed over several decades. Typically, these algorithms take a list of features as input and output the subcellular localizations of the target proteins. The features describe various characteristics of the proteins, such as molecular weight, amino acid content and codon bias. Input features for predicting subcellular localization can be broadly divided into four categories: protein sorting signals, empirically correlated characteristics, sequence homology with known answer sets, and other sources (Imai & Nakai, 2010).
During the training phase, these methods learn from a set of gold-standard proteins whose localizations are well known, each represented by a feature vector. Training produces a model that recognizes the features, or patterns of features, that are informative, and the model is then used to predict the subcellular localization of proteins whose localization is unknown. Various algorithms have been used to construct such models.
In the field of bioinformatics, there are several problems to resolve for predicting subcellular localization of proteins. First, there are generally too many classes (localization). According to Huh et al, 22 distinct localizations exist in budding yeast. Next, one protein may have multiple different localizations (Huh et al, 2003). This is referred to a multi-label classification problem and traditional classification algorithms have a limit on handling the multi-label problem well. Another problem is that there may be a higher dimensional feature space for prediction. More than tens of thousands features exist in some cases.

Another issue is that the data for the individual localizations are highly imbalanced. All these characteristics make the prediction difficult. More importantly, it is sometimes difficult to achieve sufficient prediction performance when only information about single proteins is used.
Recently, large-scale protein-protein interaction networks have been elucidated in yeast, fly, worm, and human. To interact physically, two proteins should localize to the same or adjacent subcellular compartments. This means that we can obtain useful information about a protein from its interacting neighbours, and thus that PPI networks can be used to improve localization prediction performance.

Table 4 summarizes previous studies that have used the features of single proteins. Studies on the prediction of subcellular localization show the following trends. The first is an increase in the number of localizations predicted. Initially, Nakashima & Nishikawa predicted whether a protein is intracellular or extracellular using amino acid (AA) and pair-coupled amino acid (PairAA) composition (Nakashima & Nishikawa, 1994). After their study, many studies tried to increase the number of distinct localizations predicted. For example, Gardy et al predicted five distinct subcellular localizations: 'cytoplasmic', 'inner membrane', 'periplasmic', 'outer membrane' and 'extra-cellular' (Gardy et al, 2003). Nair & Rost predicted ten distinct subcellular localizations (Nair & Rost, 2003). Chou & Cai predicted the 22 distinct subcellular localizations identified experimentally by Huh et al.

Single-protein feature based localization predictions
The second trend is the handling of the multi-label problem. A protein can localize to several subcellular locations. However, most of these studies did not consider multiple localizations, but rather assumed that a protein has a single representative localization. In addition, prediction accuracy is lower when the number of distinct localizations of a protein increases. Some researchers have tried to address this issue (Lee et al, 2006).
The third trend is the development of classification algorithms for more elaborate and efficient model construction. The least distance algorithm, artificial neural networks, the nearest neighbour approach, Markov models, Bayesian networks, and support vector machines (SVMs) have been used to achieve this goal, and some studies combined several algorithms. Lee et al. developed an algorithm that reflects the properties of the prediction task (Lee et al, 2006): an extended Density-induced Support Vector Data Description (D-SVDD) classification algorithm that handles class imbalance, high dimensionality, multiple labels, and many distinct classes. The classical D-SVDD algorithm can handle only one-class classification tasks, so Lee et al. extended it to handle multi-label classification tasks.

Network-based localization prediction
As mentioned earlier, two proteins that localize to the same or adjacent subcellular compartments tend to interact with each other. This means that each of two interacting proteins can serve as a localization tag for the other. Therefore, if a molecular network such as a PPI network is available, we may take advantage of it for the prediction. Several studies have tried to predict subcellular localization using network data. This section consists of two parts. The first is a brief explanation of the study by Lee et al. (Lee et al, 2008), which is the cornerstone of the network-based approach to localization prediction using a PPI network. We describe the methodology used in that study to generate feature vectors for a protein and introduce the DC-kNN classifier used for the prediction. The second part is a summary of the network-based approaches from the work of Lee et al. to the present.

Table 4. Summary of previous methods for prediction of protein subcellular location.

Reference | Algorithm | Features | No. of localizations | |
(Nakai & Kanehisa, 1991) | Expert Systems | SignalMotif | 4 | X | X
(Nakai & Kanehisa, 1992) | Expert Systems | AA, SignalMotif | 14 | X | X
(Nakashima & Nishikawa, 1994) | Scoring System | AA, diAA | 2 | X | X
(Cedano et al, 1997) | LDA using Mahalanobis distance | AA | 5 | X | X
(Reinhardt & Hubbard, 1998) | ANN Approach | AA | 3, 4 | X | X
(Chou & Elrod, 1999) | CDA | AA | 12 | X | X
(Yuan, 1999) | Markov Model | AA | 3, 4 | X | X
(Nakai & Horton, 1999) | k-NN approach | SignalMotif | 11 | X | X
(Emanuelsson et al, 2000) | Neural network | SignalMotif | 4 | X | X
(Drawid & Gerstein, 2000) | CDA | Gene Expression Pattern | 8 | X | X
(Drawid & Gerstein, 2000) | Bayesian Approach | SignalMotif, HDEL motif | 5, 6 | X | X
(Cai et al, 2000) | SVM | AA | 12 | X | X
(Chou, 2000) | Augmented CDA | AA, SOC factor | 5, 7, 12 | X | X

Generation of feature vectors
Lee et al. used and integrated three types of features to predict localization (Lee et al, 2008): single-protein features (S) and two kinds of network neighbourhood features (N and L).
Seven S features were based on a protein's primary sequence and its chemical properties. From the primary sequence, amino acid composition frequencies (AA), adjacent-pair amino acid frequencies (diAA), and pair-wise amino acid frequencies with a gap of length 1 (gapAA) were used. In addition, three kinds of chemical amino acid composition (chemAA) were generated from normalized hydrophobicity (HPo), hydrophilicity (HPil), and side-chain mass (SCM). These chemical properties were also combined into a pseudo-amino acid composition (pseuAA), which is another S feature vector. Occurrences of known signalling motifs in the primary protein sequence (Motif) were used as a further S feature. The last S feature encoded the functional annotations of the protein from Gene Ontology (GO) (Ashburner et al, 2000). Figure 1 provides an example.
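As an illustration of the sequence-derived S features, the following minimal Python sketch computes AA, diAA, and gapAA frequencies for one sequence. It is not the code of Lee et al.; the function names, the restriction to the 20 standard residues, and the normalization by the number of counted residues or pairs are assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes); other symbols are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """AA feature: frequency of each standard amino acid in the sequence."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    n = len(residues)
    return {a: (residues.count(a) / n if n else 0.0) for a in AMINO_ACIDS}


def pair_composition(seq, gap=0):
    """Frequency of residue pairs separated by `gap` positions.

    gap=0 counts adjacent pairs (diAA); gap=1 counts pairs with one
    residue in between (gapAA). Returns a 400-entry dictionary.
    """
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    step = gap + 1
    for i in range(len(seq) - step):
        pair = seq[i] + seq[i + step]
        if pair in counts:  # ignore pairs containing non-standard symbols
            counts[pair] += 1
            total += 1
    return {p: (c / total if total else 0.0) for p, c in counts.items()}


# Toy sequence, chosen only to make the numbers easy to check by hand.
aa = aa_composition("ACAC")            # A and C each have frequency 0.5
di = pair_composition("ACAC", gap=0)   # pairs: AC, CA, AC
gp = pair_composition("ACAC", gap=1)   # pairs: AA, CC
```

Concatenating the values of these dictionaries in a fixed key order gives fixed-length vectors (20 for AA, 400 each for diAA and gapAA) that can be used as sub-vectors of a protein's feature vector.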
N network features summarize the S features of a protein's neighbourhood. Knowledge about the neighbours of a protein comes from PPI data, which are pooled from various databases such as BioGRID, DIP (Salwinski et al, 2004) and SGD (Engel et al, 2010). L network features summarize the distribution of locations among the interacting neighbours. Figure 2A shows the relationship among the three PPI databases: each database covers a different part of all reported interactions. The diagonal pattern in Figures 2B-D shows that interacting protein pairs share similar localization information. For example, a protein in the "ER to Golgi" category tends to interact with other proteins localized in "ER to Golgi" more than with proteins in other localizations.
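The idea behind the L features can be sketched as follows. This is a simplified illustration rather than the exact encoding of Lee et al.: the function name, the use of plain fractions over annotated neighbours, and the protein identifiers P1 to P4 are all assumptions made for the example.

```python
from collections import defaultdict


def neighbour_location_features(edges, known_locations, all_locations):
    """For each protein, the fraction of its annotated interaction partners
    that are found in each location (one value per entry of all_locations).

    edges: iterable of (protein_a, protein_b) pairs from a PPI data set
    known_locations: dict mapping protein -> set of known localizations
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbours[a].add(b)
            neighbours[b].add(a)

    features = {}
    for protein, partners in neighbours.items():
        # Only partners with a known localization contribute. The protein's
        # own annotation is never used, so the vector can be computed for
        # proteins of unknown localization.
        annotated = [p for p in partners if p in known_locations]
        vector = []
        for loc in all_locations:
            hits = sum(1 for p in annotated if loc in known_locations[p])
            vector.append(hits / len(annotated) if annotated else 0.0)
        features[protein] = vector
    return features


# Hypothetical toy network: P1 is unannotated and has three partners.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4")]
known = {
    "P2": {"nucleolus"},
    "P3": {"nucleolus", "nucleus"},
    "P4": {"cytoplasm"},
}
locations = ["cytoplasm", "nucleus", "nucleolus"]
l_features = neighbour_location_features(edges, known, locations)
# l_features["P1"] is [1/3, 1/3, 2/3]: most partners are nucleolar.
```

In this toy case the largest entry of the vector for P1 corresponds to the nucleolus, which mirrors the diagonal pattern described above: a protein tends to share the localization of most of its interaction partners.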

Divide-and-Conquer k-Nearest Neighbour (DC-kNN) Classifier
Combining these features can produce very large, high-dimensional feature vectors. High-dimensional feature vectors generally cause problems such as the curse of dimensionality. In other words, higher-dimensional feature vectors usually require a correspondingly larger amount of training data, and they sometimes cause over-fitting to a given dataset (Guyon et al, 2002). Some feature vectors may also be useless for constructing a model of a specific localization, so the models for different subcellular localizations may require different sets of useful features. Therefore, selecting suitable feature vectors for each localization may be needed to construct robust and reliable prediction models.
To construct a prediction model, Lee et al. proposed the DC-kNN classifier, a variant of the k-nearest neighbours classification algorithm. A DC-kNN classifier tackles high-dimensional features in a divide-and-conquer manner. Briefly, DC-kNN has three main steps (Figure 3): dividing, choosing, and synthesizing. In the dividing step, the full feature vector is divided into m meaningful subsets. In the choosing step, the k nearest neighbours are chosen for each protein and for each sub-vector. In the synthesizing step, the kNN results of the m individual sets are synthesized to produce confidence scores using an average of the area under the ROC curve (AUC) for each localization. DC-kNN finds a suitable combination of feature sub-vectors for each label (localization) using a forward feature selection approach.

Fig. 3. Brief description of a DC-kNN (adapted from Lee et al, 2008).
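The three steps can be illustrated with a strongly simplified Python sketch. It is not the implementation of Lee et al.: here the confidence for a label is simply the fraction of the k nearest training proteins that carry it, averaged over the sub-vectors in use, whereas the AUC-based synthesis and the per-label forward selection of sub-vectors are only hinted at through the hypothetical `use` argument. All function names, the Euclidean distance, and the toy data are assumptions.

```python
import math


def knn_label_scores(query, train_vectors, train_labels, labels, k):
    """Fraction of the k nearest training proteins that carry each label."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: math.dist(query, train_vectors[i]))
    nearest = order[:k]
    return {lab: sum(lab in train_labels[i] for i in nearest) / len(nearest)
            for lab in labels}


def dc_knn_scores(query, train, train_labels, labels, k=3, use=None):
    """Simplified divide / choose / synthesize scoring.

    query: dict mapping sub-vector name -> list of floats
    train: list of dicts with the same sub-vector names
    train_labels: list of label sets, aligned with `train`
    use: sub-vector names to combine (all of them if None)
    """
    names = list(query) if use is None else list(use)
    per_subvector = []
    for name in names:                               # dividing
        vectors = [t[name] for t in train]
        per_subvector.append(                        # choosing
            knn_label_scores(query[name], vectors, train_labels, labels, k))
    # synthesizing: plain average of the per-sub-vector scores
    return {lab: sum(s[lab] for s in per_subvector) / len(per_subvector)
            for lab in labels}


# Hypothetical toy data with two sub-vectors, "AA" and "L".
train = [
    {"AA": [0.0, 0.0], "L": [1.0, 0.0]},
    {"AA": [0.0, 1.0], "L": [1.0, 0.0]},
    {"AA": [5.0, 5.0], "L": [0.0, 1.0]},
    {"AA": [5.0, 6.0], "L": [0.0, 1.0]},
]
train_labels = [{"nucleus"}, {"nucleus"}, {"cytoplasm"},
                {"cytoplasm", "nucleus"}]
labels = ["nucleus", "cytoplasm"]

query = {"AA": [0.0, 0.5], "L": [0.9, 0.1]}
scores = dc_knn_scores(query, train, train_labels, labels, k=2)
```

Because the scores are computed per label, a protein can receive high confidence for several localizations at once, which is how a scheme of this kind accommodates the multi-label setting. Choosing a different `use` list for each label would correspond, loosely, to selecting a separate combination of sub-vectors per localization.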

Results of location prediction
Lee et al. first compared the performance of DC-kNN for localization prediction with different feature sets: S features only, N features only, L features only, all features together (S+N+L), and random guesses. N and L features were generated using DIP (Salwinski et al, 2004). The performance in each case was evaluated by leave-one-out cross-validation (LOOCV) on proteins of Saccharomyces cerevisiae (n=3914) (Huh et al, 2003). Three performance metrics, Top-K, Total, and Balanced, were used to summarize the results of the 3914 LOOCV runs. The Top-K measure counts a prediction as correct if at least one of the real localizations of a protein is among the top-K predictions. The Total measure counts all correctly predicted localizations, based on the number of real localizations of each test protein. The Balanced measure calculates the average fraction of correctly predicted proteins over the localizations. Every classifier performed clearly better than random guessing (Figure 4A), and the combination of S, N, and L features showed the highest performance.
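The three measures can be made concrete with a short Python sketch. The text does not define them formally, so the sketch rests on one possible reading, stated here as an assumption: for a protein with n real localizations, its top-n ranked predictions are taken as its predicted set when computing Total and Balanced. The function names and the toy rankings are hypothetical.

```python
def top_k_accuracy(ranked, truth, k):
    """Fraction of proteins with at least one real localization
    among their top-k ranked predictions."""
    hits = sum(1 for p in truth if set(ranked[p][:k]) & truth[p])
    return hits / len(truth)


def total_accuracy(ranked, truth):
    """Correctly predicted localizations divided by all real localizations.
    Assumption: a protein with n real localizations is credited for each
    real localization found in its top-n predictions."""
    correct = sum(len(set(ranked[p][:len(truth[p])]) & truth[p])
                  for p in truth)
    return correct / sum(len(locs) for locs in truth.values())


def balanced_accuracy(ranked, truth):
    """Per-localization recall (same top-n rule), averaged over the
    localizations so that rare classes weigh as much as common ones."""
    per_loc = {}
    for p, locs in truth.items():
        predicted = set(ranked[p][:len(locs)])
        for loc in locs:
            hit, n = per_loc.get(loc, (0, 0))
            per_loc[loc] = (hit + (loc in predicted), n + 1)
    return sum(hit / n for hit, n in per_loc.values()) / len(per_loc)


# Hypothetical rankings (best first) and real localizations.
ranked = {
    "P1": ["nucleus", "cytoplasm", "ER"],
    "P2": ["cytoplasm", "nucleus", "ER"],
    "P3": ["ER", "nucleus", "cytoplasm"],
}
truth = {
    "P1": {"nucleus"},
    "P2": {"nucleus", "ER"},
    "P3": {"cytoplasm"},
}

top1 = top_k_accuracy(ranked, truth, 1)       # 1/3
top2 = top_k_accuracy(ranked, truth, 2)       # 2/3
total = total_accuracy(ranked, truth)         # 2/4 = 0.5
balanced = balanced_accuracy(ranked, truth)   # (1 + 0 + 0) / 3
```

In the toy data the nucleus is always recovered while ER and cytoplasm never are, so Total (0.5) is higher than Balanced (1/3). This shows why a balanced measure is informative when the data for the individual localizations are imbalanced.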
Figures 4A and 4B show that neighbourhood information acquired from a PPI database improves prediction performance. However, Figure 4C illustrates that acquiring more information does not always improve performance. On the contrary, additional information can decrease prediction performance. [...] Figures 5A and 5B). The correct predictions are mainly due to the fact that Lee et al. combined evidence from multiple interacting partners. For example, Noc4 interacts with many other proteins known to exist in the nucleolus, so we can assume that Noc4 localizes in or near the nucleolus. They confirmed this assumption using the network neighbours (Lee et al, 2008) (Figure 5C).
The numbers of known localizations and known PPIs for yeast proteins are larger than those for other organisms. In other words, less information on localization and protein interaction is available for some organisms, which might make PPI network-based localization prediction difficult. Lee et al. therefore evaluated their method using yeast data from which information was randomly removed (Lee et al, 2008), and obtained relatively robust results with less information. For example, the average number of neighbours of a protein is 27 in yeast and three in worm, a 9-fold decrease, yet the average AUC value decreased only from 0.94 (yeast) to 0.87 (worm) (Figure 6). In other words, their method can be applied not only to yeast but also to other species with less known localization and/or interaction information. Indeed, they predicted subcellular localization in fly, human, and Arabidopsis using protein interactions (Lee et al, 2008; Lee et al, 2010b). The results of both works showed that the prediction worked well for the other organisms and could find the real localizations of some unknown proteins (Figures 6-7).
They also compared DC-kNN with two popular earlier methods, ISort (Chou & Cai, 2005) and PSLT2 (Scott et al, 2005). ISort is a comprehensive sequence-based machine learning method that can predict more than 15 compartments. PSLT2 is an earlier method that used a protein interaction network to predict subcellular localization. DC-kNN was compared with ISort and PSLT2 using both the Total and the Balanced measures. As illustrated in Figure 8, DC-kNN outperformed both methods on both measures.

Other network-based methods
After the study of Lee et al. in 2008, several studies used network-based approaches to predict subcellular localization. Mintz-Oron et al. used a constraint-based method to predict the subcellular localization of enzymes based on their embedding metabolic network, relying on a parsimony principle of a minimal number of cross-membrane metabolite transporters (Mintz-Oron et al, 2009). They showed that their method outperformed pathway enrichment-based methods. Another group constructed a decision tree-based meta-classifier for the identification of essential genes (Acencio & Lemke, 2009). Their method relied on network topological features, cellular localization, and biological process information. Tung & Lee integrated various biological data sources to obtain information on neighbour proteins in a probabilistic gene network (Tung & Lee, 2009). [...] (Aranda et al, 2010) using the DC-kNN method, which had been proposed earlier and had shown good performance (Lee et al, 2010b). That study also showed that DC-kNN is applicable to other organisms. Kourmpetis et al. predicted the functions of proteins in Saccharomyces cerevisiae based on network data such as PPI data (Kourmpetis et al, 2010). They used a Bayesian Markov random field method and predicted the functions of 1170 un-annotated Saccharomyces cerevisiae proteins.

Conclusions
We reviewed PPI databases and the methods for detecting PPIs. Then, computational methods for protein function prediction were briefly reviewed. Finally, we discussed how the prediction of protein function, especially subcellular localization, shows outstanding performance when PPI data are used. This is because real biological functions are maintained through cascades of PPIs. Moreover, computational approaches are very promising compared with experimental identification, especially for correcting false readings. Functional genomics is an ongoing field in systems biology, and it must be done well to drive further progress. We still face other issues, notably the lack of conditional protein interactomes. So far, we have identified and accumulated only static information at the molecular level in cells to build a scaffold of cellular systems. Computational methods should be applied to conditional analysis when sufficient data become available, and the next field of application would be personalized medicine, such as early diagnosis with specific markers and treatment with specific drug targets.