Mining Protein Interaction Groups

Interacting proteins carry out most biological functions within living cells, such as gene expression, enzymatic reactions, signal transduction, inter-cellular communication and immunoreactions. Because interactions are mediated by short sequences of residues within the long stretches of interacting sequences, these interacting residues, the so-called interaction (binding) sites, are central to proteome research. Although many wet-lab techniques, such as X-ray crystallography, nuclear magnetic resonance spectroscopy, electron microscopy and mass spectrometry, have been developed to determine protein interaction sites, the sites solved so far constitute only a tiny proportion of the whole population because of high cost and low throughput. Computational methods are therefore still considered major approaches for a deep understanding of protein binding sites, especially for subtle 3-dimensional structural properties that are not accessible by experimental methods.

(possible interaction sites) by multiple sequence alignments within each protein group. Thus, those conserved motifs can be paired with motifs identified from other protein groups to model protein interaction sites. One novel aspect of the algorithm in (Li et al., 2006) is that it combines two types of data, the PPI data and the associated sequence data, to model binding motif pairs. Each protein in the above PPI networks is represented by a vertex, and every interaction between two proteins is represented by an edge. Discovering complete bipartite subgraphs in PPI networks can thus be formulated as the following biclique problem: given a graph, find a subgraph that is bipartite and complete. The objective is to maximize the number of vertices or edges in the complete bipartite subgraph. We note that the maximum vertex biclique problem is solvable in polynomial time (Yannakakis, 1981). This problem is also equivalent to the maximum independent set problem on bipartite graphs, which is known to be solvable by a minimum cut algorithm. However, the maximum vertex balanced biclique problem is NP-hard (Garey & Johnson, 1979). The maximum edge biclique problem has been proved to be NP-hard as well (Peeters, 2003).
In this paper, we take the incompleteness of biological data into account, as the interaction data of PPI networks is usually not fully available. Moreover, within an interacting protein group pair, some proteins in one group may interact with only a proportion of the proteins in the other group. As a result, many subgraphs formed by interacting protein group pairs are not perfect bicliques; more often they are nearly complete bipartite subgraphs, and methods that search for bicliques may miss many useful interacting protein group pairs. To deal with this problem, we use quasi-bicliques instead of bicliques to find interacting protein group pairs. With a quasi-biclique, we can still find the two interacting protein groups even though some interactions are missing from a protein interaction subnetwork. We introduce and investigate the maximum vertex quasi-biclique problem and show that it is NP-hard. We also propose approximation and heuristic algorithms for finding large quasi-bicliques in PPI networks, and illustrate their application to finding protein-protein binding sites.

Bicliques and quasi-bicliques
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected graph, where each vertex represents a protein and an edge connects two vertices if the two proteins interact. Since $\mathcal{G}$ is undirected, any edge $(u, v) \in \mathcal{E}$ implies $(v, u) \in \mathcal{E}$. For a selected edge $(u, v)$ in $\mathcal{G}$, in order to find the two groups of proteins having similar pairs of binding sites, we translate $\mathcal{G}$ into a bipartite graph. Let $X = \{x \mid (x, v) \in \mathcal{E}\}$, $Y_1 = \{y \mid (u, y) \in \mathcal{E} \text{ and } y \notin X\}$ and $Y_2 = \{w \mid (u, w) \in \mathcal{E} \text{ and } w \in X\}$. A vertex $w \in Y_2$ is incident to both $u$ and $v$ in $\mathcal{G}$; thus both $X$ and $Y_2$ contain $w$. We keep $w$ in $X$ and replace $w$ in $Y_2$ with a new virtual vertex $w'$. After replacing every vertex $w$ in $Y_2$ with $w'$, we get a new vertex set $Y_2'$. Let $Y = Y_1 \cup Y_2'$ and $E = \{(x, y) \mid (x, y) \in \mathcal{E},\ x \in X,\ y \in Y_1\} \cup \{(x, w') \mid (x, w) \in \mathcal{E},\ x \in X,\ w \in Y_2\}$. In this way, we have a bipartite graph $G = (X \cup Y, E)$. A biclique in $G$ corresponds to two subsets of vertices, say subset $A$ and subset $B$, in $\mathcal{G}$. In $\mathcal{G}$, every vertex in $A$ is adjacent to all the vertices in $B$, and every vertex in $B$ is adjacent to all the vertices in $A$. Moreover, $A \cap B$ may not be empty. In this case, for any vertex $w \in A \cap B$, $(w, w) \in \mathcal{E}$. This is the case where the protein has a self-loop. Self-loops are very common in practice. When a self-loop appears, one protein molecule interacts with another identical protein molecule. For example, two identical protein subunits can assemble together to form a homodimeric protein.
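As an illustration of the translation described above, the following Python sketch builds the bipartite graph for a selected edge $(u, v)$. The function name `build_bipartite`, the edge-list input, the toy network, and the encoding of a virtual vertex $w'$ as the pair `(w, True)` are assumptions made for this example; they are not taken from the original implementation.

```python
def build_bipartite(edges, u, v):
    """Translate a PPI edge list into the bipartite graph for edge (u, v)."""
    # Undirected adjacency; a self-loop (w, w) puts w in its own neighbour set.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    X = set(adj.get(v, set()))                  # X: all neighbours of v
    neighbours_u = adj.get(u, set())
    Y1 = {y for y in neighbours_u if y not in X}
    Y2 = {w for w in neighbours_u if w in X}    # adjacent to both u and v

    # Assumed encoding: (name, False) is an original vertex,
    # (name, True) is the virtual copy w' of a vertex w in Y2.
    Y = {(y, False) for y in Y1} | {(w, True) for w in Y2}
    E = {(x, (name, virtual))
         for x in X
         for (name, virtual) in Y
         if name in adj[x]}
    return X, Y, E


# Hypothetical toy network; protein "a" has a self-loop.
edges = [("u", "v"), ("u", "a"), ("v", "a"), ("v", "b"), ("u", "c"), ("a", "a")]
X, Y, E = build_bipartite(edges, "u", "v")
```

In this toy network, protein `a` lies on both sides (as `a` in $X$ and as its virtual copy in $Y$), and its self-loop shows up as the edge between the two copies.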
In the following, we focus on the bipartite graph $G = (X \cup Y, E)$. For a vertex $x \in X$ and a vertex set $Y' \subseteq Y$, the degree of $x$ in $Y'$ is the number of vertices in $Y'$ that are adjacent to $x$, denoted by $d(x, Y') = |\{y \mid y \in Y' \text{ and } (x, y) \in E\}|$. Similarly, for a vertex $y \in Y$ and $X' \subseteq X$, we use $d(y, X')$ to denote $|\{x \mid x \in X' \text{ and } (x, y) \in E\}|$. Now, we are ready to define the $\delta$-quasi-biclique.
Similarly, a $\delta$-quasi-biclique in $G$ corresponds to two subsets of vertices, say subset $A$ and subset $B$, in $\mathcal{G}$. In $\mathcal{G}$, every vertex in $A$ is adjacent to at least $(1 - \delta)|B|$ vertices in $B$, and every vertex in $B$ is adjacent to at least $(1 - \delta)|A|$ vertices in $A$. Moreover, according to the translation and the definition, $A \cap B$ may not be empty. Again, if a protein appears on both sides of a $\delta$-quasi-biclique and there is an edge between the two corresponding vertices, the protein has a self-loop. In our experiments, we observe that about 22% of the $\delta$-quasi-bicliques produced by our program contain self-loop proteins.
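The degree condition stated above can be checked directly. The sketch below is a minimal illustration, assuming the edges are given as a set of `(a, b)` pairs with `a` from $A$ and `b` from $B$; the function name and the toy data are hypothetical.

```python
def is_quasi_biclique(A, B, edges, delta):
    """Check the delta-quasi-biclique degree condition for the pair (A, B)."""
    need_in_b = (1 - delta) * len(B)   # minimum degree required of each a in A
    need_in_a = (1 - delta) * len(A)   # minimum degree required of each b in B
    a_ok = all(sum((a, b) in edges for b in B) >= need_in_b for a in A)
    b_ok = all(sum((a, b) in edges for a in A) >= need_in_a for b in B)
    return a_ok and b_ok


# Hypothetical example: one edge, (1, "d"), is missing from a complete 2 x 4 biclique.
A = {1, 2}
B = {"a", "b", "c", "d"}
edges = {(1, "a"), (1, "b"), (1, "c"),
         (2, "a"), (2, "b"), (2, "c"), (2, "d")}
```

With $\delta = 0$ the function tests for an exact biclique, so the single missing edge makes the check fail, while $\delta = 1/2$ tolerates it.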
In many applications, due to various reasons, some edges in a clique/biclique may be missing and a clique/biclique becomes a quasi-clique/quasi-biclique. Thus, finding quasi-cliques/quasi-bicliques is more important in practice. Here we show that large quasi-bicliques may not contain any large bicliques.
Theorem 1. Let $G = (X \cup Y, E)$ be a random graph with $|X| = |Y| = n$, where for each pair of vertices $x \in X$ and $y \in Y$, $(x, y)$ is chosen, randomly and independently, to be an edge in $E$ with probability $\frac{2}{3}$. When $n \to \infty$, with high probability, $G$ is a $\frac{1}{2}$-quasi-biclique, and $G$ does not contain
In the biological context, Theorem 1 indicates that some large interacting protein groups may not be obtainable by simply finding a maximal biclique if a few (interaction) edges are missing. As large interacting protein groups are more useful, according to this theorem, we have to develop new computational algorithms to extract from PPI networks large interacting protein groups which form quasi-bicliques.
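The first claim of Theorem 1 (a random bipartite graph with edge probability 2/3 is a 1/2-quasi-biclique with high probability) can be explored empirically. The following sketch is only a small simulation under assumed parameters (n = 300, a fixed seed); it does not replace the proof and says nothing about the second claim of the theorem.

```python
import random


def random_bipartite(n, p, seed=0):
    """n x n 0/1 adjacency matrix; each entry is 1 independently with probability p."""
    rng = random.Random(seed)
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


def min_degree_fraction(adj):
    """Smallest degree over all vertices on both sides, divided by n."""
    n = len(adj)
    row_min = min(sum(row) for row in adj)
    col_min = min(sum(adj[i][j] for i in range(n)) for j in range(n))
    return min(row_min, col_min) / n


# The whole graph is a 1/2-quasi-biclique exactly when every vertex is
# adjacent to at least half of the other side.
fraction = min_degree_fraction(random_bipartite(300, 2 / 3))
is_half_quasi_biclique = fraction >= 0.5
```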
In terms of false positive edges, both quasi-bicliques and bicliques handle spurious edges well. If only a few spurious edges are added, in most cases an unrelated protein will not be included in the quasi-biclique or biclique unless $(1 - \delta)|A|$ spurious edges are simultaneously added to a protein that has no interaction with any of the proteins in $A$, where $A$ is one of the two interaction groups.
The maximum vertex quasi-biclique problem is defined as follows.
The maximum vertex biclique problem, where $\delta = 0$, can be solved in polynomial time (Yannakakis, 1981). Here we show that the maximum vertex $\delta$-quasi-biclique problem is NP-hard when $\delta > 0$. The reduction is from X3C (Exact Cover by 3-Sets), which is known to be NP-hard (Karp, 1972).
Theorem 2. For any constant integers $p > 0$ and $q > 0$ such that $0 < \frac{p}{q} \le \frac{1}{2}$, the maximum vertex $\frac{p}{q}$-quasi-biclique problem is NP-hard.

A polynomial time approximation scheme
The following lemma, originally from (Li et al., 2002), will be used repeatedly in our proofs.

The Main Ideas and Techniques:
The problem can be formulated as a quadratic programming problem. We use a random sampling technique and a randomized rounding method to get a good approximate solution for the quadratic programming problem under the conditions that $|X_{opt}| = \Omega(|X|)$ and $|Y_{opt}| = \Omega(|Y|)$. The random sampling technique randomly selects $r_1 = \Omega(\log |X_{opt}|)$ vertices from $X_{opt}$ even though $X_{opt}$ is not known. This can be done when $|X_{opt}| = \Omega(|X|)$ and $|Y_{opt}| = \Omega(|Y|)$.
Let $G = (X \cup Y, E)$ be the input bipartite graph. Let $X_{opt} \subseteq X$ and $Y_{opt} \subseteq Y$ form an optimal solution for the maximum vertex quasi-biclique problem. Without loss of generality, we can assume that
The basic idea of our algorithm is to (i) formulate the problem as a quadratic programming problem and (ii) use a random sampling approach to approximately solve it. In order to make the random sampling approach work, we have to make sure that (1) $|X_{opt}| = \Omega(|X|)$ and (2) $|Y_{opt}| = \Omega(|Y|)$. However, for an arbitrary input bipartite graph $G = (X \cup Y, E)$, there is no guarantee that (1) and (2) hold. Here we propose a method to find a subset $X'$ of $X$ and $Y'$ of $Y$ such that for any
If we can obtain this kind of $X'$ and $Y'$, then we can work on the induced bipartite graph $G' = (X' \cup Y', E')$. Obviously, any good approximate solution of $G'$ is also a good approximate solution of $G$.
Let $x_i$ be a vertex in the bipartite graph $G = (X \cup Y, E)$. Define $D(x_i, Y)$ to be the set of vertices in $Y$ that are incident to $x_i$. The following lemma tells us how to obtain $X'$ and $Y'$.
Though we do not know which $k$ vertices in $X$ we should choose, we can try all possible size-$k$ subsets of $X$ in $O(|X|^k)$ time for constant $k$. The value of $k$ is $\lceil \delta t \rceil$ and is determined by $t$ later. Thus, from now on, we assume that the $k$ vertices $x_1, x_2, \ldots, x_k$ are known. Let
We will focus on finding a quasi-biclique in the subgraph $G'$. From now on, we will try to find a good approximate solution for $X'_{opt}$ and $Y'_{opt}$. If $|X'_{opt}|$ and $|Y'_{opt}|$ are approximately the same, then we have $|X'_{opt}| = \Omega(|X'|)$ and $|Y'_{opt}| = \Omega(|Y'|)$. That is, (1) and (2) hold for graph $G'$. Therefore, we can use the quadratic programming approach to solve the problem. Nevertheless, there is no guarantee that $|X'_{opt}|$ and $|Y'_{opt}|$ are approximately the same. For any $\epsilon > 0$, we consider two cases.
In this case, the number of vertices in $Y'_{opt}$ will dominate the size of the whole quasi-biclique. If we select a vertex $x \in X'_{opt}$, then $x$ and $D(x, Y')$ form a biclique of size at least 1
When the value of $\delta$ is big with respect to $\epsilon$, we do not have the desired quasi-biclique. If we try to add more vertices from $Y'$, we have to guarantee that every selected vertex $y$ in $Y'$ is incident to at least $(1 - \delta)|X'|$ selected vertices in $X'$. This is impossible if $x$ is the only selected vertex from $X'$. Therefore, we have to consider adding more vertices from both $X'$ and $Y'$. It is clear that the task here is non-trivial.
In the following lemma, we will show that there exists a subset of $r$ vertices (for some constant
Here $r$ and $\epsilon''$ are closely related.

Based on Lemma 3, we can design an algorithm that finds a quasi-
Here we do not know $|Y'_{opt}|$. However, we can guess the value of $|Y'_{opt}|$ by trying $|Y'_{opt}| = 1, 2, \ldots, |Y'|$. The integer programming problem formulated by (3) and (4) has no objective function; we just want a feasible solution satisfying (3) and (4). The integer programming problem is hard to solve. However, we can obtain a fractional solution $\bar{y}_i$ for (3) and (4) with $0 \le \bar{y}_i \le 1$ in polynomial time. After obtaining the fractional solution $\bar{y}_i$, we randomly set $y_i$ to be 1 with probability $\bar{y}_i$.

Lemma 4. Assume that
With probability at least $1 - \frac{1}{r}$, we can get a pair of subsets $X_A \subseteq X'$ and $Y_A \subseteq Y'$ (an integer solution) by randomized rounding according to the probabilities $\bar{y}_i$ such that $X_A$ and $Y_A$ form a quasi-$(\delta + 4\epsilon')$-biclique with
A standard method in (Li et al., 2002) can give a de-randomized algorithm.
We also solve the integer linear programming (3) and (4) in the same way as in Case 1.1. The algorithm for Case 1 is given in Fig. 1.
We will use a quadratic programming approach to solve the problem. We can formulate the quasi-biclique problem for the bipartite graph $G' = (X' \cup Y', E')$ as the following quadratic programming problem.

Quadratic programming formulation:
Let $x_i$ and $y_j$ be 0/1 variables, where $x_i = 1$ indicates that vertex $v_i$ in $X'$ is in the quasi-biclique and $y_j = 1$ indicates that vertex $u_j$ in $Y'$ is in the quasi-biclique. Define $e_{i,j} = 1$ if $(v_i, u_j) \in E'$ and $e_{i,j} = 0$ otherwise. Let $c_1$ and $c_2$ be two integers representing the sizes of $X'_{opt}$ and $Y'_{opt}$, respectively. Although we do not know $c_1$ and $c_2$, we can guess their values in polynomial time. We have the following inequalities:
(5) and (6) indicate that $x_i > 0$ and $y_i > 0$ imply that $\sum$
Let $\hat{x}_i$ and $\hat{y}_j$ be the 0/1 integer solution for the quadratic programming problem (5)-(8). Let
To deal with the quadratic programming problem, the key idea here is to estimate the values of $\hat{r}_i$ and $\hat{s}_j$. If we know the values of $\hat{r}_i$ and $\hat{s}_j$, then (5) and (6) become
where $\hat{r}_i$ and $\hat{s}_i$ in (9) and (10) are constants and the quadratic inequalities become linear inequalities.
The approach for obtaining good estimates of $\hat{r}_i$ and $\hat{s}_i$ is to randomly and independently select subsets $B_{X'}$ and $B_{Y'}$ of $k$ vertices from $X'_{opt}$ and $Y'_{opt}$, respectively. Let $c_1 = |X'_{opt}|$ and $c_2 = |Y'_{opt}|$. We do not know $c_1$ and $c_2$, but we can guess them in $O(|X'| \times |Y'|)$ time. Then we can use $\frac{c_1}{k} \sum_{v_j \in B_{X'}} e_{i,j}$ and $\frac{c_2}{k} \sum_{u_j \in B_{Y'}} e_{i,j}$ to estimate $\hat{r}_i$ and $\hat{s}_i$, respectively. Since we do not know $X'_{opt}$ and $Y'_{opt}$, it is not easy to randomly and independently select vertices from $X'_{opt}$ and $Y'_{opt}$. We develop a method to randomly
Here $p$ is a constant to be determined later.
The idea here is to randomly and independently select a subset $B$ of $(c + 1) \times p \times \log |Y'|$ vertices from $Y'$ and enumerate all subsets of $B$ of size $p \times \log |Y'|$ in time
We can show that with high probability, we can get a set of $p \log |Y'|$ vertices randomly and independently selected from $Y'_{opt}$.

Lemma 5. With probability at least $1 - |Y'|^{-\frac{p}{2c^2(c+1)}}$, $B$ contains a subset of $Y'_{opt}$ of size $p \log |Y'|$.
Proof. Let us consider the probability that $B$ contains fewer than $p \log |Y'|$ vertices in $Y'_{opt}$. Recall that $|Y'| = c|Y'_{opt}|$. If we randomly select a vertex in $Y'$, the probability that the vertex is in $Y'_{opt}$ is $\frac{1}{c}$. Let $\mu$ be the expected number of vertices in $B$ that are in $Y'_{opt}$. We have $\mu = \frac{|B|}{c} = \frac{1}{c} \lceil (c + 1) p \log |Y'| \rceil$. Let $X_1, X_2, \ldots, X_{|B|}$ be $|B|$ independent random 0/1 variables, where $X_i = 1$ with probability $\frac{1}{c}$ and
Since we selected $(c + 1) p \log |Y'|$ vertices,
Based on Lemma 1, we have
(from (11), (12) and (13)). Therefore, with probability at most $|Y'|^{-\frac{p}{2c^2(c+1)}}$, $B$ does not contain any subset of $Y'_{opt}$ of size $p \log |Y'|$. This completes the proof.
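The bound in the proof can be recovered by a short calculation. The derivation below is a sketch under two assumptions that are not stated in the surviving text: that Lemma 1 is the Chernoff-type bound $\Pr(X < \mu - \varepsilon n) \le \exp(-\frac{1}{2}\varepsilon^2 n)$ for a sum $X$ of $n$ independent 0/1 variables with mean $\mu$, and that $\log$ denotes the natural logarithm. The ceiling in $|B|$ is ignored.

```latex
% Sketch only; the form of Lemma 1 and the natural logarithm are assumptions.
\begin{align*}
n &= |B| = (c+1)\,p \log |Y'|, \qquad \mu = \frac{n}{c}, \\
\mu - p \log |Y'| &= \frac{p \log |Y'|}{c} = \varepsilon n
  \quad\text{with}\quad \varepsilon = \frac{1}{c(c+1)}, \\
\Pr\bigl(X < p \log |Y'|\bigr)
  &\le \exp\!\Bigl(-\tfrac{1}{2}\,\varepsilon^{2} n\Bigr)
   = \exp\!\Bigl(-\frac{p \log |Y'|}{2c^{2}(c+1)}\Bigr)
   = |Y'|^{-\frac{p}{2c^{2}(c+1)}}.
\end{align*}
```

Under these assumptions the exponent agrees with the failure probability stated at the end of the proof.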
Let $B_{X'}$ and $B_{Y'}$ be the sets of randomly and independently selected vertices in $X'_{opt}$ and $Y'_{opt}$. Let $|B_{X'}| = p_1 \log |X'|$ and $|B_{Y'}| = p_2 \log |Y'|$. We define $r_i = \sum_{v_j \in B_{X'}} e_{i,j}$ and $s_i = \sum_{u_j \in B_{Y'}} e_{i,j}$. The following lemma shows that $\frac{c_1}{|B_{X'}|} r_i$ and $\frac{c_2}{|B_{Y'}|} s_i$ are good approximations of $\hat{r}_i$ and $\hat{s}_i$.
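The scaling used in these estimates can be illustrated with a small experiment. The sketch below assumes, for demonstration only, that the optimal set is known so that it can be sampled directly (in the algorithm itself the sample is obtained indirectly); the function name, the set sizes and the seed are hypothetical.

```python
import random


def estimate_degree(neighbours, sample, c):
    """Estimate a vertex's degree inside a set of size c from a random sample of it."""
    hits = sum(1 for v in sample if v in neighbours)
    return c * hits / len(sample)


# Hypothetical data: an "optimal" side with 1000 vertices, of which one fixed
# vertex on the other side is adjacent to exactly 700.
rng = random.Random(1)
opt_side = list(range(1000))
neighbours = set(range(700))

# Independent sampling (with replacement) of 200 vertices.
sample = rng.choices(opt_side, k=200)
estimate = estimate_degree(neighbours, sample, len(opt_side))
```

The estimate concentrates around the true degree (700 here) as the sample grows, which is the behaviour the lemma quantifies.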

The term $(1 - \epsilon)$ in (14) and (15) ensures that the quadratic programming problem has a solution when the estimated values of $r_i$ and $s_i$ are smaller than $\hat{r}_i$ and $\hat{s}_i$. Similarly, the term $(1 + \epsilon)$ in (18) and (19) ensures that the quadratic programming problem has a solution when the estimated values of $r_i$ and $s_i$ are bigger than $\hat{r}_i$ and $\hat{s}_i$.

Randomized rounding
Let $x'_i$ and $y'_j$ be a fractional solution for (14)-(19). In order to get a 0/1 solution, we randomly set $x_i$ and $y_j$ to be 1 using the fractional solution as the probability. That is, we set $x_i$ and $y_j$ to 1 with probability $x'_i$ and $y'_j$, respectively, and to 0 otherwise.
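The rounding step itself is simple to state in code. The following is a minimal sketch; the function name and the fractional values are made up for illustration, and the step that produces the fractional solution is not shown.

```python
import random


def randomized_round(fractional, rng):
    """Set each variable to 1 with probability equal to its fractional value."""
    return [1 if rng.random() < value else 0 for value in fractional]


rng = random.Random(0)
x_fractional = [0.0, 1.0, 0.5, 0.9]   # hypothetical fractional solution
x_rounded = randomized_round(x_fractional, rng)
```

Variables with fractional value 0 or 1 are never changed by the rounding, since `rng.random()` returns a value in [0, 1).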
The complete algorithm for Case 2 is given in Fig. 2. Let k = ⌈δt⌉ as defined in Lemma 2.

Theorem 4. With probability at least 1 − o(1), Algorithm 2 finds a quasi-(δ
We can derandomize the algorithm to get a polynomial time deterministic algorithm. Step 3 can be derandomized by using the standard method. For instance, instead of randomly and independently choosing $p_1 \log(|X'|)$ and $p_2 \log(|Y'|)$ vertices from $X'$ and $Y'$, we can pick the vertices encountered on a random walk of the same length on a constant degree expander. Obviously, the number of such random walks on a constant degree expander is polynomial. Thus, by enumerating all random walks of length $p_1 \log(|X'|)$ and $p_2 \log(|Y'|)$, we have a polynomial time deterministic algorithm.

Theorem 5. There exists a polynomial time approximation scheme that outputs a quasi-biclique X
vertices in $X_A$ for any $\epsilon > 0$, where $X_{opt}$ and $Y_{opt}$ form the optimal solution.

The heuristic algorithm
In practice, we need to find large quasi-bicliques in PPI networks. Here, we propose a heuristic algorithm to find them. Consider a PPI network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Our heuristic algorithm has two steps. First, we construct a bipartite graph from $\mathcal{G}$ based on a pair of interacting proteins $(u, v)$. Using the method described at the beginning of Section 2, we get a bipartite graph $G = (X \cup Y, E)$. Second, we find quasi-bicliques in $G$. The bipartite graph $G$ contains all proteins that interact with $u$ or $v$, so we can find large quasi-bicliques containing $u$ and $v$ in the bipartite graph.
In the algorithm for finding quasi-bicliques in $G$, we have two parameters $\delta$ and $\tau$, which control the quality and sizes of the quasi-bicliques. We use a greedy method to get the seeds for finding large quasi-bicliques in $G$. At the beginning, we set $X' = \emptyset$ and $Y' = Y$. In each step, we find a vertex with the maximum degree in $X - X'$. The vertex is added into the biclique vertex set $X'$, and we eliminate all vertices $y$ in $Y'$ such that $d(y, X') < (1 - \delta)|X'|$. We continue this process until the size of $Y'$ is less than $\tau$. At each step, we get a seed for finding large quasi-bicliques.
The seeds may miss some possible vertices in the quasi-bicliques, so we extend the seeds to find larger quasi-bicliques. Let $X'' = X'$ and $Y'' = Y'$ be a pair of seed vertex sets. In the first step, we find a vertex $x$ in $X - X''$ with the largest degree $d(x, Y'')$. If $d(x, Y'') \ge (1 - \delta)|Y''|$, we add the vertex $x$ to $X''$. In the second step, we find a vertex $y$ in $Y - Y''$ with the largest $d(y, X'')$. If $d(y, X'') \ge (1 - \delta)|X''|$, we add the vertex $y$ to $Y''$. We repeat the above two steps until no vertex can be added. The whole algorithm is shown in Fig. 3. We can also exchange the two vertex sets $X$ and $Y$ to find more quasi-bicliques using the algorithm.
Let $n$ be the number of vertices in the bipartite graph $G$. In the greedy algorithm, the time complexity of Steps 3-5 and Step 10 is $O(n)$, and the time complexity of Steps 6-9 is $O(n^2)$. So the time complexity of Steps 3-10 is dominated by $O(n^2)$. Since Steps 3-10 are repeated $O(n)$ times, the time complexity of the whole algorithm is $O(n^3)$.
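Fig. 3. The greedy algorithm (partial listing). Step 8: find the vertex $y \in Y - Y''$ with the maximum degree $d(y, X'')$; if $d(y, X'') \ge (1 - \delta)|X''|$, add $y$ to $Y''$. Step 9: repeat until no vertex is added in Steps 7 and 8.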
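The seed-and-extend procedure described in this section can be sketched in Python as follows. This is an illustrative reading of the prose, not the PPIExtend implementation: the function names, the adjacency representation and the toy graph are assumptions, "maximum degree" in the seeding phase is taken to mean degree with respect to the current $Y'$, and ties are broken by sorted vertex name to keep the run deterministic.

```python
def adjacency(edges):
    """Neighbour sets for a bipartite edge list of (x, y) pairs."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj


def d(vertex, group, adj):
    """Degree of `vertex` inside the vertex set `group`."""
    return len(adj.get(vertex, set()) & group)


def greedy_seeds(X, Y, adj, delta, tau):
    """Grow X' one vertex at a time, pruning Y'; record a seed after each step."""
    seeds = []
    x_sel, y_sel = set(), set(Y)
    while x_sel != X:
        # Assumption: the degree is measured against the current Y'.
        best = max(sorted(X - x_sel), key=lambda v: d(v, y_sel, adj))
        x_sel.add(best)
        need = (1 - delta) * len(x_sel)
        y_sel = {y for y in y_sel if d(y, x_sel, adj) >= need}
        if len(y_sel) < tau:
            break
        seeds.append((set(x_sel), set(y_sel)))
    return seeds


def extend(seed_x, seed_y, X, Y, adj, delta):
    """Alternately try to add the best remaining x and y until neither qualifies."""
    x2, y2 = set(seed_x), set(seed_y)
    added = True
    while added:
        added = False
        candidates = sorted(X - x2)
        if candidates:
            x = max(candidates, key=lambda v: d(v, y2, adj))
            if d(x, y2, adj) >= (1 - delta) * len(y2):
                x2.add(x)
                added = True
        candidates = sorted(Y - y2)
        if candidates:
            y = max(candidates, key=lambda v: d(v, x2, adj))
            if d(y, x2, adj) >= (1 - delta) * len(x2):
                y2.add(y)
                added = True
    return x2, y2


# Hypothetical toy bipartite graph.
X = {"x1", "x2", "x3"}
Y = {"y1", "y2", "y3"}
adj = adjacency([("x1", "y1"), ("x1", "y2"), ("x1", "y3"),
                 ("x2", "y1"), ("x2", "y2"), ("x3", "y3")])
seeds = greedy_seeds(X, Y, adj, delta=0.0, tau=2)
extended = extend(*seeds[-1], X, Y, adj, 0.5)
```

On this toy graph the last seed is the exact biclique on `x1, x2` and `y1, y2`; extending it with $\delta = 0.5$ admits `y3`, which is adjacent to only one of the two selected vertices.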

Finding motifs from the multiple sequence alignment of computed δ-bicliques.
We implemented the heuristic algorithm described in the last section in Java. The software is called PPIExtend. In the implementation, we added a new parameter $\alpha$ to speed up the algorithm. In Step 3, instead of selecting one vertex with the best degree, we can select the best $\alpha$ vertices in $X - X'$ and add all $\alpha$ vertices into $X'$ in Step 4. As shown in the last step of the algorithm, some vertices in $X''$ may be adjacent to fewer than $(1 - \delta)|Y''|$ vertices in $Y''$, but the average degree of the vertices in $X''$ is no less than $(1 - \delta)|Y''|$. Similarly, some vertices in $Y''$ may be adjacent to fewer than $(1 - \delta)|X''|$ vertices in $X''$, but the average degree of the vertices in $Y''$ is no less than $(1 - \delta)|X''|$. In our experiments, these quasi-bicliques are still output in order to obtain more useful quasi-bicliques.
Our algorithm for PPIExtend consists of two steps: (i) find interacting protein group pairs (quasi-bicliques) using the greedy algorithm, and (ii) find conserved motifs from multiple sequence alignments for each of the protein groups. (We use the existing multiple sequence alignment software PROTOMAT (Pietrokovski, 1996).) A motif found by PROTOMAT can be viewed as a block, that is, a conserved region in a multiple sequence alignment of the proteins in a group. For each biclique $X$ and $Y$ obtained by the greedy algorithm, we use $S_X$ and $S_Y$ to denote the sets of motifs obtained by the multiple sequence alignments of protein sequences in $X$ and $Y$, respectively. Any pair of motifs $(m_1, m_2)$ with $m_1 \in S_X$ and $m_2 \in S_Y$ is a candidate protein-protein interaction motif pair. Thus, our algorithm can also output many motif pairs as candidate protein-protein interaction motif pairs.
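The pairing of motifs from the two groups amounts to a Cartesian product of the two motif sets. The sketch below shows this with placeholder motif names; the motif discovery itself (done by PROTOMAT in the chapter) is not reproduced, and the function name is hypothetical.

```python
from itertools import product


def candidate_motif_pairs(motifs_x, motifs_y):
    """All pairs (m1, m2) with m1 from one group's motifs and m2 from the other's."""
    return list(product(motifs_x, motifs_y))


# Placeholder motif identifiers standing in for the blocks of each protein group.
S_X = ["motif_x1", "motif_x2"]
S_Y = ["motif_y1", "motif_y2", "motif_y3"]
pairs = candidate_motif_pairs(S_X, S_Y)
```

The number of candidate pairs is $|S_X| \times |S_Y|$, which is why a single quasi-biclique can yield many candidate motif pairs.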
We look at the numbers of motifs found by the programs PPIExtend and FPClose* that are also in the two block databases, BLOCKS (Pietrokovski, 1996) and PRINTS (Attwood & Beck, 1994). The LAMA program (Pietrokovski, 1996) is used to find the locally optimal alignment of two blocks (a motif output by PPIExtend/FPClose* and a block in the databases), where a Z-score is computed to measure the alignment. The default Z-score threshold was used in the experiments. The results are reported in Table 1. From this table, we can see that our method has more mappings to BLOCKS and PRINTS than FPClose* (Li et al., 2006; Grahne & Zhu, 2003).
We also look at the numbers of motif pairs found by the two programs PPIExtend and FPClose* that can be mapped into domain-domain interaction pairs in the domain-domain interaction database iPfam (Finn et al., 2005). The versions of the databases are shown in Table 2. The iPfam database, which stores information on protein domain-domain interactions, is built on top of the Pfam database (Sonnhammer et al., 1997). To examine whether the motif pairs found by PPIExtend and FPClose* match pairs of interacting domains in iPfam, we map our motif pairs to domain pairs in iPfam through the integrated protein family database InterPro (Apweiler et al., 2001), which integrates a number of databases. We strictly follow the procedure suggested in (Li et al., 2006): (1) we map our motifs to domains (protein groups) in the database BLOCKS or PRINTS; (2) we map a protein group of BLOCKS to a protein group of InterPro based on the one-to-one mapping between an entry of BLOCKS and an entry of InterPro (note that both PRINTS and Pfam are member databases of InterPro, and the mapping between PRINTS and Pfam is clear); (3) we use existing cross-links between protein groups of InterPro and domains of Pfam to determine the cross-links between the motifs found by PPIExtend/FPClose* and Pfam domains.
In this way, we can map our motif pairs into domain pairs with Pfam domain entries. Note that the mapping between motif pairs and domain pairs is not one-to-one.
We observed that the motif pairs found by PPIExtend can map to 81 distinct domain pairs in iPfam. However, only 18 domain pairs were reported in (Li et al., 2006). This is a significant improvement and the main reason is the use of quasi-bicliques. In the 81 domain pairs, 48 pairs are domain-domain interactions on one protein (self-loops) and 33 pairs are domain-domain interactions on different proteins. Although the self-loops form a large portion, we still find many other domain-domain interactions that are not self-loops.

Protein interaction sites: a case study
In this section, we present detailed information about binding motif pairs that can be mapped to interacting domain pairs. The first motif pair is derived from a protein group pair in which the left protein group contains 7 proteins and the right protein group contains 10 proteins. There are 66 interactions between the two groups of proteins.
Using the hypergeometric probability model, the p-value of the protein group pair is less than 1.57 × 10 −  Table 3 for more details. Our binding motif pair can map into the domain pair (PF00672, PF01036) in iPfam. iPfam shows that the HAMP domain interacts with the Bac_rhodopsin domain in protein complexes such as 1h2s. 1h2s is the complex of Natronobacterium pharaonis sensory rhodopsin II (sRII) with the receptor-binding domain of HtrII. The X-ray structure of 1h2s was obtained at 1.93 Å resolution (Gordeliy et al., 2002) and it provided an atomic picture of the first step of the signal transduction. The interactions in the sRII-HtrII complex have been intensively investigated to find the signal relay mechanism from the receptor to the transducer (Bergo et al., 2005; Inoue et al., 2007; Sudo et al., 2007). The 3D structure of the interactions is shown in Fig. 4(a) and 4(b), which were generated by Protein Explorer (Martz, 2002). The shortest residue-residue distance between the two motifs in a pair is also interesting. In protein complex 1h2s, there are two chains: chain A (1h2s_A) and chain B (1h2s_B). The left motif is located at positions 168-186 of 1h2s_A, and the right motif is located at positions 61-69 of 1h2s_B (Table 3). We downloaded the coordinate information of 1h2s from http://www.ebi.ac.uk/msd-srv/msdlite/atlas/summary/1h2s.html, and computed the residue-residue distances between the two motifs. The shortest residue-residue distance is 4.07 Å, between atom 1346 of residue 177 in 1h2s_A and atom 2018 of residue 69 in 1h2s_B (Fig. 4(b)). The average shortest residue-residue distance is 9.17 Å. From these
Fig. 4. (a) The 3D structure of 1h2s (asymmetric unit). (b) The backbone structure of the two motifs in 1h2s.
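A shortest distance of this kind can be computed by comparing every atom of one motif with every atom of the other. The sketch below assumes that the atoms of each motif have already been extracted as `(residue_number, atom_serial, (x, y, z))` tuples; the function name and the coordinates are invented for illustration and are not the 1h2s data.

```python
import math


def shortest_distance(atoms_a, atoms_b):
    """Return (distance, (residue, atom) in A, (residue, atom) in B) for the closest pair."""
    best = None
    for res_a, serial_a, xyz_a in atoms_a:
        for res_b, serial_b, xyz_b in atoms_b:
            dist = math.dist(xyz_a, xyz_b)   # Euclidean distance in angstroms
            if best is None or dist < best[0]:
                best = (dist, (res_a, serial_a), (res_b, serial_b))
    return best


# Made-up coordinates for two atoms in each motif.
motif_a = [(177, 1, (0.0, 0.0, 0.0)), (178, 2, (10.0, 0.0, 0.0))]
motif_b = [(69, 3, (3.0, 4.0, 0.0)), (68, 4, (0.0, 0.0, 20.0))]
closest = shortest_distance(motif_a, motif_b)
```

For real structures the double loop is quadratic in the number of atoms, which is acceptable for motifs of a few dozen residues.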

Prediction of binding sites
After obtaining candidate domains (conserved regions) from the multiple sequence alignment, we can further verify whether a pair of predicted domains really interact with each other by using tools for protein binding site prediction. Here we briefly introduce a method originally presented in (Guo & Wang, 2011). This method assumes that the 3D structures of the two given proteins are known.
Given two complete protein structures, the task is to find the binding sites between the two proteins. The method contains three steps. Firstly, we do local sequence alignment at the atom level to get the alignments of conserved regions. Those alignments of conserved regions may contain some gaps. Secondly, among the conserved regions obtained in Step 1, we use the 3D structure information to identify the surface segments. Finally, for any pair of the surface segments identified in Step 2, we compute a rigid transformation to compare the similarity of the two substructures in 3D space and output the qualified pairs as binding sites. When computing the rigid transformations, we treat each protein as a molecule with some volume and introduce a method to ensure that the two whole protein 3D

Conclusion
We have proposed algorithms for the maximum vertex quasi-biclique problem and illustrated their application to finding protein-protein binding sites. The general approach contains three steps: (1) find quasi-bicliques from PPI networks; (2) do multiple sequence alignment for each of the groups in the quasi-biclique and identify possible domains on the protein sequences; (3) use other methods, e.g., the one in (Guo & Wang, 2011), to further confirm the binding sites.

How to reference
In order to correctly reference this scholarly work, feel free to copy and paste the following: Lusheng Wang (2012 http://www.intechopen.com/books/protein-protein-interactions-computational-and-experimental-tools/miningprotein-interaction-groups