A Pattern Search Method for Discovering Conserved Motifs in Bioactive Peptide Families

Bioactive peptides play critical roles in regulating most biological processes in animals, and they have considerable biological, medical and industrial importance. Peptides belonging to the same family are often characterized by a typical short sequence motif (pattern) that is functionally conserved among the family members. In this chapter, we design a pattern search method to facilitate the detection of such conserved motifs. First, all known bioactive peptides annotated in Uniprot are collected and classified, and the program Pratt is used to search the unaligned peptide sequences in each family for conserved patterns. The obtained patterns are then refined by taking into account the information on amino acids at important functional sites collected from literature, and are further tested by scanning them against all the Uniprot proteins. The diagnostic power of the patterns is demonstrated by the fact that, while the number of false positives is kept at zero to ensure that the signatures are exclusive to peptides and their precursors, nearly 94% of all known peptide family members match one or several of the identified patterns. In total, we identified 155 novel peptide patterns in addition to the 56 established ones in the PROSITE database. Together the patterns represent 110 peptide families, of which 55 are not characterized by PROSITE and 12 are also not covered by other existing motif databases, such as Pfam. Using the newly uncovered peptide patterns as a search tool, we predicted 95 hypothetical proteins to be putative peptides or peptide precursors.


Introduction

Problem statement and background
Whole genome sequencing projects have made immense amounts of sequence data available at a pace that far exceeds the rate of annotation. As a result, of the 1.7 million protein sequences currently available for all completely sequenced metazoan genomes, nearly 15% could not be assigned any putative function. Although several tools and algorithms contribute to the putative functional assignment of proteins, large numbers of proteins remain uncharacterized. In most cases this is due to low sequence similarity with known proteins; alternatively, the existing similarities can be confined to only very small part(s) of the entire protein. The latter is especially true for precursor proteins coding for bioactive peptides. Consequently, there is still a need for bioinformatic tools to predict the function of the very large number of unknown protein sequences.
Bioactive peptides occur throughout the animal kingdom, from the least evolved phyla to the highest vertebrates (Filipsson et al., 2001; Masashi et al., 2001). They play key roles as signaling molecules in many, if not all, physiological processes, for instance as peptidergic neurotransmitters or neurohormones, as peptidergic toxins, or as growth factors (Boonen et al., 2007; Boonen et al., 2010). They are synthesized in the cell as large preproproteins (precursors), a special class of proteins that undergo extensive post-translational processing before yielding the final mature bioactive peptides (Schoofs & Baggerman, 2003). Peptides and their precursors that are structurally and functionally related have been classified into peptide families; each family is assumed to be derived from a common ancestor (Husson et al., 2009). During evolution, the protein sequences may have diverged considerably, but the essential amino acids involved in the biologically important activities are still present. These conserved amino acids, along with their particular sequential order, form the functional foundation and represent the motif (pattern) of a peptide family.
However, over the course of natural adaptation, different peptide families have diverged at different rates. For some peptide families, the similarity extends over a long region or even over the entire precursor sequence; for many others, a short, highly conserved motif is responsible for the function of the precursor proteins throughout the family, and the sequence fragments outside the conserved regions often display no significant similarity (Baggerman et al., 2005). The latter characteristic is further illustrated by the many short but biologically important peptides released from known large precursors as annotated in Uniprot, such as the 3-amino-acid thyroliberin peptide 'QHPamide' (Vandenborne et al., 2005) and the 4-amino-acid neuropeptide 'FMRFamide' (Baggerman et al., 2002). For some mature peptides, the precursor proteins (genes) are unknown, such as the 2-amino-acid neuropeptide 'GWamide' (P83570) from Sepia officinalis (Henry et al., 1997) and the human growth-modulating peptide 'GHK' (P01157) (Schlesinger et al., 1977). The existence of numerous short bioactive peptides within the precursor proteins implies that only a very small conserved peptide motif may be the biologically important functional portion of a precursor.
Because only short sequence regions are conserved, peptides or their precursors are sometimes not identified by existing sequence alignment algorithms, e.g. BLAST, or by motif search methods. While BLAST programs (Altschul et al., 1997) are very suitable for scanning databases for homologous proteins, they are far less efficient at finding similarities to short conserved regions, which can be only a few amino acids in length, when a whole genome sequence is scanned. For large precursors, which are usually a few hundred amino acids in length and in which the biologically conserved regions are limited, the important domains are often masked by long unrelated sequence regions. This is because for any two large random protein sequences, BLAST can usually find a relatively long local alignment, at least longer than the short conserved peptide motif, and BLAST tends to assign a higher score to a longer alignment (Durbin et al., 1998). In addition, if a pair of homologues involves a short independent peptide molecule, either an unknown peptide sequence as query or a known mature peptide as target from a protein database, it is difficult for BLAST to detect the pair, because the short sequence makes the pairwise alignment less likely to obtain a significant BLAST score (e.g., e-value < 0.01).
Like BLAST, motif search methods are important tools for searching for a protein in a database; nevertheless, they are also limited in their ability to detect all members of a characterized peptide family. Most of the motifs in the existing databases, e.g. PROSITE (Hulo et al., 2004) and Pfam (Finn et al., 2010), cover the entire precursor sequences or sequence domains which are much longer than the conserved bioactive peptide regions. Therefore, the database motifs show their weakness when they are used to detect short mature peptides for which the precursors are unknown and the information on the sequences outside the peptide regions is thus missing. In addition, the construction of these motifs requires a good multiple protein sequence alignment in order to produce an accurate signature. This works well when the sequences are easy to align. However, for peptide families in which the conserved regions are very short and the bulk of the precursor sequence is not well preserved, the multiple alignment is very difficult to obtain or evaluate. The overall precursor sequence identity, especially in distantly related homologues, may be too low for an accurate alignment. In some cases, the short conserved regions are repeated within a precursor, making it even more challenging to build a unique alignment that truly reflects the evolutionary relationship.
In this chapter, we have followed an alternative approach, taking unaligned sequences as a starting point and using a pattern search program to look for conserved patterns. We first collected all currently annotated peptides and peptide precursor proteins in Metazoa through a search in Uniprot and classified them into peptide families. Next, we extracted the peptide sequences in each family and used the program Pratt to search the sequences for representative patterns. Such patterns consist of highly conserved positions that can be separated by fixed or variable spacing. The patterns were then refined by incorporating the information available in literature on the important amino acids within the biologically active site(s) of the peptides. The specificity of the generated patterns was further verified by scanning them against Uniprot in order to ascertain that proteins picked up by the patterns are either annotated as peptides or peptide precursor proteins or have an unknown function.

Peptide precursor collection and classification
A protein was collected into a peptide-precursor database if it is annotated in the Uniprot protein database (release 6.6), consisting of Swiss-Prot (release 48.6) and TrEMBL (release 31.6), with one of the following keywords: hormone, antimicrobial, toxin. The hormone group includes bombesin, bradykinin, cytokine, glucagon, growth factor, hormone, hypotensive agent, insulin, neuropeptide, neurotransmitter, opioid peptide, pyrokinin, tachykinin, thyroid hormone, vasoactive, vasoconstrictor and vasodilator (the definitions of the keywords can be found in this database). The antimicrobial group consists of antibiotic, antiviral defense, defensin and fungicide, while the toxin group includes naturally produced and secreted poisonous proteins that damage or kill other cells. However, when a protein is also characterized by non-peptide keywords, such as receptor, signal-anchor, transmembrane, binding protein, DNA binding, nuclear protein, transport, collagen, enzyme or words ending in 'ase' (excluding 'disease'), it is excluded in order to avoid selecting proteins which are not peptides or peptide precursors.
Stand-alone PSI-BLAST (ftp://ftp.ncbi.nih.gov/blast/executables/) is then used to align all the assembled sequences with all the Uniprot proteins except those already in the peptide-precursor database. Based on the conserved sequence characteristics of peptide families, the score matrix PAM30 is used and the word size is set to 2, allowing the search for short but strong similarities. Proteins which show significant similarity (e-value < 0.01) to the known peptides or precursors are retained. The obtained list is then checked manually in terms of the proteins' cellular location, molecular function and biological process as stated by GO (gene ontology) terms or in literature. As a result, 1345 more proteins which have not yet been annotated in Uniprot are added to the peptide-precursor database.
Proteins collected in this database are automatically classified into peptide families if their family classification is available in Uniprot, where it is based on a significant match to an existing motif or on sequence similarity. Otherwise, proteins that display sequence similarity with a significant BLAST score are clustered into the same family. A protein can also be assigned to a particular family based on its molecular function described in literature.
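The keyword-based inclusion and exclusion rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is_candidate and the constant names are hypothetical, keywords are matched as case-insensitive substrings, and the input is assumed to be the list of keyword strings of one database entry. How the keywords were actually matched is not specified in the text.

```python
import re

# Keyword lists copied from the text (lower-cased); substring matching
# is an assumption of this sketch.
PEPTIDE_KEYWORDS = (
    "bombesin", "bradykinin", "cytokine", "glucagon", "growth factor",
    "hormone", "hypotensive agent", "insulin", "neuropeptide",
    "neurotransmitter", "opioid peptide", "pyrokinin", "tachykinin",
    "thyroid hormone", "vasoactive", "vasoconstrictor", "vasodilator",
    "antimicrobial", "antibiotic", "antiviral defense", "defensin",
    "fungicide", "toxin",
)
NON_PEPTIDE_KEYWORDS = (
    "receptor", "signal-anchor", "transmembrane", "binding protein",
    "dna binding", "nuclear protein", "transport", "collagen", "enzyme",
)
# Any word ending in 'ase' (e.g. 'kinase'), except the word 'disease'.
ASE_WORD = re.compile(r"\b(?!disease\b)[a-z]+ase\b")


def is_candidate(keywords):
    """Return True if an entry has a peptide keyword and no excluding keyword."""
    kws = [k.lower() for k in keywords]
    has_peptide_kw = any(p in k for k in kws for p in PEPTIDE_KEYWORDS)
    has_non_peptide_kw = any(
        n in k for k in kws for n in NON_PEPTIDE_KEYWORDS
    ) or any(ASE_WORD.search(k) for k in kws)
    return has_peptide_kw and not has_non_peptide_kw
```

With this sketch, an entry tagged "Hormone" and "Secreted" passes, whereas an entry tagged "Toxin" and "Hydrolase" is rejected by the 'ase' rule.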

In silico extraction of peptides
From each precursor protein in a peptide family, the bioactive peptide sequences are extracted in silico using the start and end positions of the subsequences annotated as 'peptide' or 'chain' in the 'feature' lines of the corresponding protein file in Uniprot. The conserved basic cleavage sites flanking the peptides, which contribute to the endoproteolytic cleavage of the peptides from their precursors, such as the monobasic sites (G)R or (G)K, the dibasic sites (G)KR, (G)RR, (G)KK or (G)RK, or a combination of consecutive K or R, are also extracted along with the subsequences (Liu & Wets, 2005; Rouille et al., 1995). Entries in the family that consist only of the peptide sequence, i.e. those cases where the precursor is unknown, are also retained. Proteins less than 200 aa (amino acids) in length which contain an N-terminal signal peptide and for which no mature peptides have yet been identified presumably contain a single peptide, and are therefore also included after in silico removal of the N-terminal signal peptide. According to the statistics on all annotated bioactive peptide sequences in Uniprot, 97% are no longer than the 200 aa threshold. The presence of a signal peptide is assumed when it is indicated in Uniprot; in other cases, it is predicted by the signal peptide prediction program SignalP (http://www.cbs.dtu.dk/services/SignalP/). In total, 110 datasets of peptide families are formed, each including at least 10 peptide sequences. All the extracted peptide sequences in each family were scanned independently for patterns conserved in the corresponding family.
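As an illustration of the extraction step, the following Python sketch cuts an annotated peptide out of a precursor together with its flanking basic cleavage sites. It rests on assumptions that are not stated in the text: the function name extract_with_flanks is hypothetical, positions are taken to be 1-based and inclusive as in Uniprot feature annotations, the N-terminal flank is taken to be any run of K/R directly preceding the peptide, and the C-terminal flank an optional G followed by a run of K/R.

```python
import re

# C-terminal flank: optional Gly (the "(G)" in the text), then a run of K/R.
C_SITE = re.compile(r"G?[KR]+")
# N-terminal flank: a run of K/R ending immediately before the peptide.
N_SITE = re.compile(r"[KR]+$")


def extract_with_flanks(precursor, start, end):
    """Return the peptide at 1-based inclusive [start, end] plus flanking sites."""
    left, right = start - 1, end  # convert to a 0-based half-open slice
    n_flank = N_SITE.search(precursor[:left])
    if n_flank:
        left = n_flank.start()
    c_flank = C_SITE.match(precursor, right)
    if c_flank:
        right = c_flank.end()
    return precursor[left:right]
```

For the toy precursor "MAAKRFMRFGKRSS" with the peptide annotated at positions 6 to 9, the sketch returns "KRFMRFGKR"; a peptide without flanking basic residues is returned unchanged.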

Method
Several software tools available on the internet allow users to search for patterns conserved in a set of unaligned protein sequences. Pratt (http://www.ebi.ac.uk/pratt/#) (Jonassen et al., 1995) is a pattern search tool that is flexible in the number of parameters that can be controlled by users. It allows searching for patterns of conserved positions with limited variable-length spacing, which is important because even in well-conserved peptide regions, variable loop sizes can occur. Pratt is run on each of the peptide family datasets, and the search parameters are set based on the maximum pattern length and pattern flexibilities found in the existing peptide patterns in PROSITE.
Each Pratt run starts with the minimum percentage of sequences that must match the pattern (the parameter C%) set to 90%, and the most significant pattern, i.e. the one with the highest fitness in the Pratt output list, is retained. The obtained pattern is then refined by integrating the information on the important functional sites in the matched peptide sequences described in literature. The amino acids occurring at these sites are added to the pattern if they are absent at the corresponding sites in the pattern. The pattern is further verified by scanning it against all the Uniprot proteins using the ScanProsite tool (http://www.expasy.org/tools/scanprosite/). Two cases can occur: (1) If the pattern is not contained in any known non-peptide protein, it is retained as a conserved peptide pattern. (2) Otherwise, if the pattern is matched by both peptide and non-peptide proteins (further referred to as true and false positive hits, respectively), it is processed as follows. (2a) If the pattern does not include any wildcard region where any amino acid is accepted, the positions at which the pattern is located in all matching protein sequences are checked. If the pattern occurs exclusively at the N- or C-terminus of the true positive hits, or if the peptide proteins are all small molecules, the pattern is retained with a constraint ('<' or '>') imposed at the N- or C-terminus of the pattern to limit the maximum distance between the conserved pattern region and the N- or C-terminus of the peptide or precursor protein. If the pattern with such a restriction cannot distinguish the true positives from the false ones, the pattern is eliminated. (2b) If the pattern has wildcard regions, the sequence fragments corresponding to the pattern in all the matching sequences are extracted and aligned. If, in a wildcard region X of this alignment, the amino acids found in the true positive hits and those found in the false positive hits have different physicochemical properties, the region X is replaced by the group of amino acids occurring distinctively in the true positive proteins. In the other case, when the two groups of amino acids share identical physicochemical properties, the pattern is discarded. The amino acid symbol sets DE, KRH, NQ, ST, ILV, FWY, AG, C, M and P, which are classified based on the physicochemical nature of the side groups (Smith & Smith), are used.
If a conserved pattern cannot be obtained, the parameter C% is reduced by 10%, and Pratt is re-run against the same dataset. As the percentage of sequences required to match the pattern decreases, a pattern which is usually longer and contains more sites than the previous one emerges and is processed by the same refinement and verification. The procedure is repeated until a pattern is discovered which represents the majority of a group of related peptide sequences and rules out any known non-peptide proteins.
Once a conserved pattern is identified in the peptide family dataset, the program ps-scan (ftp://ftp.expasy.org/databases/prosite/tools/ps_scan/sources/) is run locally with the pattern against this dataset. The sequence regions which match the pattern are removed from the original peptides. Each of the two remaining parts of a peptide sequence, at its N- and C-terminus, is kept as an independent sequence if it is at least 4 aa in length, given the assumption that the minimum length of the peptide patterns we search for is not less than this value. Thus, a reduced dataset is created which includes not only the peptides that are not covered by the identified pattern, but also the remaining sequences of the original peptides that match the pattern. This methodology is based on the fact that a peptide precursor protein may contain several conserved regions, and that our extracted peptide sequences include long peptide chains which may contain a few shorter, unrelated, bioactive peptides. The reduced peptide family dataset is then scanned by Pratt to discover the next pattern. The search procedure is repeated until the parameter C% is less than 50%. This means that the remaining dataset contains no more patterns representing the majority of the sequences. Fig. 1 represents the scheme of the described pattern searching procedure, which aims to examine short bioactive peptide sequences rather than their large precursor molecules, and to take into account not only the biologically functional sites of each individual peptide discussed in literature, but also the general information extracted by the computational tool Pratt from all related peptides in a family.
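The dataset reduction step, in which matched regions are removed and sufficiently long remainders are kept, can be sketched as follows. This is a minimal illustration, not the ps-scan based implementation used in the chapter: the function name reduce_dataset is hypothetical, the pattern is assumed to be already available as a compiled Python regular expression, and a peptide with several matches is simply split at every match.

```python
import re


def reduce_dataset(peptides, regex, min_len=4):
    """Remove pattern matches and keep unmatched peptides and long remainders."""
    reduced = []
    for seq in peptides:
        pos, matched = 0, False
        for m in regex.finditer(seq):
            matched = True
            fragment = seq[pos:m.start()]  # part preceding this match
            if len(fragment) >= min_len:
                reduced.append(fragment)
            pos = m.end()
        if not matched:
            reduced.append(seq)  # peptide not covered by the pattern
        elif len(seq) - pos >= min_len:
            reduced.append(seq[pos:])  # C-terminal remainder
    return reduced
```

For example, with the literal pattern FMRF, the peptides "AAAAFMRFGG", "SDPFLRF" and "QDVFMRFAAAAA" reduce to "AAAA", "SDPFLRF" and "AAAAA": remainders shorter than 4 aa ("GG", "QDV") are dropped, and the unmatched peptide is carried over unchanged to the next round.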

'PeptideMotif' database
We have built a peptide-precursor database consisting of 11,688 peptides and precursor proteins originating from 1420 metazoan organisms, of which 11,437 proteins (98%) are categorized into 110 distinct peptide families. Based on the bioactive peptide sequences drawn from the peptide families, we uncovered in total 211 conserved patterns, which are assembled into the peptide motif database 'PeptideMotif'. The patterns range between 4 and 52 amino acids (columns) in length, with 78 (37%) no longer than 10 aa. While each of the patterns covers most of the peptides or precursors belonging to the corresponding family, the false positives are kept at zero, as guaranteed by the criterion that any known protein matching the pattern is indeed a peptide or precursor protein from this family.

Comparison with the other motif databases
The PROSITE database (http://ca.expasy.org/prosite) is a motif database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. Its release 19.9 contains 56 entries (patterns) describing 55 peptide families in Metazoa (the omega-atracotoxin family has two patterns), belonging to the categories of cytokines and growth factors, hormones and active peptides, and toxins. All 55 families are also covered by patterns in the 'PeptideMotif' database, and these peptide patterns (Table 1) are similar in length to their PROSITE counterparts. However, in terms of the conserved sequence characteristics revealed by the motifs in both databases, more amino acids are allowed at the conserved sites or wildcard regions in the 'PeptideMotif' patterns. This is because the identified peptide patterns are trained not only against the Swiss-Prot protein database, which is also used as the test dataset by PROSITE, but also against the TrEMBL database, in which many proteins are likewise annotated by keywords or literature. In addition, for 25 of the 55 families, we have found 34 additional novel patterns, which are marked as 'new' in Table 1.
The remaining 121 'PeptideMotif' patterns, presented in Table 2, allow the identification of 55 peptide families that are not covered by PROSITE signatures; they cover 3866 bioactive peptide sequences cleaved from 3572 precursors. Among these patterns, 28, representing 12 families, are also not characterized by any other motif database, such as Pfam (Bateman et al., 2004) and CDD (Marchler-Bauer et al., 2005). The conserved sequence in these families is short and often occurs repeatedly within the same precursor protein. The sequences outside the conserved region are not well preserved, and thus a probability model based on protein sequence alignments cannot efficiently characterize such peptide families.
Fig. 1. Procedure for searching patterns in peptide sequences. Note: The parameters are set as follows: the maximum pattern length (PL) is 52, the maximum length of a wildcard (PX) is 15, the maximum number of flexible wildcards (FN) is 3, the maximum flexibility of a flexible wildcard (FL) is 8, the upper limit on the product of flexibilities for a pattern (FP) is 48, the minimum percentage of sequences to match the pattern (C%) is 90, 80, 70, 60 and 50%, respectively, and all other parameters are at default.
(2) HBGF/FGF; (1) (5) Platelet-derived growth factor (PDGF); TGF-beta; (1) interferon alpha, beta and delta; 2)-P; {29, 29, 1}; {Q9VJL6, Q29NM1}

Hormones (10) ACTH_domain and opioid neuropeptides
Table 2. The novel conserved peptide patterns. Note: each family is described by the following items: (1) the name of the family; (2) all identified patterns; patterns marked with 'identical' are completely identical to their PROSITE counterpart and the ones marked as 'new' are novel to PROSITE in Table 1; (3) the number of true positive peptide or precursor proteins, the number of matches to the pattern, and the number of false negative hits, all given in one bracket; (4) if there are novel putative peptides or precursors predicted by the patterns of the family, they are listed in a second bracket.

Case study
Patterns representing the family of opioid and POMC-derived peptides and the family of FMRFamide and related neuropeptides (FARPs) are shown here as test cases, in order to provide insight into the conserved sequence characteristics of many known peptide families and into how the peptide patterns deduced from these characteristics perform.
Fig. 2. Sequence alignments between Q28409 and P01210/Q8AX66/Q4RIZ7 by BLAST. Notes: the conserved opioid peptide sequence similarities are in bold.
No signature represents the subfamily in PROSITE; three Pfam motifs describe the proteins, namely PF08384 (45 columns), PF00976 (41 columns) and PF08035 (31 columns). These motifs capture separate conserved regions located, respectively, at the N-terminus of the precursors after removal of the signal peptide, and at the sequences coding for the ACTH and 'beta-endorphin' peptides. However, the remaining parts of the precursors encoding the peptides gamma-MSH (12 aa) and beta-MSH (17 aa) are left untouched. As a result, 27 mature peptides or sequence fragments, e.g. Q9PRN3 from the sea lamprey, horse P01202 and leech P41989, cannot be detected by any of these Pfam motifs.
Query= Q28409|PENK_FELCA Proenkephalin A - Felis silvestris catus (Mammalia), Length=187; > P01210|PENK_HUMAN Proenkephalin A precursor - Homo sapiens (Mammalia), Length=267; Score = 429 bits (1004), Expect = 1e-118
The BLAST alignment between Q9PRN3 and all proteins in the nr database reveals that, although Q9PRN3 cannot be identified by the Pfam motifs, it shares the highly conserved 'PeptideMotif' pattern 'Y-x-[MV]-x-H-F-R-W' with other POMC subfamily members, e.g., Q2L6A9 from Hyperoartia, P01193 and Q53WY7 from Mammalia, and Q32U15 from Amphibia (Fig. 4). This 8-column peptide pattern is a part of the 41-column Pfam motif PF00976. While the sequence region described by this Pfam motif may be an entire functional or structural domain, the peptide pattern contained within the longer domain is probably the most essential functional part. In total, our procedure identifies six novel peptide patterns for the combination of these two subfamilies. Among all the 397 proteins in this family, 113 were found to contain two of the peptide patterns, and the rest match one of them. These patterns characterize conserved domains located at different regions of a precursor sequence, and each of them can exclusively represent an opioid or POMC peptide or its precursor protein.
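To make the pattern notation concrete, the sketch below translates a PROSITE-style pattern such as 'Y-x-[MV]-x-H-F-R-W' into a Python regular expression and scans a sequence with it. It is an illustrative stand-in for tools such as ScanProsite or ps-scan, not a reimplementation of them: the function name prosite_to_regex is hypothetical, and only the common syntax elements are handled (x wildcards, [..] allowed sets, {..} excluded sets, (n) and (n,m) repetitions, and the '<' and '>' terminal anchors).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a compiled Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, rep = m.group(1), m.group(2)
        if core == "x":  # wildcard: any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + ("{" + rep + "}" if rep else "") + suffix)
    return re.compile("".join(parts))


# The POMC pattern from the text, scanned against alpha-MSH (SYSMEHFRWGKPV).
pomc = prosite_to_regex("Y-x-[MV]-x-H-F-R-W")
hit = pomc.search("SYSMEHFRWGKPV")
```

Here the pattern compiles to the regular expression Y.[MV].HFRW, and the match found in the alpha-MSH sequence is the 8-residue segment YSMEHFRW.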

FMRFamide and related neuropeptides (FARPs)
It is widely known that FARPs occur throughout the whole animal kingdom, and this family is therefore an ideally suited test case to check whether the disclosed patterns are capable of retrieving FARPs from all metazoan species (Ubuka et al., 2009). In total, 23 conserved peptide patterns have been uncovered in the family, and they match 214 FARP sequences with 605 hits, owing to the presence of multiple copies of the conserved patterns within some precursor proteins. The identified FARPs are distributed among a wide range of phyla, including Nematoda (85), Arthropoda (50), Mollusca (24), Annelida (9), Platyhelminthes (1), Cnidaria (10) and Chordata (35). An 11-column Pfam motif, PF01581, characterizes FARPs from all above-mentioned phyla except Chordata (e.g. human Q9HCQ7 and mouse Q9WVA8). In addition, in contrast to the 'PeptideMotif' patterns, 49 FARP peptides or precursor proteins in these characterized phyla, e.g., Q9TWD2 from Lymnaea stagnalis and Q95QP2 from Caenorhabditis elegans, cannot be detected by the Pfam motif with a significant score (e-value < 0.01). The Clustal-W multiple alignment of all these FARP sequences, either together or within each of the seven phyla, using default parameters (http://www.ebi.ac.uk/clustalw/) shows that the FARP precursors display sequence similarities within the mature peptide regions, particularly in the area containing the conserved peptide patterns, and that the remaining parts of the precursor sequences display rather low similarities. The FARP peptide precursors also differ from each other in the number of peptide repeat units within the sequences, which is thought to have arisen by unequal crossover events (Lee et al., 1998). In addition, we observed that most of the mature FARP peptides share common C-terminal sequences but have strongly diverged N-terminal extensions. All these make it problematic to construct an accurate multiple alignment in order to derive a statistical

Conclusion
Protein domains are highly conserved throughout evolution and several databases are available that catalogue protein families and domains. Such motif and domain databases are very useful in assigning a putative function to an unknown protein. Peptide precursor proteins are a distinctive class of molecules because they undergo various post-translational modifications in order to ultimately yield stable and functional mature peptides, which makes the annotation of peptides and peptide precursor proteins challenging. This is illustrated by the fact that many metazoan peptides and peptide precursors are not represented by the motifs currently present in widely used motif databases such as PROSITE. Because of the rapidly increasing number of protein sequences and the wide range of peptide families, a comprehensive database of conserved patterns typical of endogenously occurring mature peptides is of great value in identifying new peptides and precursor proteins and in keeping up with the rate of sequencing. We have therefore designed a search procedure to find conserved patterns within the known peptides, and as a result, we have constructed a 'PeptideMotif' database that is representative of most currently known peptide families.
Many peptides have been isolated and sequenced as mature peptides, and their precursor proteins are often still unknown. These small peptides are therefore difficult to identify with other motif databases. Databases such as Pfam contain two Hidden Markov Models (HMMs) for each family, based on a multiple protein sequence alignment: one built to find complete domains (ls mode) and the other to match fragments of domains (fs mode) (Durbin et al., 1998). These motifs are sensitive at identifying complete domains and can thus efficiently detect proteins whose similarities cover the full-length protein sequence or at least a complete domain. However, these motifs do not work very well when they encounter short peptides which lack information on amino acids at the sites outside the peptide sequences, or when the conserved regions are limited, especially in distantly related proteins where the overall sequence similarity may not be well preserved. In contrast, the patterns derived directly from the mature peptide sequences capture the highly preserved region of the precursor proteins and are thus able to identify not only the peptide precursor molecules but also the fully processed peptides. Conserved peptide sequence patterns correspond to functionally and structurally important parts of the peptides, i.e. the binding site for specific receptors and the disulphide bonds for stability and tertiary structure. The discovery of peptide motifs will undoubtedly be of great value for peptide-related studies ranging from the identification of putative peptides and precursor proteins, to the annotation of critical functional residues (Husson et al., 2010), to complementing peptidomic research in detecting and verifying peptides in vitro (Baggerman et al., 2004; Boonen et al., 2008; Menschaert et al., 2010). For example, scanning the peptide patterns against Uniprot revealed 95 proteins (listed in Tables 1 and 2) which are not yet annotated as putative peptides or precursor proteins.
When determining short functional patterns for peptide sequences, we have to evaluate how representative the peptide motifs are of the 110 characterized peptide families. Short motifs often have some degree of degeneracy, and the presence of a motif in a protein may reflect a conserved functional role, a yet to be discovered structural or functional role, or a non-functional role. Using the currently identified short peptide patterns, with the false positives kept at zero, we observe that 440 (3.8%) of the mature peptides or sequence fragments and 282 (2.5%) of the peptide precursor proteins in the described families cannot be recognized by the peptide patterns. Many of them could be identified by combining the peptide pattern search procedure with the structural hallmarks of bioactive peptides and their precursors (Liu et al., 2006), such as the length of a peptide precursor, which is usually not more than 500 amino acids, the presence of a signal peptide, which directs a precursor protein into the secretory pathway of the cell, and the presence of typical cleavage sites flanking the mature peptides. To recover all false negatives while still eliminating all false positives, which is difficult because of the short length and degeneracy of most short motifs, it may be possible to make use of 3D structural patterns when they become available for peptide precursor proteins. Patterns that integrate 3D structural information of the sequences will be more sensitive in identifying peptides and peptide precursors (Gribskov et al., 1988; Taylor et al., 2004).
While the majority of known peptide families have been profiled by the established peptide patterns, the remaining ones, accounting in total for 251 peptides and precursor proteins (2% of all the proteins in the peptide-precursor database), are not processed by the pattern search procedure. They come from small peptide families, such as eclosion hormones, ecdysis-triggering hormones and apelin, which have only a few homologues so far. A pattern based on a small number of peptides usually cannot represent the family with sufficient confidence, nor can it sufficiently reflect the sequence divergence accumulated over the evolution of the family members. As more peptides and precursor proteins are sequenced, our pattern search procedure can be applied to the corresponding families and the 'PeptideMotif' database will be updated accordingly, keeping the peptide pattern database widely applicable for the identification of critical functional residues and for the annotation of hypothetical molecules in various peptide families.
distantly related proteins from various phyla throughout the evolutionary history of the FARP peptide family.

Table 1. The conserved peptide patterns similar to PROSITE signatures.