Databases can be exploited for undertaking transcription factor studies in plants.
Bioinformatics, a computer-assisted science for managing huge volumes of genomic data, is an emerging discipline that combines the power of computers, mathematical algorithms, and statistical concepts to solve genetic and biological puzzles. It has progressed in parallel with the evolution of genome-sequencing tools, for example, next-generation sequencing technologies, which made it possible to arrange and analyze the sequence information of large genomes. The synergy of “plant omics” and bioinformatics has set a firm foundation for deducing the ancestral karyotypes of multiple plant families, predicting genes, etc. In addition, huge genomic datasets can be assembled to extract maximum information from voluminous “omics” data. Bioinformatics is handicapped by the lack of appropriate computational procedures for assembling sequencing reads of the homologs occurring in complex genomes such as cotton (2n = 4x = 52) and wheat (2n = 6x = 42), and by a shortage of trained multidisciplinary manpower. Moreover, the rapid expansion of sequencing data strains the capacity for acquiring, storing, distributing, and analyzing genomic information. In future, high-tech computational tools and skills, together with improved biological expertise, would provide better insight into genomes, and this information would help sustain crop productivity on this planet.
- data mining
- comparative genomics
- plant genomes
- sequence analysis
- structure prediction
Sustainability in agriculture systems is largely challenged by a number of factors including human population increase, environmental changes, and tremendous demands for growing crops to produce biofuels worldwide [1, 2]. In this regard, exploring the plant genomes for determining the function of important genes involved in conferring tolerance to biotic and abiotic stresses, followed by exploiting these genes in the development of resilient cultivars, is one of the durable strategies for bringing sustainability in crop yields [2, 3].
After the sequencing of the Arabidopsis thaliana genome, a project was launched by the National Science Foundation (NSF) for determining the function of 25,000 Arabidopsis genes. Rice was the first crop to have its genome sequenced (International Rice Genome Sequencing Project 2005), followed by a number of other major crops. All these sequencing projects released a large amount of data. For arranging and analyzing these data, a number of bioinformatic tools have been developed, which have helped in drawing important biological conclusions, predicting gene functions, etc. Furthermore, the development of unconventional mapping populations and online resources of molecular markers helps researchers to identify quantitative trait loci (QTLs). A number of databases have been developed to handle the newly generated genomic data. These databases provide a foundation to build hypotheses, to design experiments, and to infer knowledge about a particular organism. Moreover, the datasets and “omics” resources of numerous species facilitate the comparison of “omics” properties among species, which in turn allows the study of conserved genes and evolutionary relationships. Bioinformatics is a crucial tool for accessing “omics” datasets and gathering substantial biological knowledge.
Major tasks of bioinformatics range from sequence analysis and gene identification to the clustering of related sequences and the study of evolutionary relationships using phylogenetics. They also include the identification and functional annotation of all genes, proteins, and active sites of protein structures in the cell. At present, with the advancement of NGS tools, voluminous sequencing data are emerging. For deducing meaningful information from these data, bioinformatics must coevolve with the genomic tools. In this regard, the three main components on which the whole citadel of bioinformatics is based, namely mathematics, computer science, and biology, should evolve in parallel with the sequencing tools. This would pave the way for deducing useful information (phylogenies, syntenic relationships, predicted genes, and their functions) from the data in the shortest possible time [6, 7].
Databases are collections of organized data that can be retrieved easily from a website for addressing different queries. Managing and handling a database requires dedicated computer hardware and software. The data are organized in structured records that allow easy retrieval of information. Broadly, biological databases are classified into sequence databases, relevant to protein and nucleic acid sequences, and structure databases, relevant only to proteins. The first database was developed a short period after the sequencing of the insulin protein in 1956. The “Protein Data Bank”, developed in 1971, was the first ever biological database. Biological databases have flourished enormously because of the huge amount of data being generated every day. Individual laboratories maintained the preliminary databases of protein sequences; later, a combined formal database, the SWISS-PROT protein sequence database, was introduced in 1986. Now a plethora of data resources are available for study and research purposes, both online and on CD-ROM (on request), and they are constantly updated as new data become available.
Biological databases generally offer software tools to analyze the data they hold and to compare new data with data already available. With the help of these computational methods, much laborious and costly “wet lab” work can be avoided. Future prospects are constrained by hindrances such as limited awareness of the data, complications in data retrieval, the limited availability of data analysis tools, and inadequate access to literature references. The available biological databases can be divided into three categories on the basis of their contents: (1) primary databases, which contain raw nucleotide sequences (GenBank, EMBL, and DDBJ); (2) secondary databases, which contain highly annotated data (SWISS-PROT and Protein Information Resource); and (3) specialized databases, which deal with a particular organism or unique data (FlyBase, WormBase, and TAIR). A major problem in interlinking these databases is the lack of format compatibility. This problem is addressed by using a specification known as the Common Object Request Broker Architecture (CORBA).
At the National Center for Biotechnology Information (NCBI), text-based search and retrieval of information can be undertaken with Entrez, which covers all NCBI databases, for example, PubMed, Nucleotide and Protein Sequences, Complete Genomes, etc. In the sequence retrieval system (SRS), Boolean operators are used for complex searches. SRS is also used for sequence retrieval, abstract searching, references, etc.
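As a minimal sketch of how Boolean operators combine search terms in a text-based query, the hypothetical helper `build_query` below assembles a query string. The bracketed field tags imitate the style used by Entrez, but they are illustrative assumptions and should be checked against the help pages of the database being searched.

```python
def build_query(all_of=(), any_of=(), none_of=()):
    """Combine search terms with the Boolean operators AND, OR, and NOT.

    Hypothetical helper for illustration; it only builds a string and
    does not contact any database.
    """
    parts = []
    if all_of:
        parts.append(" AND ".join(all_of))
    if any_of:
        # Alternatives are grouped in parentheses so that OR binds first.
        parts.append("(" + " OR ".join(any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {term}"
    return query


query = build_query(
    all_of=["Arabidopsis thaliana[Organism]", "WRKY[Title]"],  # assumed field tags
    any_of=["drought", "salinity"],
    none_of=["chloroplast"],
)
print(query)
```

The resulting string can be pasted into a search box or passed to a retrieval client; keeping the three operator groups separate makes it easy to widen or narrow a search step by step.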
2.1. Dedicated databases for plant genomics
A number of databases deal with datasets focused on particular genes and transcription factors (TFs) related to plant issues and cellular processes. A genome-wide survey of the repertoire of TFs encoded in the Arabidopsis genome was described first. The availability of complete genome sequences in the last few years has made it possible to assemble catalogs of TFs based on their function and their association with regulatory systems in different plant species. Numerous databases deliver datasets about genes putatively encoding TFs. These databases are based on predictions made by computational methods (sequence similarity searches and hidden Markov model (HMM) searches for conserved DNA-binding domains). In recent years, GRASSIUS was established to compile resources and tools for comparative genomics of regulatory sequences in grass species. The Grass Transcription Factor Database (GrassTFDB) of GRASSIUS contains the combined sequence information of RiceTFDB, MaizeTFDB, CaneTFDB, and SorghumTFDB, which can be searched through a website. Information on the predicted TF-coding genes (obtained by annotating the genome sequences of three legumes) is available in LegumeTFDB, an extended version of SoybeanTFDB.
The PGSB PlantsDB database framework has been enhanced with new tools, and substantial new data have been added to the system, particularly for the large complex genomes of wheat, barley, and rye. New resources have been established, such as GenomeZipper and CrowsNest for comparative analysis and the RNASeq Expression Browser for expression data. The transPLANT project provides a platform for compiling heterogeneous plant genome data, for example, through integrated searches over multiple databases (Table 1).
|Database||URL||Species|
|PlantTFDB||http://planttfdb.cbi.pku.edu.cn/ [22]||Plant species|
|PlnTFDB||http://plntfdb.bio.uni-potsdam.de/v3.0/ [20]||Plant species|
|GrassTFDB||http://grassius.org/grasstfdb.html||Maize, rice, sorghum, and sugarcane|
|LegumeTFDB||http://legumetfdb.psc.riken.jp/||Soybean, Lotus japonicus, and Medicago truncatula|
|STIFDB||http://caps.ncbs.res.in/stifdb2/||Arabidopsis and rice|
|PGSB PlantsDB||||Barley, wheat, and rye|
3. Analysis of the “omic” data
3.1. Sequence retrieval
The first step is the identification and retrieval of sequences from different databases (NCBI, TAIR, Gramene, Rap-db, TIGR, Phytozome, PlantGDB, UniProt, and SwissProt) developed for handling protein, DNA, RNA, and expressed sequence tag (EST) sequences (Table 2). Sequences can be retrieved not only through query words but also by using BLAST and/or their specific accession numbers. To find similar sequences in the databases, the BLAST variant appropriate to the query and target sequence types can be used.
|S. No.||Tool||URL||Description|
|1.||TRANSFAC||http://transfac.gbf.de/TRANSFAC/||Transcription factor database|
|2.||TFD||http://www.tfdg.com/Pages/tfddata.html||Transcription factor database|
|3.||TRRD||http://www.mgs.bionet.nsc.ru/mgs/dbases/trrd4/||Transcription regulatory region|
|4.||PlantCARE||http://sphinx.rug.ac.be:8080/PlantCARE/||Plant cis-acting regulatory elements database|
|5.||PLACE||http://www.dna.affrc.go.jp/htdocs/PLACE/||Plant cis-acting regulatory elements database|
|6.||RegulonDB||http://www.cifn.unam.mx/Computational_Genomics/regulondb/||Database on transcriptional regulation in Escherichia coli|
|7.||SCPD||http://cgsigma.cshl.org/jian||Promoter database of yeast|
|8.||EPD||http://www.epd.isb-sib.ch/||Eukaryotic promoter database|
|9.||PRATT||http://web.expasy.org/pratt||It is an online server tool used to identify pattern of amino acids|
|10.||Phobius||http://phobius.sbc.su.se/||Identification of signal peptides|
|11.||SignalP 4.0||http://www.cbs.dtu.dk/services/SignalP/||Identification of signal peptides|
|12.||TargetP||http://www.cbs.dtu.dk/services/TargetP/||Subcellular localization of sequences|
|13.||LOCTREE3||https://www.rostlab.org/services/loctree3/||Subcellular localization of sequences|
|14.||Wolf PSORT||http://www.omictools.com/wolf-psort-tool||Subcellular localization of sequences|
|15.||Plant-mPLoc||www.csbio.sjtu.edu.cn/bioinf/plant-multi/||Subcellular localization of sequences|
|16.||Cello v2.5||https://bioinformatictools.wordpress.com/tag/cello/||Subcellular localization of sequences|
|17.||PSI-Pred||http://bioinf.cs.ucl.ac.uk/psipred/||Prediction of transmembrane regions of the gene|
|18.||CIMMiner||http://discover.nci.nih.gov/cimminer/||To explore the expression of a gene or protein on heat map|
|19.||DNASTAR||http://www.dnastar.com/||Making of sequence assembly|
|21.||FoldIndex||http://bioportal.weizmann.ac.il/||It is used to predict folding state|
3.2. Multiple sequence alignment
Multiple sequence alignment (MSA) deals with aligning three or more biological sequences, which may be DNA, RNA, and/or protein. Its primary purpose is to study similarity among sequences, which can help to assess their evolutionary linkage and common ancestry. It can be undertaken with many sequence analysis programs, including but not limited to the ClustalW online software, ProbCons, and MAFFT. Some other MSA tools are DNAMAN, T-Coffee, M-Coffee, R-Coffee, Expresso, PSI-Coffee, PSAlign, PRRN, MUSCLE, POA, MEME, etc.
A number of algorithms are available to generate MSAs of protein and DNA sequences. The basic approach in producing multiple alignments is to optimize the sum-of-pairs (SP) score. This approach is practical and produces high-quality MSAs. The mathematical approach (also called probabilistic or stochastic methods) exploits probability in building an MSA. The hidden Markov model is a classic example of this approach, in which the MSA is represented as a probabilistic model. All possible combinations of gaps, mismatches, and matches are assigned probabilities, and the algorithm finds the most likely MSA. Other approaches are genetic algorithms and simulated annealing, which break a series of possible MSAs into segments and then rearrange them; they can also take an existing MSA and refine it by a series of rearrangements.
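The sum-of-pairs score mentioned above can be illustrated with a short sketch. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name `sum_of_pairs` are assumptions chosen for the example; real aligners use substitution matrices and affine gap penalties.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an MSA by summing pairwise scores over every column.

    The scoring values are illustrative assumptions, not those of any
    particular alignment program.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a gap aligned with a gap is ignored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


msa = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(msa))  # 3
```

An alignment program searching for the optimal MSA would compare this score across candidate alignments and keep the highest-scoring one.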
3.3. Domain and motif study
A domain is a conserved part of a protein sequence and structure that can evolve, function, and exist independently, whereas a motif is a conserved sequence of protein or DNA that is maintained to execute a certain function. For the characterization of a gene, it is advisable to study its functional domains and motifs. Newly identified sequences can be analyzed for domains and motifs to predict their functions. For motif analysis, the MEME tool can be used, while for domain analysis Pfam, InterProScan, and SMART can be used.
Large protein molecules comprise structural and functional domains. Structural domains are either compact, globular modules or regions clearly separated from the flanking regions, including membrane regions or long coiled-coil helices that separate the other domains. These domains can be seen in proteins as semi-independent three-dimensional (3D) units that have the ability to fold independently. They constitute the “units of evolution” and have typical functions. The Structural Classification of Proteins (SCOP) database has been used extensively for assigning domains in proteins. Most databases and methods (e.g., the Class Architecture Topology Homology database) are not fully automated and combine several other methods for assigning domains to proteins. Protein Informatics System for Modeling (PrISM) is the only completely automated method that can be used to assign sequence-continuous domains to proteins of known 3D structure. If the 3D structure of the protein is not known, a number of alternative methods and databases are available; one of the most prominent is the putative protein domains database (ProDom).
3.4. Structural analysis
Modeller is used to generate 3D structures. The LOMETS server is used to find the best template for comparative modeling. The DOPE (discrete optimized protein energy) score helps to find the best model by calculating a value for each structure, which is then evaluated through PROSAII and PROCHECK. To calculate the electrostatic surface and solvation properties of complex compounds, APBS is used. For structure alignment, the PDBsum tool is deployed. The structure of a gene can also be displayed on GSDS2.0 (Gene Structure Display Server). YASARA software is used to draw the 3D structure, C-terminus, N-terminus, and domains of proteins. The chromosomal position of genes can be located with the NCBI Map Viewer tool, MapChart 2.1, and the cucumber genome database map viewer tool.
3.5. Analysis of regulatory elements
The regulatory elements encode a protein that binds to the promoter or operator region of a gene to up- and/or downregulate its expression. For instance, catabolite activator protein (CAP) is a regulatory element present in prokaryotes that regulates the lac operon.
Regulation of gene expression takes place at the transcription level through specific proteins known as transcription factors, which inhibit or initiate transcription. These factors can be repressors, activators, or both. Repressors inhibit the binding of RNA polymerase to the promoter, thus blocking transcription, whereas activators facilitate the binding of RNA polymerase to the promoter.
These elements can be found in silico by deploying the PlantCARE and PLACE programs. PLACE is a repository of motifs found in cis-acting regulatory DNA elements of plants. This database also gives information about the variations in motifs found in different genes or plant species. Relevant literature and a comprehensive description of the different motifs can be retrieved from this database. Several research groups have identified a number of genes, including WRKY genes, Ascorbate Peroxidase, PSY, etc., using different bioinformatics tools [38–40].
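To show what an in silico search for cis-acting elements involves, the following sketch scans both strands of a promoter sequence for a motif written as a regular expression. It is not the PlantCARE or PLACE implementation; the function names, the example promoter, and the W-box pattern `TTGAC[CT]` are assumptions used only for illustration.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_motifs(seq, motifs):
    """Report (name, start, strand, match) for every motif occurrence.

    Start positions are 0-based and always refer to the forward strand.
    A lookahead is used so that overlapping occurrences are also found.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    hits = []
    for name, pattern in motifs.items():
        regex = re.compile(f"(?=({pattern}))")
        for m in regex.finditer(seq):
            hits.append((name, m.start(), "+", m.group(1)))
        for m in regex.finditer(rc):
            start = len(seq) - m.start() - len(m.group(1))
            hits.append((name, start, "-", m.group(1)))
    return sorted(hits, key=lambda h: h[1])


motifs = {"W-box": "TTGAC[CT]"}      # illustrative pattern, an assumption
promoter = "AATTGACCGGTCAAGG"        # made-up sequence
hits = scan_motifs(promoter, motifs)
print(hits)
```

Palindromic motifs are reported once per strand by this sketch, so a downstream step may need to merge such duplicate hits.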
3.6. Mutation identification
A mutation alters the nucleotide sequence of a gene, which may change its expression. Mutations can be identified using conventional as well as NGS tools [41, 42]. Sequencing of the cytosine methylome (methylC-seq), transcriptome (RNA-seq), and small RNA transcriptome (small RNA-seq) in Arabidopsis was undertaken by deploying NGS tools. Genome-scale methylation patterns and a direct relationship between the location of sRNAs and DNA methylation were identified. Protein-protein interactions occur in the majority of cellular processes. The interactome, representing the complete set of all protein-protein connections, is vital for studying molecular networks. Correlated mutation analysis can be harnessed to predict interface residues, and protein-protein interactions can be studied by detecting correlated mutations at the interface.
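One simple way to quantify correlated mutations between two alignment columns is mutual information. The text does not name a specific statistic, so the choice of mutual information, the function name, and the toy columns below are assumptions made for this sketch.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns.

    Each column is a string holding the residue of every sequence at
    one alignment position; both columns must list the sequences in
    the same order.
    """
    n = len(col_x)
    px, py = Counter(col_x), Counter(col_y)
    pxy = Counter(zip(col_x, col_y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


# Toy columns: residues at three positions across four homologous sequences.
site_1 = "AAVV"
site_2 = "DDEE"  # changes whenever site_1 changes (covarying)
site_3 = "DEDE"  # changes independently of site_1

print(mutual_information(site_1, site_2))  # 1.0
print(mutual_information(site_1, site_3))  # 0.0
```

High-scoring column pairs, one from each protein family, are candidate interface contacts; in practice the raw score needs correction for phylogenetic bias and small sample size.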
|S. No.||Tool||URL||Description|
|1.||MODELLER||http://salilab.org/modeller/||Comparative modeling of protein 3D structures|
|2.||3DJigsaw||http://bmm.cancerresearchuk.org/~3djigsaw/||Predict structure and function of protein|
|3.||ESyPred3D||http://www.unamur.be/sciences/biologie/urbm/bioinfo/easypred/||Homology modeling with increased alignment performance|
|4.||SWISS-MODEL||http://swissmodel.expasy.org/||Automated protein homology modeling server|
|5.||YASARA||http://www.yasara.org/||Molecular modeling tool|
|6.||RaptorX||http://raptorx.uchicago.edu/||Protein structure prediction|
|7.||HHPred||http://toolkit.tuebingen.mpg.de/hhpred||Homology detection and structure prediction server|
|8.||Phyre2||http://www.sbg.bio.ic.ac.uk/~phyre2/html/page.cgi?id=index||3D structure prediction|
|9.||ROSETTA||http://boinc.bakerlab.org/resetta/||3D structure prediction|
|10.||I-TASSER||http://zhanglab.ccmb.med.umich.edu/I-TASSER/||Predict structure and function of protein|
|11.||Bhageerath||http://www.scfbio-iitd.res.in/bhageerath/index.jsp||Energy-based protein structure prediction server|
3.7. Protein structure prediction
Protein structure prediction is the inference of the structure of a protein from its amino acid sequence. Protein structure can be predicted by undertaking similarity searches, MSAs, secondary structure prediction, identification of domains, solvent accessibility prediction, protein fold recognition, building of 3D models, and model validation. For example, small heat shock proteins (smHSPs, largely present in plants) are ubiquitous in nature, and their size ranges from 17 to 30 kDa. These proteins are encoded by six nuclear gene families. Every gene family encodes a protein that is present in a different part of the cell, including the cytosol, mitochondria, chloroplast, and endoplasmic reticulum. These proteins protect plants from high temperature stress.
3.8. Phylogenetic analysis
Phylogenetic analysis is the study of evolutionary relationships among different organisms, which can be presented in a branching form. A clade is a set of descendants that evolved from a single common ancestor (Figure 2), and cladistics is a specific methodology for hypothesizing evolutionary relationships on this basis. In order to construct a phylogenetic tree, different methods are used depending on the nature of the data and the algorithms. Each method is based on certain assumptions. Thus, a method that works well for inferring evolutionary relationships from one kind of dataset may not be equally good for another kind. It is therefore suggested that a number of distance-based methods [unweighted pair group method arithmetic mean (UPGMA) and neighbor joining (NJ)] and character-based (CB) methods [maximum parsimony (MP) and maximum likelihood (ML)] should be run.
3.8.1. Distance-based method
The distance-based method, also called the phenetic method, depends upon the extent of dissimilarity (the distance) between aligned sequences to derive a tree. This method can reconstruct the true tree if all genetic divergence events are accurately recorded in the sequences. Tree construction is based on genetic distances derived from sequence data, distances from immunological studies, and Euclidean distances applied in various ways.
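As a sketch of how a distance is derived from two aligned sequences, the code below computes the proportion of differing sites (the p-distance) and then applies the Jukes-Cantor correction for multiple substitutions at the same site. The correction is not mentioned in the text; it is swapped in here as one common example, and the function names and sequences are assumptions.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, ignoring positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a nucleotide p-distance."""
    if p >= 0.75:
        return math.inf  # sequences are saturated; the distance is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAA"
p = p_distance(seq_a, seq_b)
print(p, jukes_cantor(p))
```

Computing this value for every pair of sequences yields the distance matrix that UPGMA and neighbor joining take as input.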
3.8.1.1. Unweighted pair group method arithmetic mean (UPGMA)
It is the simplest procedure for studying the phylogenetic relationship among different organisms; it uses a clustering approach and uncorrected data to make a tree. It joins tree branches based on the criterion of greatest similarity among pairs and averages of joined pairs. After each join, the distance matrix is recalculated, and the procedure is continued until the operational taxonomic units (OTUs, = neighbors) are grouped in one cluster. UPGMA generates a correct topology with true branch lengths only when the accumulation of mutations is proportional to time (a molecular clock) or approximately equal to raw sequence dissimilarity. However, these conditions are rarely met in practice, so the resulting tree may not reflect the true evolutionary descent.
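The clustering loop described above can be sketched in a few lines. This is a minimal illustration rather than a production implementation; the data layout (a dictionary keyed by `frozenset` pairs), the function name, and the toy distances are assumptions.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs with UPGMA.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the tree as nested tuples and the list of (cluster, height) merges.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        new = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Recalculate distances as size-weighted averages of the joined pair.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
        merges.append((new, height))
    return merges[-1][0], merges


raw = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(pair): value for pair, value in raw.items()}
tree, merges = upgma(["A", "B", "C", "D"], dist)
print(tree)
print([height for _, height in merges])
```

The toy distances are ultrametric, so the merge heights correspond to half the pairwise distances; with real data that violate the molecular clock the heights would be distorted.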
3.8.1.2. Neighbor joining method (NJ)
This method is commonly applied for building distance trees, irrespective of the optimization criterion. Beginning with a star-like tree, it works on the principle of finding the pair of operational taxonomic units (OTUs) that minimizes the total branch length at each stage of clustering. Branch lengths and the distance matrix are recalculated until a single terminal is left. This method can quickly provide the branch lengths in addition to the topology of a parsimonious tree. It is more efficient than UPGMA and can analyze large datasets. Its major drawbacks are that it constructs only one possible tree and that this tree may be biased.
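A single clustering step of neighbor joining can be sketched as follows, using the standard criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the sum of a row of the distance matrix. The function name and the five-taxon matrix are assumptions for illustration; a full implementation would repeat this step after replacing the joined pair by a new node.

```python
def nj_step(names, d):
    """Pick the pair of OTUs to join and the branch lengths to the new node.

    names: list of OTU labels (at least three).
    d:     nested dict with d[i][j] giving the distance between i and j.
    """
    n = len(names)
    r = {i: sum(d[i][k] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[i][j] - r[i] - r[j]

    pairs = [(i, j) for idx, i in enumerate(names) for j in names[idx + 1:]]
    i, j = min(pairs, key=q)
    # Branch lengths from i and j to the newly created internal node.
    length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return (i, j), q((i, j)), (length_i, length_j)


names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
d = {i: dict(zip(names, row)) for i, row in zip(names, rows)}
pair, q_value, lengths = nj_step(names, d)
print(pair, q_value, lengths)  # ('a', 'b') -50 (2.0, 3.0)
```

Note that the pair with the smallest raw distance (d and e, distance 3) is not the pair that is joined; the row sums in Q correct for unequal rates of change, which is what distinguishes this method from UPGMA.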
3.8.2. Character-based methods (CB)
These methods, also called cladistic methods, use the aligned characters, for instance, DNA or protein sequences, directly for tree inference. A character-based algorithm takes an aligned set of characters, for example, DNA sequences, and builds a tree relating the changes in discrete characters that are required to produce the observed set of characters. These methods assume that a set of sequences descended from a common ancestor and may change by mutation and/or selection without any kind of hybridization or horizontal gene transfer. Character-based algorithms fall into two groups: maximum likelihood and maximum parsimony.
3.8.2.1. Maximum likelihood (ML)
Statistical tools are exploited to assess hypotheses of evolutionary history. The method considers all possible trees of evolutionary history for the given data, starting from a multiple alignment. The probability of every possible topology is estimated for each data partition in order to identify the tree with the highest probability across all partitions. In this method, the whole sequence information is used to evaluate all the possible trees. As a result, the method is computationally demanding and does not handle large amounts of data well.
3.8.2.2. Maximum parsimony (MP)
This method follows the philosophy that “the simpler hypothesis is better than the complicated ones”. By this criterion, the MP tree is the one with the fewest character-state transformations needed to derive all the sequences from a common ancestor. It works by selecting trees that minimize the total tree length. For each site in the alignment, all possible trees are evaluated, which is not characteristic of other methods. This method depends less on assumptions about sequence evolution than the other tree-building strategies. The procedure is handicapped when the data are heterogeneous.
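The tree length that maximum parsimony minimizes can be computed for a given topology with Fitch's small-parsimony algorithm, sketched below. The text does not name this algorithm, so it is introduced here as one standard way to count character-state changes; the function names, tree encoding (nested 2-tuples), and sequences are assumptions.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one site.

    tree:   a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared state: one transformation is required at this node.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all alignment sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {"A": "AGT", "B": "AGT", "C": "CTT", "D": "CTT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(tree_length(tree_1, alignment), tree_length(tree_2, alignment))  # 2 4
```

Under the parsimony criterion the first topology is preferred because it explains the same data with fewer transformations.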
3.8.3. Evaluation of trees
Phylogenetic trees can be statistically evaluated for the reliability of the branches/clades created using (1) the skewness test, (2) bootstrapping analysis, and (3) likelihood ratio tests, all of which are now available as computerized algorithms. The skewness test does not estimate the reliability of a specific topology; it is sensitive to very small amounts of signal present in an otherwise random dataset. Bootstrapping analysis is a resampling methodology for tree evaluation that works with distance, likelihood, and parsimony methods. The outcome of a bootstrap analysis is a number associated with a specific branch of the phylogenetic tree, giving the proportion of bootstrap replicates that support the monophyly of that clade. Likelihood ratio tests are readily applicable to maximum likelihood (ML) analysis. The likelihood value is evaluated for significance against a normal distribution of error in the optimal model.
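The resampling idea behind bootstrapping can be sketched as follows. To keep the example short, a full tree is not rebuilt for each replicate; instead a simple stand-in criterion (sequence A is closer to B than to C) plays the role of "the clade is recovered". That simplification, the function names, and the sequences are assumptions.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def mismatches(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def a_groups_with_b(aln):
    """Stand-in for tree building: is A closer to B than to C?"""
    return mismatches(aln["A"], aln["B"]) < mismatches(aln["A"], aln["C"])


def support(alignment, grouping_holds, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which the grouping is recovered."""
    rng = random.Random(seed)
    hits = sum(
        grouping_holds(bootstrap_columns(alignment, rng)) for _ in range(replicates)
    )
    return hits / replicates


alignment = {"A": "ACGTACGTAC", "B": "ACGTACGAAC", "C": "TCGAACTTAG"}
print(support(alignment, a_groups_with_b))
```

In a real analysis the stand-in criterion would be replaced by inferring a tree from each replicate and checking whether the clade of interest appears in it.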
3.8.4. Software mostly used for phylogenetic analysis
The phylogeny inference package (PHYLIP) contains 30 programs that cover the main aspects of phylogenetic analysis. It is freely available software and runs on almost all kinds of computer platforms (Mac, UNIX, DOS, etc.). In addition, the phylogenetic analysis using parsimony (PAUP) software is widely used to infer and interpret evolutionary trees. The old version has been upgraded (PAUP*) with the inclusion of maximum likelihood and distance methods. Other than those described above, some phylogenetic programs have unique capabilities but are mostly limited in scope and portability. These include molecular phylogenetics (MOLPHY), TREE-PUZZLE, FastDNAml, and MacClade.
3.9. Molecular dynamics simulations for plant molecules
Molecular dynamics simulations are the principal methods for elucidating the physical basis of the structure, function, and interaction of biological macromolecules (e.g., proteins and nucleic acids). Earlier, proteins were considered comparatively rigid structures; this view has been replaced by a dynamic model in which internal movements and conformational changes are key players in determining their functions. Computer simulations are carried out to understand the characteristics and arrangements of different molecules in terms of physical structure and interactions that cannot be observed by other means. There are two major classes of simulation techniques, i.e., molecular dynamics and Monte Carlo. These simulations have been used extensively in characterizing plant compounds (natural distillates), followed by finding optical counterparts with identical efficiency.
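At its core, a molecular dynamics simulation integrates Newton's equations of motion in small time steps. The toy sketch below applies the velocity Verlet integrator, a scheme commonly used for this purpose, to a single particle on a harmonic spring. The one-dimensional system, the parameter values, and the function name are assumptions; a real biomolecular simulation uses a full force field and many interacting atoms.

```python
import math


def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate one particle in one dimension; returns a list of (x, v)."""
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt   # update the position
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # update the velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


k, m = 1.0, 1.0  # spring constant and mass in arbitrary units (assumed)
trajectory = velocity_verlet(1.0, 0.0, lambda x: -k * x, m, dt=0.01, steps=1000)
energy = [0.5 * m * v * v + 0.5 * k * x * x for x, v in trajectory]

# The exact solution is x(t) = cos(t); compare at t = 10.
print(trajectory[-1][0], math.cos(10.0))
print(min(energy), max(energy))
```

The total energy stays close to its initial value of 0.5 throughout the run, which is the property that makes this kind of integrator suitable for long simulations.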
3.10. Proteomics and transcriptomics
The study of proteins and of mRNA transcripts is referred to as proteomics and transcriptomics, respectively. Owing to the intrinsic complexity of experimental workflows and the variety of data types, the storage and open deposition of mass spectrometry (MS)-based proteomics data are still insufficiently established. Many public resources with particular purposes for MS proteomics research have been established to fulfill this need. These databases include the Global Proteome Machine Database (GPMDB), PRIDE, PeptideAtlas, ProteomicsDB, Mass Spectrometry Interactive Virtual Environment (MassIVE), PeptideAtlas SRM Experiment Library (PASSEL), etc. Moreover, to enhance the integration and harmonized sharing of public repositories, the ProteomeXchange consortium has been developed recently for the benefit of the scientific community.
For transcriptomics studies, there are numerous freely available databases comprising microarray data: NASCArrays, ArrayExpress, Genevestigator, Stanford Microarray Database, and the Gene Expression Omnibus. An example of a transcriptome database is the Chickpea Transcriptome Database (CTDB), which has information about the tools used for transcriptome sequencing, conserved domain(s), molecular markers, transcription factor families, and complete gene expression information.
3.11. Protein-protein interactions
Protein-protein interactions (PPIs) govern an extensive range of biological processes, including interactions between cells and metabolic as well as developmental pathways. This noncovalent bonding gives rise to a range of interactions and associations between proteins. PPIs can be classified in several ways depending upon their contrasting structural and functional characteristics. There are several in vivo and in vitro methods for finding PPIs, but our focus is on computational approaches. Computer modeling assisted by mathematical methods facilitates the study of different processes, and in silico methods based on computational modeling are being used to study protein interactions. The in silico analysis integrates multiple data types, including gene coexpression, colocalization, functional category, and the occurrence of orthologs or interologs, to derive a global network in a species. A number of web servers can be used to predict protein-protein interactions (Table 4).
|S. No.||Web server||Description||URL|
|1.||Coev2Net||Coev2Net is a general framework to predict, assess, and boost confidence in individual interactions inferred from a high-throughput experiment|||
|2.||InterPreTS||InterPreTS uses tertiary structure to predict interactions||http://gabrmn.uab.es/interpret/|
|3.||PrePPI||PrePPI predicts protein interactions using both structural and nonstructural information||http://technology.sbkb.org/portal/page/350/|
|4.||iWRAP||iWRAP is a threading-based method to predict protein interaction from protein sequences||http://groups.csail.mit.edu/cb/iwrap/|
|5.||PoiNet||PoiNet provides PPI filtering and network topology from different databases||http://poinet.bioinformatics.tw/|
|6.||PreSPI||PreSPI predicts protein interactions using a combination of domains||http://code.google.com/p/prespi/|
|7.||PIPE2||PIPE2 queries the protein interactions between two proteins based on specificity and sensitivity||http://cgmlab.carleton.ca/PIPE2|
|8.||HomoMINT||HomoMINT predicts interaction in human based on ortholog information in model organisms||http://mint.bio.uniroma2.it/HomoMINT|
|9.||SPPS||SPPS searches protein partners of a source protein in|
|10.||InPrePPI||InPrePPI predicts protein interactions in prokaryotes based on genomic context||http://inpreppi.biosino.org/InPrePPI/index.jsp|
|11.||STRING||STRING database includes protein interactions containing both physical and functional associations|||
|12.||MirrorTree||The MirrorTree allows graphical and interactive study of the coevolution of two protein families and assess their interactions in a taxonomic context||http://csbg.cnb.csic.es/mtserver/|
|13.||TSEMA||TSEMA predicts the interaction between two families of proteins based on Monte Carlo approach||http://tsema.bioinfo.cnio.es/|
|14.||COG||COG shows phylogenetic classification of proteins encoded in genomes||http://www.ncbi.nlm.nih.gov/COG/|
3.11.1. Arabidopsis protein interaction analysis
More than 10 freely accessible protein interaction databases are available for A. thaliana. An integrated bioinformatics web tool, ANAP (Arabidopsis Network Analysis Pipeline), has been created for combining Arabidopsis protein interaction databases. A total of 11 Arabidopsis protein interaction databases, comprising 201,699 protein interaction pairs, 15,208 identifiers, 89 interaction detection methods, 73 species that interact with Arabidopsis, and 6161 references, were incorporated in ANAP.
3.11.2. Computational identification of protein-protein interactions in rice
The molecular complexity of plants has always hindered large-scale exploration of protein-protein interaction networks. Using the Predicted Rice Interactome Network (PRIN), a total of 5049 proteins with 76,585 interactions were predicted in rice. The expanded molecular network in PRIN has greatly improved the ability to analyze the function and organization of genes and gene networks.
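A predicted interactome of this kind can be treated as an undirected graph in which proteins are nodes and interactions are edges. The following is a minimal sketch in Python of that idea; it is not how PRIN itself is implemented, and the protein identifiers and the function names `build_network` and `degree_table` are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

def build_network(pairs):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    network = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        network[a].add(b)
        network[b].add(a)
    return dict(network)

def degree_table(network):
    """Return (protein, partner_count) tuples, most connected first."""
    return sorted(
        ((protein, len(partners)) for protein, partners in network.items()),
        key=lambda item: (-item[1], item[0]),
    )

# Hypothetical interaction pairs; the identifiers are not real database records.
example_pairs = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P4", "P5"),
    ("P2", "P1"),  # the same interaction reported in reverse order
]

network = build_network(example_pairs)
ranking = degree_table(network)
unique_interactions = sum(count for _, count in ranking) // 2
```

Storing partners in sets means that an interaction reported twice, or in both orientations, is counted once, which matters when pairs are merged from several source databases.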
3.11.3. iPlants: the world’s plant online
This database has been designed to develop a comprehensive working list of the scientific names of all plant species. Through it, the accepted name of each plant species (as agreed by the scientific community) can be found together with its synonyms. Such a list will enable people without botanical training to obtain useful information about different plant species, and iPlants will also help resolve existing confusion in the published taxonomies. The database covers a total of 422,000 known plant species and the 1,500,000–1,700,000 scientific names used to refer to them.
This database will help in exploiting plant biodiversity information in different breeding and gene cloning programs.
The Reactome database provides unrestricted access to peer-reviewed pathways. It is equipped with bioinformatics tools that can be used to examine, visualize, interpret, and analyze pathway knowledge. The information in this database is generated by experts (curators and software developers) and cross-referenced to other databases, for example, NCBI, Ensembl, UniProt, UCSC Genome Browser, HapMap, KEGG, ChEBI, PubMed, and GO. Orthologous reactions for over 20 nonhuman species, including rice, Arabidopsis, and Escherichia coli, can be found in it. The database can be accessed in the form of an online textbook. Biological pathways and reactions can be viewed in a number of formats, including PDF, SBML, and BioPAX. A recent version of Reactome, “v55,” was released in December 2015.
The study of all or most of the metabolites in an organism is termed metabolomics. It is a complex research field that involves interdisciplinary interaction between different sciences. One of the numerous methods used to analyze metabolomics data is soft independent modeling of class analogy (SIMCA). Besides this, an effective protocol for data mining in metabolomics has also been developed. In recent years, numerous databases containing data on compound names and structures, mass spectra, metabolic pathways, metabolite profiles, and statistical/mathematical models have been established. These databases are extremely useful for metabolomics research.
MeRy-B (http://bit.ly/meryb) is dedicated to plants and provides information on metabolites detected using nuclear magnetic resonance (NMR), together with the related analytical and experimental metadata. MeRy-B holds a list of many plant metabolites along with data on the experimental conditions, the features studied, and metabolite concentrations for 19 different species, including model plant species such as Arabidopsis.
4. Implications of bioinformatics in plant omics
Bioinformatics is an essential part of omics, providing techniques to analyze large biological data sets and translate them into “omics” applications. Omics tools generate massive data that help systems biology combine multivariate information into systems and models. These tools, including high-throughput genome-scale genotyping platforms such as whole-genome resequencing, proteomics, and metabolomics, offer better prospects for gene identification and for exploring molecular mechanisms. This information can be used to develop ideal genotypes suited to varying climatic conditions.
4.1. Plant genome sequencing
With advances in high-throughput techniques, whole-genome characterization of a wide range of organisms has become possible. Nevertheless, storing and managing these massive genomic data is a major challenge. The revolution in sequencing technologies has made it possible to sequence large and complex genomes at very low cost and in much less time. Presently, the most popular methods of genome sequencing are shotgun sequencing and NGS. NGS is a very popular tool for the identification of housekeeping genes in crop plants. Many NGS platforms, such as the Genome Analyzer, the Applied Biosystems SOLiD System, Roche/454 FLX, and Illumina/Solexa, are commercially available. NGS can be utilized for whole-genome sequencing, identification of transcription factor binding sites, expression analysis of noncoding RNA, and targeted resequencing. Various software packages are available to assemble sequences, for example, Phred/Phrap/Consed, GAP4, and chromaseq. Another software package, AMOS, was developed by TIGR and is useful for comparative genome assembly.
4.2. Plant whole-genome resequencing
Whole-genome resequencing is the most effective approach in functional genomics. To reduce cost, only a target region can be sequenced. A common way of selecting target regions for sequencing uses microarrays and is based on hybridization to arrays of synthetic oligonucleotides that match the target DNA sequence. Recent NGS technology has made it possible to discover differences between individuals and populations, especially in crop species whose genomes have already been sequenced and assembled. Such projects in Arabidopsis and rice have generated a large body of data on the natural variation occurring among different accessions.
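The idea of discovering differences between accessions can be illustrated with a deliberately simplified sketch in Python. It assumes that the sequences have already been aligned to the reference and have equal length, which real resequencing pipelines achieve through read mapping and statistical variant calling; the function name `list_variant_sites`, the accession labels, and the sequences are hypothetical.

```python
def list_variant_sites(reference, accessions):
    """List sites where at least one accession differs from the reference.

    Returns (position, reference_base, {accession: alternative_base}) tuples
    with 1-based positions. Gaps ("-") and unknown bases ("N") are skipped.
    """
    for name, sequence in accessions.items():
        if len(sequence) != len(reference):
            raise ValueError(f"{name} is not aligned to the reference length")

    sites = []
    for index, ref_base in enumerate(reference):
        alternatives = {}
        for name, sequence in accessions.items():
            base = sequence[index]
            if base != ref_base and base not in "-N":
                alternatives[name] = base
        if alternatives:
            sites.append((index + 1, ref_base, alternatives))
    return sites

# Hypothetical pre-aligned sequences, for illustration only.
reference = "ACGTACGT"
accessions = {
    "acc1": "ACGTACGT",
    "acc2": "ACCTACGT",
    "acc3": "ACCTACGA",
}

variant_sites = list_variant_sites(reference, accessions)
```

A site shared by several accessions, such as position 3 in this example, is reported once with all carriers, which is the form in which natural variation is usually summarized across a panel.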
4.3. Plant comparative genomics and databases
Using comparative genomic approaches, functions have been assigned to different genes, especially those of less studied species. Developments in RNA interference and other technologies such as mutagenesis have allowed phenotypic screens for genes, an approach known as phenomics. The field of phenomics depends heavily on the interaction of the plant genome with the prevailing environment, and on intensive collaboration between three disciplines: plant science, computer science, and engineering. Currently, there are annual plant-focused image-processing challenges that have stimulated the community and encouraged computer scientists to focus on developing shared plant datasets, although access to high-throughput phenotyping platforms remains limited. A current list of accessible image datasets is available online.
4.4. Important information source of plant species
The most widely used unified information source is TAIR, which maintains molecular biology, genetic, and genomic data for Arabidopsis. Similarly, the Salk Institute Genomic Analysis Laboratory (SIGnAL) deals with omics research in Arabidopsis.
Gramene is an integrated source of information for grasses. It uses the rice genome sequence as a foundation for comparison with other members of the grass family. The website offers information on DNA and mRNA sequences, genome assembly and annotations, genes, genetic and physical maps, QTLs, and more. These features make the website attractive to researchers, and it is updated regularly with new attributes such as genetic diversity data and comparisons of the Oryza sativa genome with those of its wild relatives or other taxa for evolutionary studies.
The portal site SoyBase provides information about whole-genome sequence data. The portal site for Solanaceae genomes is the Sol Genomics Network, which also provides information about the tomato genome sequencing project. MaizeGDB is a public database for Zea mays. GreenPhylDB is a broad platform intended to facilitate comparative functional genomics in the O. sativa and A. thaliana genomes. PLAZA 3.0 was established to make comparative genomics data for plants accessible through a user-friendly web interface. Structural and functional annotation, phylogenetic trees, protein domains, gene families, and detailed data on genome organization can easily be queried and visualized. A comparative genomics database named PIECE was established to hold information on gene structure comparison and evolution. This database covers all the annotated genes mined from 25 plant species.
4.5. Use of bioinformatics for comparative genomics in plants
The availability of whole-genome sequences and bioinformatics tools has accelerated the identification of specific gene families in different plant species. These tools have also been used to study duplications and deletions in different plant genomes. The results are helpful in phylogenetic studies, in the study of synteny and collinearity relationships, and in inferring the shared ancestry of genes. The plant genome duplication database (PGDD) provides important data for studying intragenome and cross-genome syntenic relationships in species with sequenced genomes. Analysis of orthologous clusters at the genome level is a significant element of comparative genomics. Recognizing the overlap between orthologous clusters helps clarify the function and evolution of proteins across multiple species. OrthoVenn is a freely accessible web platform for comparing and annotating orthologous gene clusters. Information regarding orthologs of plants and green algae can be searched at PlantOrDB.
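The overlap between orthologous clusters can be expressed as the regions of a Venn diagram: each cluster is assigned to the exact set of species in which it occurs. The Python sketch below illustrates this bookkeeping only; it is not the OrthoVenn implementation, and the cluster identifiers, species assignments, and the function name `venn_regions` are hypothetical.

```python
from collections import defaultdict

def venn_regions(clusters_by_species):
    """Group cluster IDs by the exact set of species that contain them."""
    membership = defaultdict(set)
    for species, clusters in clusters_by_species.items():
        for cluster_id in clusters:
            membership[cluster_id].add(species)

    regions = defaultdict(set)
    for cluster_id, species_set in membership.items():
        regions[frozenset(species_set)].add(cluster_id)
    return dict(regions)

# Hypothetical cluster assignments, for illustration only.
example = {
    "rice": {"C1", "C2", "C3"},
    "arabidopsis": {"C1", "C3", "C4"},
    "maize": {"C1", "C2", "C5"},
}

regions = venn_regions(example)
core_clusters = regions[frozenset(example)]          # present in every species
rice_specific = regions.get(frozenset({"rice"}), set())
```

Clusters shared by all species (the core) are candidates for conserved functions, whereas species-specific regions point to lineage-specific gene families; in this toy example rice has no cluster of its own.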
4.6. Gene prediction and genome annotation
The characterization of introns and exons in a sequenced genome is referred to as gene prediction. These predictions can be made computationally or by combining manual and computational annotation. Numerous computer programs for finding protein-coding genes are accessible through the OMICtools website and have been used extensively for genome annotation and gene prediction.
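The simplest computational signal used in gene finding is the open reading frame (ORF): a start codon followed in the same frame by a stop codon. The Python sketch below scans the three forward frames for such ORFs. It is a teaching simplification, not the method of any program named in this chapter: it ignores introns, splice sites, and the reverse strand, and the function name `find_orfs` and the `min_codons` threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    Coordinates are 0-based and end-exclusive, and the ORF includes
    both the ATG start codon and the stop codon.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Hypothetical toy sequence: ATG AAA TTT TAG embedded at offset 2.
toy_sequence = "CCATGAAATTTTAGGG"
toy_orfs = find_orfs(toy_sequence)
```

Eukaryotic gene predictors extend this idea with statistical models of exons, introns, and splice signals, which is why dedicated programs rather than a plain ORF scan are used for plant genomes.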
For structural annotation of a genome, a number of software packages have been described [102, 103]. Additionally, genome comparison tools (SynBrowse and VISTA) can be used to improve the precision of gene identification. RepeatMasker was designed to find interspersed repeats and low-complexity sequences in a whole sequenced genome; the program can also mask the repetitive sequences. Similarly, a number of other programs (Repeat Finder, RECON, etc.) are available for finding repeats in a sequenced genome.
4.7. Genome mapping and bioinformatics
Selecting a suitable mapping tool and sequence search may require adjusting the specificity and sensitivity of the search statistics. The search for candidate genes conferring traits can be accelerated in crops for which genetic and physical maps and annotated genome assemblies are available. A wide range of tools have been developed recently for displaying maps and visualizing genomes, primarily to facilitate genome assembly.
NCBI is a source for all types of genome information. Its various biological databases can be accessed using “Entrez.” A genome browser, “Map Viewer,” has been developed for aligned genetic, physical, and sequence information of eukaryotes, including plants. A special plant query page can be used to display aligned maps from the various species entered in Map Viewer. A customized plant basic local alignment search tool (Plant BLAST) facilitates sequence similarity searches against the collection of mapped plant sequence data, and the resulting alignments can be visualized in their genomic context using Map Viewer. R/qtl, JoinMap, OneMap, MSTMap, Lep-MAP, and HighMap can be used to develop genetic linkage maps.
Numerous databases offer data for exploring markers in multiple crop species. DNA markers, including single nucleotide polymorphism (SNP), simple sequence repeat (SSR), and conserved ortholog set (COS) markers, can be predicted using PlantMarkers. A well-known site for Triticeae genomes is GrainGenes, which contains information about linkage maps and DNA markers of wheat, rye, barley, and oat. Gramene, a database for comparative genomics, contains genetic maps of multiple plant species. The Triticeae Mapped EST database (TriMEDB) provides information on mapped cDNA markers related to barley and wheat. The CottonGen web-based database provides open access to genetic, genomic, and breeding data for cotton. Compared with CottonDB and the Cotton Marker Database, CottonGen has improved tools for sharing, mining, retrieving, and visualizing data.
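Predicting SSR markers amounts to locating short motifs that are repeated in tandem. The Python sketch below does this with a regular expression. It is not the PlantMarkers algorithm; the function name `find_ssrs`, the motif length range, and the repeat threshold are assumptions chosen for illustration, and the sequences are hypothetical.

```python
import re

def find_ssrs(sequence, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, motif, repeat_count) for tandem repeats in a DNA string.

    The motif is matched lazily so that the shortest repeating unit is
    reported (AT repeated, rather than ATAT repeated half as often).
    Runs of a single base are skipped in this sketch.
    """
    seq = sequence.upper()
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # homopolymer run, not treated as an SSR here
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits

# Hypothetical toy sequence containing (AT)5 flanked by other bases.
toy_hits = find_ssrs("GGATATATATATCC")
```

In practice, the flanking sequence on each side of a hit would then be used to design primers, so that length differences in the repeat between genotypes can be scored as a marker.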
In this chapter, we comprehensively described the available bioinformatics resources and tools pertaining to gene expression, databases, protein and metabolite analyses, and genome sequencing. Bioinformatics has evolved rapidly over the last 15 years and has emerged as a new key discipline of biology. Next-generation sequencing technologies have generated a huge amount of genetic and genomic data. However, extracting useful genetic information is hampered by a shortage of skilled bioinformaticians. Several problems in bioinformatics remain unsolved, such as automated data mining, robust inference of phenotype from genotype, and the training of students and established researchers in bioinformatics. Bioinformatics is generating job opportunities for talented and skilled researchers in biology, statistics, and computer science. Its remarkable evolution has been accompanied by a number of disruptive revolutions in science and technology, and the field has developed almost beyond recognition. Today, bioinformatics is a luxury for biological scientists, who are generating huge volumes of data in all fields of the biological sciences. In the near future, it will be an indispensable part of plant research, and novel tools and methods will be adopted by every plant scientist. The next half century will be the era of “data integration.” Both basic and applied research will serve society by providing renewable energy, reducing world hunger and poverty, and protecting the environment.