Computational tools in the study of the transcriptome.
This chapter deals with the topic of bioinformatics, computational, mathematics, and statistics tools applied to biology, essential for the analysis and characterization of biological molecules, in particular proteins, which play an important role in all cellular and evolutionary processes of the organisms. In recent decades, with the next generation sequencing technologies and bioinformatics, it has facilitated the collection and analysis of a large amount of genomic, transcriptomic, proteomic, and metabolomic data from different organisms that have allowed predictions on the regulation of expression, transcription, translation, structure, and mechanisms of action of proteins as well as homology, mutations, and evolutionary processes that generate structural and functional changes over time. Although the information in the databases is greater every day, all bioinformatics tools continue to be constantly modified to improve performance that leads to more accurate predictions regarding protein functionality, which is why bioinformatics research remains a great challenge.
- computational biology
- functional genomics
The study to understand the functioning of the cell, as well as the molecules and processes that are carried out within it, originated the use of various disciplines and sciences to facilitate the progress in research for its characterization over time. In the 1950s, the sequencing of small biological molecules began, and in 1956, the sequencing of the first protein was achieved. Thus, Margaret O. Dyhoff determined that bovine insulin is a small peptide of 51 amino acids. With these advances and the constant production of biological information, there was a need to collect and organize all the information generated from these sequencing projects . In 1965, the first biological sequence database was created, in which all the DNA and protein sequences described up to that time were stored and made available to the scientific community. Eight years later, the oldest known database was created, which is still in force today, Protein Data Bank (PDB) .
In the 80s, bioinformatics had already gained a new meaning in scientific research, so several research groups such as Theoretical Biology and Biophysics Group attached to the American Institute The Alamos National Laboratory, together with Stanford University, gave rise to the best-known database in the world called GenBank. Almost at the same time, in 1981, Temple Smith and Michael Waterman extensively reviewed the mathematical algorithms for comparing biological sequences. As a result of their analysis, they generated the well-known local alignment algorithm that allowed to optimize the comparison of biological sequences, being the most important contribution for the direct comparison of sequences and cornerstone of the alignment by sequence pair .
A few years after the creation of GenBank, its European and Asian versions were generated, known as the EMBL database (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan) in 1981 and 1984, respectively. In 1985 the FASTA algorithm (FAST-AII) of sequence comparisons was reported, which operated as a search engine for similar sequences within the GenBank . During the years from 1987 to 1990, databases for protein sequences were propelled which resulted in the creation of Swiss-Prot and PIR (Protein Information Resource). In 1990, another of the most important milestones in bioinformatics originated the BLAST algorithm (Basic Local Alignment Tool) that completely revolutionized the exploration and search of biological sequences in databases .
The National Center for Biotechnology Information (NCBI) makes the following definition:
Bioinformatics is a field of science in which various disciplines such as applied mathematics, statistics, artificial intelligence, chemistry, biochemistry, computing and information technology converge, whose objective is to facilitate the discovery of new biological ideas, as well as create global perspectives from which unifying principles in biology can be discerned .
It consists of two complementary subfields with each other:
The development of computer tools and databases.
The application of these in the generation of biological knowledge to better understand living systems .
According to the National Institute of Health of the United States, bioinformatics or also called computational biology, deals with the development and application of analytical data and theoretical methods, mathematical modeling and computer simulation techniques to study biological, behavioral and social systems . The programs use public or private databases (with restricted access or with economic value) that have been created with information that is constantly growing and managed by institutions from various sectors. The main databases used in computational biology are described below:
1.1 Biological databases
Primary databases contain original biological data. They are raw sequence files or structural data (for example, GenBankm y Protein Data Bank) .
Secondary databases contain information processed computationally based on primary data. Translated protein sequence databases contain the functional annotation belonging to this category (for example, Swiss-Prot and PIR) .
Specialized databases are those that serve a particular research interest (for example, Flybase). The HIV sequence database and Ribosomal Database Project are examples of databases that specialize in a particular organism or a certain type of data. Many of the problems detected in scientific research lie in the need to connect secondary and specialized databases to primary databases. It is desirable that entries in a database be cross-referenced or linked to related entries in other databases that contain additional information .
There are primary databases, which contain direct information on the sequence, structure or pattern of DNA or protein expression, and secondary, which contains data derived from primary databases, such as mutations, evolutionary relationships, grouping by families or by functions, involvement in diseases, etc.
1.2 Databases for protein analysis (amino acid sequence databases)
Swiss-Prot: It contains annotated or commented sequences, that is, each sequence has been reviewed, documented and linked to other databases. External link: Swiss-Prot in the EBI (
TrEMBL: Translation of EMBL Nucleotide Sequence Database includes the translation of all coding sequences derived from (EMBL-BANK) and which have not yet been annotated in Swiss-Prot. External link: TrEMBL (
PIR: Protein Information Resource is divided into four sub-bases that have a decreasing annotation level. External link: PIR (
ENZYME: It links the complete enzyme activity classification to the Swiss-Prot sequences. External link: ENZYME (
PROSITE: It contains information on the secondary structure of proteins, families, domains, etc. External link: PROSITE (
INTERPRO: It integrates information from various secondary structure databases such as PROSITE, providing links to other databases and more extensive information. External link: INTERPRO (
PDB: Protein Data Bank is the 3-D tertiary structure database of proteins that have been crystallized. External link: PDB (
1.3 Data warehouse
A Data Warehouse (DW) is a set of integrated data oriented to a subject, which vary over time and are not transitory, which support the decision-making process of the administration . From the review of the bioinformatics projects it is found that the requirements of this field require the storage of large volumes of data, with multiple dimensions, of extended periods of time and with heterogeneous formats as well as their sources. For example, Ligand Depot is an integrated data source for finding information about small molecules, proteins and nucleic acids. It focuses on providing chemical and structural information for small molecules. Accepts keyword-based queries, also provides a graphical interface for conducting chemical substructure searches, and allows access to a wide variety of web resources .
1.4 Data mining in bioinformatics
Data mining is oriented towards the study of techniques to extract valuable information from a large amount of biological data. For this, efficient software tools are necessary to recover data, compare biological sequences, discover patterns and visualize the discovery of knowledge .
Among the most common data mining techniques in bioinformatics can be highlighted :
KDD is the complete process of extracting knowledge, not trivial, previously unknown and potentially useful from a data set.
KDT is oriented to the extraction of knowledge from data (unstructured in natural language) stored in textual databases, is identified with the discovery of knowledge in the texts.
1.5 Applications of bioinformatics
The areas in which bioinformatics is currently developed are many and varied, ranging from simple tasks such as direct acquisition of data from DNA or protein sequencing assays (when techniques such as mass spectrophotometry are used), until the development of software for the storage and analysis of the data, which implies in many cases, the generation of algorithms that require both mathematical and biological knowledge. Within the areas in which bioinformatics takes place are genomics, proteomics, pharmacogenetics and phylogeny. The plant genome databases and gene expression analysis of this profile have played an important role in the development of new crop varieties that have higher productivity and more disease resistance .
Specifically, bioinformatics encompasses the development of databases or knowledge to store and retrieve biological data, algorithms to analyze and determine their relationships with biological data, and the statistical tools to identify and interpret data sets. The following describes in detail what refers to metabolomics, transcriptomics, proteomics, comparative genomics, functional genomics, phylogeny and protein modeling.
2. Metabolomic data analysis
The metabolomics was originally proposed as a tool of functional genomics, but its use has been extended much more, as it has had great advances like other omics sciences, such as transcriptomics and proteomics; because the metabolomic work is determined by physical-chemical characteristics of organic molecules unlike the genes, mRNA and proteins that come from a specific sequence, so the success of the characterization of these biopolymers is thanks to bioinformatics technology and tools that help sequence characterization . Its objective is to detect, quantify and interpret the overall analysis of all metabolites; these studies are used in various areas and, like proteomics, one of its main contributions is biomarkers, helping to identify metabolites that are correlated with diseases and environmental exposures . Metabolites are chemical entities that do not come from a transfer of information within the cell, coupled with this, they are also characterized by being diverse as they are substrates and metabolism products that drive essential cellular functions, such as energy production and storage, signal transduction and cell apoptosis; in this great diversity of chemical structures we find endogenous and exogenous metabolites, the former are produced naturally by an organism and the latter come from interaction with the outside. The great diversity of molecules reflects in a wide range of polarities, molecular weights, functional groups, stability and chemical reactivity, etc. [12, 13].
Among the first reports of metabolite detection are those where mass spectrometry (MS) was used to separate a wide range of metabolites present in urine and tissue extracts . In addition, multicomponent analyzes were described to obtain the metabolic profile for three types of urinary constituents: steroids, acids, drugs and drug metabolism . On the other hand, there are reports where physical, chemical or psychological changes can cause biological responses such as oxidative stress and inflammation; among the biomarkers that are the result of a chemical reaction are lipoperoxides or oxidized proteins that are the result of the reaction of molecules with reactive oxygen species (ROS) and those that represent the biological response to stress, such as the transcription factor NRF2 or inflammation and inflammatory cytokines . Among the best known and clinically used examples we find glucose as a marker of diabetes  and phenylalanine as a marker of congenital metabolic disorder .
Because metabolites play important roles in the biological pathways; its differential flow or regulation can reveal new knowledge about diseases and environmental influences, so one of the most important objectives of the metabolic analysis has been to assign the identity of the metabolite within a metabolic pathway [19, 20]; generating a large amount of data; requiring for its processing an arduous mathematical, statistical and bioinformatic work [12, 21, 22], this last area is crucial for the development of metabolomics as it helps in the handling of data and information, analytical data processing, metabolomic standards, ontology, statistical analysis, mining and data integration, and mathematical modeling of metabolomic networks with antecedents of biological systems , it is also necessary to decide which metabolites are biologically more significant. This can be achieved by helping the identification process, reducing the redundancy of characteristics, presenting better candidates for the MS, accelerating or automating the workflow, recovering data through characteristics through meta-analysis or multigroup analysis, or using stable isotopes and mapping of pathways. For all the above, in recent years, the technologies for analyzing metabolites have undergone improvements, establishing more efficient protocols for experimental design, as well as better sample extraction techniques and data acquisition that have been worthwhile in providing sets of complex and solid data .
The database management system for metabolomics requires the collection of raw and processed metadata, some important aspects for comparing data and obtaining results in different laboratories and reproducing experimental conditions are: The nature and treatment of samples prior to study. Among the bases and tools for the analysis and visualization of available data are: Kyoto Encyclopedia of Genes and Genomes (KEGG;
3. Transcriptome data analysis
The genes response to intracellular or extracellular stimuli includes a hierarchy of signals that allows genes encoded in the DNA to be expressed or repressed by the transcription process. The total set of transcripts (RNA molecules) produced by a cell under a given condition and time, is defined as a transcriptome . Unlike the genome, the transcriptome is highly dynamic and actively changes as a consequence of factors that influence the stage of development of organisms, as well as the surrounding environmental conditions. In this sense, transcriptomics is an essential tool to interpret the functional elements of the genome, having as object of study, all species of transcripts, messenger RNA, non-coding RNA and small RNAs . Its main purpose being to determine transcriptional structure of genes, that is, where a gene begins and ends (start sites 5′ and 3′ end), posttranscriptional modifications, splicing patterns and differential expression analysis . The RNA molecules synthesized by a cell have a specific function in a given cellular process, the transcripts include: (a) messenger RNA (mRNA) that is the intermediary between the gene information and the proteome. In this way, the amount of mRNA molecules makes it possible to elucidate expression patterns and in turn correlate the abundance of mRNA molecules with changes in protein abundance ; (b) non-coding RNA (cRNA) that is responsible for the regulation of gene expression . Determining where, how and when a transcript is generated is essential to know the biological activity of a gene . Analyzing the transcripts that coexist at any given time gives us global information on the cellular state under a certain condition, which has allowed us to establish patterns of gene regulation coordinated with the consequent identification of promoter elements common to several genes .
3.1 RNA study technologies and tools in bioinformatic analysis
The RNA study approach has changed from the sequencing of the first determined RNA molecule, to the sequencing of the transcriptome using new generation technologies . Northern blot is a technique based on hybridization and radioactive labeling, cDNA microarrays (complementary DNA obtained from mRNA) and cDNA-AFLP tools widely used in studies of expression levels and serial analysis of gene expression (SAGE), at the time they provided relevant information, being Microarrays widely used today [31, 32, 33, 34, 35]. However, these techniques require prior knowledge of the genome, have low coverage and are based on hybridization, in this sense the abundance of transcripts is inferred by the intensity of hybridization and the results obtained are noisy, which directly interferes with the reproducibility of the results, besides being insufficient techniques to detect new transcripts .
The growing importance of DNA sequencing in model organisms, as well as in the quest to understand the dogma of biology, the NGS technologies (Next Generation Sequencing) arise, which have high yields in the treatment of the sample, are reproducible and highly reliable, as well as accessible and economical, to the point of being more profitable than sequencing by SANGER. These next-generation technologies are based on sequencing by synthesis (SBS) known as pyrosequencing, the transcriptomic variant of pyrosequencing technology is known as short-reading massive parallel sequencing (RNA-seq). The availability of this technology has revolutionized the approach of transcriptome study, having commercially available Roche/454; Applied Biosystems SOLID; HeliScope e Illumina .
From the first RNA studies based on sequencing by SANGER to NGS technologies, bioinformatics has been a key tool in the analysis process. Initially the differential expression based on the analysis by Microarrays presented its own computational challenges , currently while the reads are shorter than those created by sequencing by SANGER, NGS has a higher performance and generates data set of up to 50 gigabases per run , this requires algorithms capable of processing this amount of data in the shortest time possible and with a high degree of reliability.
The study of the transcriptome by RNAseq involves different stages ranging from RNA extraction, library construction, sequencing and data analysis. In this last step four main stages are distinguished (a) Quality analysis of the reads, this allows to determine possible problems in the reads. FastQC is a next-generation data quality control tool, which reports graphs and tables providing quality information based on the reads (per base sequence quality); check the quality of subsets of reads (per sequence quality scores); it also shows the proportion of each nucleotide base of the DNA in each base of the reads (per base sequence content); presents the average GC content in the reads and compares that content with the normal distribution (per sequence GC content); shows the proportion of N, that is, unknown nucleotide observed in each reading position (per base N content); shows the size distribution of reads (sequence length distribution); detects adapters in the reads (adapter content); detects possible sequencing problems introduced in the reads after the adapter (k-mer content)
|Quality analysis of reads||FastQC|
Fastx-toolkit Trimmomatic, PRINSEq, Flexbar
|It analyzes the quality of the reads|
It debugs poor quality reads
|[38, 39, 40, 41, 42]|
|Assembly||Trinity, Trans-ABySS, Oases, IDBA-Tran|
TOPHAT, STAR, IDBA-Tran, HISAT
|Assembly of reads without genome or reference transcriptome|
Assembly of reads with genome or reference transcriptome
|[46, 55, 56]|
[43, 44, 57, 58]
|Classification of transcripts||BLAST, BLAT, GMAT, AUGUSTUS|
CPAT, FEELnc, NRC, lncRScan-SVM
|It identifies coding transcripts by homology or by known transcript characteristics||[5, 59, 60, 61]|
[62, 63, 64, 65]
|Mapping||TOPHAT, STAR, HISAT, HISAT2, Bowtie||It aligns reads with a reference genome or transcriptome||[43, 44, 58, 66]|
|Quantification||RSEM, Feature Count StringTie, Salmon, Kallisto||It estimates the number of transcripts with or without their alignment||[45, 49, 67, 68, 69]|
|Classification of coding and non-coding transcripts||BEDTools, glbase||It determines the coordinates of the reference genome|||
|BLAST, BLAT, GMAP, AUGUSTUS||Through homology it manages to determine known sequences of transcripts found in databases||[5, 59, 60, 61]|
|CPAT, FELLnc, lncRScan-SVM, NRC||It evaluates characteristics of coding and non-coding transcripts||[62, 63, 64, 65]|
|Small RNA analysis||miRDeep|
|It quantifies known micro RNAs and identify new RNAs||[71, 72, 73]|
|PiPMir||It identifies new micros RNAs in plants|||
|DARIO||It allows the recognition of micro RNAs, snoRNA and tRNA|||
|IntaRNA||It analyzes micro RNAs in eukaryotes and small bacterial RNAs||[75, 76]|
|CopraRNA||It makes comparative predictions that include functional enrichment analysis||[76, 77]|
3.2 Bioinformatics tools in the study of coding RNA, non-coding RNA and microRNAs
The identification of non-coding RNAs and small RNAs is a vital issue in genetic analysis , in this sense algorithms have been developed for the analysis of this type of RNAs in particular (Table 1). Currently, the tools used to classify coding and non-coding sequences have two aspects, those that classify transcripts according to similarity and those that use known coding and non-coding properties . Similarity-based tools classify transcripts, taking as reference the amino acid sequences of their transcripts translated with known protein coding genes, for example BLAST , BLATS , GMAP . On the other hand, tools focused on coding and non-coding characteristics are based on the properties of known transcripts to predict whether a transcript encodes or not for a protein. The coding potential can be estimated using automatic learning approaches such as CPAT , FEELnc , lncRScan-SVM  and NRC . These exclude transcripts based on properties such as transcription length, length of open reading frame (ORF), ORF coverage, k-mer frequency, codon usage bias, in addition to being optimized for different techniques . In the choice of the tool to be used to evaluate the coding potential of a transcript, it will depend on what is sought in the study, if there is a good annotation and reference genome the tools based on similarity are practical and feasible in the analysis. However, in organisms that lack good gene annotations it is advisable to use tools based on coding and non-coding characteristics, which also allow to identify new genes. On the other hand, the availability of small readings opened a new field of study for small RNAs such as microRNAs (miRNAs), small RNAs of interference (siRNA) and piwiRNAs (piRNAs); Currently there are specialized tools for this type of RNA that provide additional biological knowledge. In this case miRDeep and its varieties are widely used to quantify known and novel RNA (miRNA), from the sequencing of small RNA by RNAseq [71, 72]; PiPMir  has been used for the detection of miRNA in plants. DARIO (
4. Proteomics data analysis
Transcriptome sequences provide resources for gene expression profile studies, as well as for the identification of mutations, sequence aberrations and RNA editing events , the above is possible to the existence of the open reading frame (ORF), however, in genomic data this does not imply the existence of a functional gene; despite the great advances in bioinformatics that facilitate the analysis and prediction of genes with the help of comparative genomics, and although they are years of development of molecular simulation methods, attempts to improve models that are already relatively close to the structure native, they have had little success, which may be due to inaccuracies in the potential functions used in simulations, such as the treatment of electrostatic and solvation effects or it may be necessary to improve sampling strategies due to the relatively long folding time scale of proteins; the combination of chemistry and physics with the large amount of information in known protein structures could provide a better route for the development of enhanced potential functions. Currently, it is difficult to accurately predict protein structures from genes, the success rate for the correct prediction of structures remains low [25, 80, 81]. Proteomics involves various technologies for deep proteome analysis, thus achieving quantification and identification of these proteins; covering the part of functional analysis of genetic products, interaction studies, and protein localization, which helps explain the identity of an organism’s proteins to know the structure and function. However, considering that the proteome is highly dynamic due to the complex regulatory systems that control the levels of protein expression, its use is limited, since in addition to the use of specialized personnel, facilities and equipment, software is also included for equipment, and databases, which increases costs [80, 82, 83]. Proteomics is constantly updated, generating challenges ranging from sample preparation to data collection. A large amount of information is generated from protein folding models, three-dimensional structures, prediction of unknown protein structures and functions, data obtained from the separation of proteins by electrophoresis in two-dimensional gels, isoelectric focusing, 2D protein visualization, peptide mass fingerprinting (PMF), MS, MS in tandem, etc., the above generates high performance proteomes with the help of bioinformatics, which introduces new algorithms to handle a large amount of heterogeneous data [84, 85, 86].
Some of the most used platforms in proteomics are: The Basic Local Alignment Search Tool (BLAST), Expert Protein Analysis System (ExPASy) and Protein Data Bank (PDB); BLAST (
All this information is of great help, since it is used in different research areas, such as detection of diagnostic markers, candidates for vaccine production, understanding the mechanisms of pathogenicity, alteration of expression patterns in response to different signals and interpretation of functional protein pathways in different diseases [93, 94, 95, 96, 97, 98].
5. Comparative genomics
Comparative genomics is a broad field of study that identifies differences between genomes and elucidates which of them are responsible for phenotypic changes in organisms . In contrast to ‘traditional’ genomic studies that focus on a single genome per study , comparative genomics provides additional detailed information to that obtained from the analysis of a single genome, which can reveal the encoded functional potential of an organism compared to another [101, 102, 103]. Comparisons between different genomes of organisms lead to more rapid identification of different underlying mechanisms are shared between organisms and others that are different among them [104, 105, 106]. Likewise, comparative genomics allows a better understanding of how species have evolved . In this sense, the concept of pangenome (Figure 1) refers to the set of genes in a particular species . The commonly used partition of a pangenome considers three main parts: the central genome, the expendable or accessory genome and the singleton genome . The central genes are responsible for the basic aspects of the biology of the species and its main phenotypic features; while accessory genes and singletons generally belong to supplementary biochemical pathways and functions that can confer selective advantages such as ecological adaptation . While the global analysis of gene content (as in pangenome studies) provides information on differences in functional potential and possible phenotypic differences between organisms, specific central gene analyzes have also been used for studies of phylogenetic diversity [99, 108].
Initially, the concept of pangenome was used to refer to bacterial genomes, however, over time it has been used to refer to genomes of eukaryotic organisms such as yeasts [106, 109], plants [108, 110, 111], and viruses [108, 112]. Different organisms can be compared despite their phenotypic differences and with respect to their relationship of kinship (phylogenetic distances) [105, 113]. The assembly of genomes from sequencing data by Illumina or PacBio methods  involves five important stages, these steps are described in Figure 2, as well as some of the tools used .
For gene comparisons databases with different characteristics are used, for example, to obtain gene families and identify their orthology the EDGAR database [108, 115] is used, as well as, the prokaryotic-genome analysis tool (PGAT) for the analysis of bacterial genomes [108, 116]. There are independent applications such as the Pan-genome analysis pipeline (PGAP) that have specific modules to perform the functional analysis of genes, the analysis and determination of each of the components of the pangenome, the detection of genetic variation as well as the analysis of Species evolution [108, 117], PanFunPro is a tool that allows pangenome analysis in protein prediction from genetic information . There are tools that allow you to work with large amounts of data such as PanGP  and the large acale BSR .
The bacterial pan genome analysis tool (BPGA)  is a recently published package for pangenome analysis with seven functional modules; In addition to routine analysis, it presents a series of novel features for subsequent analyzes such as phylogeny, as well as tools that allow determining the presence and absence of certain genes in specific strains, another module to perform subset analysis, content analysis atypical G+C and KEGG & COG mapping of central, accessory and unique genes [108, 121, 122, 123, 124].
6. Functional genomics
Functional genomics studies and assigns functions to the genome of an organism, including genes and non-genetic elements [125, 126], with the support of molecular and cellular biology studies, focused on the dynamic aspects of transcriptomics, proteomics and metabolomics , that allow to know the relationship of genes, their transcription, translation and protein-protein interactions [128, 129], that promote the phenotypic characteristics of each organism [125, 126]. A functional genomic approach can use multiple techniques for data analysis in a single study . Apart from the tools of transcriptomics and proteomics, functional genomics needs of studies that allow us to know gene interactions [130, 131], genetic variations (polymorphisms) in different individuals through the study of SNPs [126, 132]. Likewise, it is important to know the regulation of genes in the expression of proteins that first carries out the analysis of promoter sequences, followed by the expression of the promoters and subsequently the expression of proteins [126, 133, 134]. Another study used for a rapid and systematic analysis of the expression of a large number of genes is the microarrays, which make it easier to observe the differential expression of genes from DNA or cDNA, as well as, allowing the finding gene functions novel and unexpected . In addition, compare the pattern of gene expression under different conditions . SAGE serial analysis of gene expression based on the study of cDNA allows to examine gene expression in a cell . To perform a functional genomic observation, an assembled and identified genome must be had, which does not contain gaps, to avoid erroneous annotations. Subsequently, the assembled genome is compared with a reference genome, which together allows to predict genes. Next, the mapped elements are combined, and the biological information that allows to define an optimal set of annotations or functions is assigned. At the end, the data will have to be validated, this is achieved through manual inspections, experimental checks and quality measures . To perform the genome annotation there are computational tools, one of the most used and friendly is Blast2GO which is a bioinformatics platform for high quality functional annotations and analysis of genomic data sets . The data obtained can be shared with the public through databases so that other researchers can access them. Currently, GEO of NCBI is the public functional genomics database that provides tools that help users in the consultation and download of data . Likewise, KEGG is a database that is used as a tool to understand the high-level functions and utilities of the biological system, such as the cell, the organism or the ecosystem, based on molecular level information, generated by sequencing of the genome and other high performance . There are also databases that store specific information on each of the most important model organisms (Table 2).
|Homo sapiens: variation in humans|||
7. Phylogeny in the protein evolutionary process
The sequencing of the genome of an organism, has allowed to know the set of all its genes, elucidating the functions and products that they express, as well as the mechanisms of regulation in different metabolic processes, where endless proteins participate. To determine their possible functions, biochemical and genetic analyzes are used in a classical way, however, sequencing has contributed to the knowledge about the type of amino acids that make it up, and through the use of software multiple sequences have been aligned, where they have those that have been fully characterized as well as proteins where their biochemical characteristics are unknown and by homology between amino acids can be inferred in the functions that these proteins can present . The use of bioinformatics, in protein analysis is a challenge, in recent years, phylogenetic profiles have been fundamental to relate homologous proteins by aligning their sequences, where it has been revealed that many share highly conserved regions and similar structures . Phylogeny analyzes the changes that occur within the sequences and groups them in a diagram with ramifications, called a phylogenetic tree, all those sequences that belong to the same family can be grouped into a clade and in turn into subfamilies, providing data on their evolution and functional diversity .
Eukaryotic cells during their evolution have captured microorganisms that originated mitochondria, chloroplasts and other organelles, where their genes have been transferred to the nuclear genome, allowing the transport of encoded proteins in the nucleus. The different locations of proteins in the cell, and the different proteins that participate in cellular processes, have originated phylogenetic analyzes on the location of proteins in the cell, finding that they are closely related to prokaryotic proteins that have eukaryotes. The proteins of chloroplasts and mitochondria have a composition of amino acids, length, sequences and conserved regions very similar to those of prokaryotes [152, 153]. One of the limitations to analyze proteins among related organisms is that genomes must be complete, in order to determine the presence or absence of genes in these species .
The high number of sequences that are stored in the different databases, have allowed to infer in the evolutionary relationships of different proteins, which when presenting homology retain their function during long evolutionary times, however, homologous proteins can perform the same activity, but the substrates they use can come from different routes . When organisms adapt to different environmental conditions they cause mutational changes in genome sequences, causing amino acid substitutions in enzymes, making them improve their efficiency and specificity, to maintain their catalytic function. Not all genes that code for proteins are susceptible to mutation, due to the presence of essential amino acids in function, stability and folding, and therefore a restriction is generated. Many of the mutations are usually random and, in those proteins, where these changes have been observed, it is due to an evolutionary pressure. If the protein plays an important role in the functions of the organism and the mutation brings improvements in activity, the change in the genome is maintained and optimized, favored by selective pressure, otherwise, when the function of the protein is not relevant. In the cell, the mutant gene is removed from the genome by random deletions. Evolutionary mechanisms have given rise to homologous protein families, which share a common ancestor . The study of ancestral enzymes has suggested that these presented a high thermostability, due to the Precambrian era that was thermophilic, in addition to the fact that most microorganisms and other organisms adapted to these environments with high temperatures. The ancestral protein alignments with the current ones show evidence of a slow evolution in structure, but not in amino acids . Therefore, enzymes are the product of years of evolution, where they have undergone changes to obtain a specific function, as well as greater affinity with the substrate and/or act on multisubstrates. Therefore, the genetic variability has generated homologous genes (they descend from a common ancestor and are called orthologs) that encode adapted proteins to perform their catalysis in extreme conditions. However, there are also paralogous genes, which have diverged, to encode proteins with different activities , many times a particular characteristic is preserved, such as the binding of a molecule or reaction mechanism, but they specialize in carrying out the same reaction but on different substrates, different regulation mechanisms, as well as cell localization. On the other hand, orthologous proteins tend to have the same function and their sequences have a high conservation .
To analyze these changes in the sequences, bioinformatics programs use algorithms and mathematical models, based on empirical matrices of amino acid substitution, as well as those that incorporate structural properties of the native state, such as secondary structure and accessibility . Protein phylogeny studies are currently necessary to know protein-protein interactions in biological systems. Molecular or structural analyzes on proteins will require more information to respond if a protein is present in one or several species, as well as to predict the common ancestor and evolution times . There are different methods to estimate the genetic distance of proteins, among the most used are the minimum distance, which predicts the phylogenetic relationship minimizing the total distance of the pairs of sequences adjacent nodes tree. While those of maximum parsimony and maximum likelihood, use the multiple sequence alignment, however, the maximum parsimony maximum builds a tree minimizing the total evolutionary changes between adjacent proteins and the maximum likelihood tries to minimize the probability of making such changes. The bioinformatics tools that use these algorithms are: TOPAL, Hennig86 and PAML, the computational packages that allow to occupy any of these are PHYLIP and PAUP, as well as MOLPHY, PASSML, PUZZLE, TAAR .
8. Protein modeling
One of the challenges of protein engineering and biology is to improve industrial processes, to achieve this it is necessary to determine the tertiary structure of proteins from the amino acid sequence, in order to design new proteins and even new medicines. Many of the protein structures that we know today have been obtained through experimentation by X-ray crystallography, NMR spectroscopy or cryo-EM, however, the large amount of proteins, makes these processes require more time and increase costs . Modeling through bioinformatics programs has managed to predict the atomic structure of several proteins from their amino acid sequence, by comparison with known protein structures, commonly called templates, although these do not present an accuracy with traditional techniques, the processes are faster and more economical in addition to providing low resolution data during sequence comparison [162, 163]. If the protein studied presents a homolog of known structure, the analysis is easier and the generated model is of higher resolution, but if the homologs do not exist or are not identified, the modeling is constructed from scratch . De novo modeling is based on the assembly of proteins using short peptide fragments, originating from known proteins based on similarity, although advances have been made using this process, it has only worked on proteins that contain less than 100 amino acids, on large proteins size is difficult to analyze due to lack of information, as well as the type of software used [161, 165].
The 3D protein structures provide data at the molecular level, functions and properties, among which are the study of the catalytic mechanism, design and improvement of ligands, union of macromolecules with proteins, functional relationships through structural similarity and identification of conserved residues . The interest in finding new protein models is generating a large amount of data, which is being stored in different databases, including Protein Data Bank, where the coordinates of the experimentally obtained atoms are stored; until 2014 this base contained more than 80 million sequences and more than 100,000 experimentally obtained 3D structures [166, 167]. These data have allowed the classification of proteins in different hierarchical levels as family, superfamily and fold in relation to their structure and evolution. All those that are grouped into a family are evolutionarily related to high sequence similarity. It is suggested that the different families that maintain a structure and function, present a common ancestor and are grouped into superfamilies and the difference between these is due to the folds or secondary structure that they possess . In the last decade, the predictions by computational models have revealed the structure and function of many proteins, but the advances have been in some cases slow and expensive, due to the programming methods used and the precision of these during modeling. Currently working on automated bioinformatics servers that will generate models with a high percentage of accuracy [168, 169]. One of the most used servers worldwide is SWIIS-MODEL, which was the first to model proteins through homology, and in recent years has been automated allowing complex modeling, as well as the introduction of the modeling engines ProMod3 and QMEAN [167, 170, 171]. Most modeling algorithms use the following steps: (1) Identification of related structures, (2) template choice, (3) target sequence alignment with templates, (4) molding construction, (5) model evaluation. However, one of the limitations during homology protein modeling is the choice of model proteins or templates as well as alignments against the problem sequences [172, 173]. When the similarity of the sequences between the problem protein and that of the databases is low, the relationship and alignment can be improved if structural information is included during the analysis . Advances in biocomputing have allowed the generation of tools for modeling proteins that are more reliable and easier to use, reducing time and cost in the analysis. However, it is necessary to carry out experimentation to confirm that the prediction is correct, in addition to improving the efficiency of the techniques and with more known protein sequences and stored in the databases, therefore the different bioinformatics tools will play an important role in the postgenomic era .
Bioinformatics has evolved with daily work, which has allowed us to know how the biological molecules of a cell interact for their proper functioning, in addition to predicting various biological phenomena. In the last decade, the omic sciences have generated a great amount of data increasing the knowledge of the biological functions so that in the future they are able to predict diseases or formulate drugs with greater efficiency, however it is still necessary, to have a higher percentage of sequenced genes of the different organisms, as well as protein sequences, that allow enriching the databases, and with this more precise mathematical models are generated, which will benefit the computer programs so that they are more efficient, reliable, easy to use, reducing time and cost in the analyzes. This discipline becomes an essential part of biological studies every day, so its expansion and growth will be infinite, due to the evolutionary changes that are taking place in the cells caused by the different environmental phenomena.
Conflict of interest
The authors declare no conflict of interest.