Open access peer-reviewed chapter - ONLINE FIRST

Characterization, Comparative, and Phylogenetic Analyses of Retrotransposons in Diverse Plant Genomes

By Aloysius Brown, Orlex B. Yllano, Leilani D. Arce, Ephraim A. Evangelista, Ferdinand A. Esplana, Lester Harris R. Catolico and Merbeth Christine L. Pedro

Submitted: April 7th 2021
Reviewed: June 25th 2021
Published: July 28th 2021

DOI: 10.5772/intechopen.99074



Retrotransposons are transposable elements that copy and paste themselves into a genome through an RNA intermediate, which reverse transcriptase converts back into DNA. Retrotransposons are ubiquitous in the genomes of eukaryotic organisms. This study analyzed the structures and determined the comparative distributions and relatedness of retrotransposons across diverse orders (34) and families (58) of kingdom Plantae. In silico analyses were conducted on 134 plant retrotransposon sequences using ClustalW, EMBOSS Transeq, Motif Finder, and MEGA X. The analysis of these plant retrotransposons showed significant genomic relationships between bryophytes and angiosperms (216), bryophytes and gymnosperms (75), pteridophytes and angiosperms (35), pteridophytes and gymnosperms (28), and gymnosperms and angiosperms (70). Thirteen homologous plant retrotransposons, 30 conserved domains and motifs (including reverse transcriptase, integrase, and gag domains), and nine significant phylogenetic lineages were identified. This study provides comprehensive information on the structures, motifs, domains, and phylogenetic relationships of retrotransposons across diverse orders and families of kingdom Plantae. The ubiquity of retrotransposons across diverse taxa makes them excellent molecular markers for better understanding the complexity and dynamics of plant genomes.


  • transposable elements
  • retrotransposon
  • genetic polymorphism
  • phylogenetic analysis
  • genome

1. Introduction

Retrotransposons can move within genomes due to their highly effective transposition mechanism. Because of this high level of transposition, their presence is a significant feature of the genomes of plants and other eukaryotic organisms. Since the discovery of transposable elements (TEs) by Barbara McClintock more than seven decades ago, studying the structures of retrotransposons has been challenging because of their repetitive structure, their diversity in form, their large number in a genome, and their ability to replicate so frequently [1]. Even studying closely related genomes does not overcome this problem, since retrotransposons also tend to be highly species-specific, a trait that makes them difficult to classify. Research has shown that they are not merely transient components of a genome but are instrumental in genomic development and adaptation, influencing genomes in ways that range from how chromosomes are structured to the activation of certain genes under certain conditions [2]. The interaction of retrotransposons with a host genome is not a simple one. Evidence shows that they have helped shape genomes over an extended period. In some cases, this has imparted important genetic traits to their host organisms. In others, they have been linked to mutagenesis and disease, prompting their hosts to develop regulatory safeguards that suppress and limit their activities [3].

Recent advances in sequencing technologies have come a long way in helping unravel the structure of plant genomes. Plant genomes are some of the most complex and diverse among known eukaryotic kingdoms [4] and vary widely in size across kingdom Plantae, with the smallest genomes sequenced so far being from green algae species [5] and the largest being that of Pinus taeda, which is around 22 Gbp in length [6]. A significant portion of the plant genome comprises transposable elements, the so-called “jumping genes” [7]. The diversity and size variation across plant genomes is primarily attributed to the activity of these transposable elements [8]. Transposable elements are thought to have viral origins; in particular, retrotransposon structures closely resemble retroviruses without the gene for the viral envelope or with a nonfunctional envelope gene. It is hypothesized that transposable elements entered the genomes of eukaryotes through infection by ancient viruses and remained as parasitic elements in their host genomes [9]. More studies are needed to better understand the complexity of plant retrotransposons and unravel their salient features.

1.1 Classes and types of transposable elements

The complexity and diversity of transposable elements, coupled with the availability of recent genomic sequences in the genebanks, have generated various groupings of TEs. However, concerted efforts have been made to come up with a generally accepted and unified nomenclature. The replication process employed by transposable elements is used to classify them into two large groups [10]. Retrotransposons, or Class I transposable elements, use the enzyme reverse transcriptase to copy and paste themselves in the genome and are the most abundant type in plant genomes. DNA transposons, or Class II transposable elements, use other enzymes, including DNA polymerase and transposase, to copy and insert themselves into genomes [11]. This copy-and-paste mechanism is responsible for the significant number of transposable elements in eukaryotic genomes.

Class I transposable elements, or retrotransposons, consist of the long terminal repeat (LTR) retrotransposons and the non-long terminal repeat (non-LTR) retrotransposons. LTR and non-LTR retrotransposons are further subdivided based on their dynamics in the genome. Autonomous retrotransposons can be independently mobile, while nonautonomous retrotransposons require the presence of other TEs for their movement. The LTR retrotransposons in eukaryotes include the Gypsy, Copia, BEL, DIRS, ERV1, ERV2, and ERV3 superfamilies. In contrast, superfamilies of non-LTR retrotransposons include SINE1, SINE2, SINE3, LINEs, CR1, CRE, I, RTE, TX1, Jockey, Penelope, R2, R4, RandI, Rex1, L1, and NeSL [12, 13].

A less well-studied class of retrotransposons in plant genomes is the non-LTR retrotransposons. These are the LINEs (Long Interspersed Nuclear Elements) and the SINEs (Short Interspersed Nuclear Elements). They do not exhibit much activity in plant genomes, but they constitute around 33.5%, or about one third, of the human genome [13]. Moreover, they contribute to new insertions in the human genome and have been linked to mutagenesis and human diseases [14].

LINEs are considered the oldest class of retrotransposons in plant genomes. Evidence suggests that they are highly regulated or inactive, since their transcription is rarely observed in plant genomes [15]. In contrast, studies have shown that the ancient activity of SINEs helped shape the genomic diversification of some monocot species [16] and the heterogeneity of many eukaryotic genomes, but apart from this, little is known so far of their activity in plant genomes [17]. There is therefore a need to study and characterize the diverse retrotransposons and to understand how, and to what extent, they influence changes in a host genome.

1.2 Characterization of retrotransposons

The presence of transposable elements in an organism has many implications for its genomic activity. Depending on the region of the chromosome they are located on, they may affect which genes are expressed in the genome and the functions of these genes [18]. Gypsy retrotransposons have a widespread and more diverse position on the chromosomes in plant genomes, while Copia retrotransposons tend to cluster in proximal regions of the chromosomes they are located on [19]. However, it is worth pointing out that LTR retrotransposons tend to group in different chromosomal regions regardless of their lineages [20]. Research into plant genomic structures has yielded valuable insight into the characterization of retrotransposons due to their ubiquitous presence in plant genomes [21]. Non-LTR retrotransposons are subclassified into LINEs and SINEs [22]. The LTR retrotransposons are further classified into “superfamilies” based on their genetic sequences, namely, the Copia, Gypsy, Bel-Pao, retrovirus, and endogenous retrovirus superfamilies [23]. Of these, the most widespread in plant genomes and the most well studied are the Gypsy and Copia superfamilies. Gypsy retrotransposons are differentiated from Copia retrotransposons by the position of the integrase protein in their genetic sequence. In Gypsy retrotransposons, integrase is situated after the reverse transcriptase, whereas in Copia retrotransposons it is situated before the reverse transcriptase [24]. Phylogenetic analyses and time of divergence are used to further divide these superfamilies into different lineages. The Copia superfamily comprises the TORK, Bianca, Ale, and Maximus lineages, and the Gypsy superfamily comprises the Athila, CRM, Del, and Galadriel lineages [25]. LTR retrotransposons show such variety in number, position, and distribution in their host genome because of their ability to act independently and replicate themselves numerous times on chromosomes [26].
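The positional rule that separates the two superfamilies can be expressed as a short check. The following is a minimal sketch, assuming the domains of an element have already been annotated and listed in 5′ to 3′ order; the function name and the domain labels (GAG, PR, IN, RT, RH) are illustrative choices, not part of this study's workflow.

```python
def classify_ltr_superfamily(domain_order):
    """Guess the superfamily from the 5'->3' order of annotated domains.

    domain_order: list of labels such as ["GAG", "PR", "IN", "RT", "RH"],
    where "IN" is integrase and "RT" is reverse transcriptase
    (hypothetical labels chosen for this sketch).
    """
    try:
        in_pos = domain_order.index("IN")
        rt_pos = domain_order.index("RT")
    except ValueError:
        # One of the two diagnostic domains was not annotated.
        return "unknown"
    # Integrase before reverse transcriptase -> Copia; after -> Gypsy.
    return "Copia" if in_pos < rt_pos else "Gypsy"


print(classify_ltr_superfamily(["GAG", "PR", "IN", "RT", "RH"]))
print(classify_ltr_superfamily(["GAG", "PR", "RT", "RH", "IN"]))
```

A rule this simple only works on complete elements; truncated sequences in which one of the two domains is missing are reported as unknown rather than guessed.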

A key feature of LTR retrotransposons, and the structure that gives them their name, is the presence of two homologous structures called long terminal repeats at both ends of their genetic sequence. These DNA sequences can vary in size from a hundred bp to thousands of bp [27]. These LTRs are non-coding regions that bracket the internal coding regions and are also a component of retroviral sequences [28]. LTR retrotransposons vary widely in size and functional characteristics. In plants, they have been documented from as short as 4 kbp in Helianthus species [29] to over 23 kbp in Populus trichocarpa [30]. The structures of LTR retrotransposons are organized into one or several open reading frames (ORFs) [31]. The ORFs contain genetic information for the pol and gag genes and are integral to transcription in the host genome [32]. Like their retroviral counterparts, the gag genes encode functional polyproteins, and the pol gene usually contains the reverse transcriptase. These genes are typically separated by stop codons [33]. The pol gene encodes three important proteins, each of which has a crucial role in retrotransposon replication in the genome [34]. These proteins are integrase, protease, and reverse transcriptase [35]. Because retrotransposons replicate similarly to viruses, and their replication can lead to mutations and disrupt DNA repair, there are genomic mechanisms in place to silence their activity [36]. One mechanism the cell uses to silence retrotransposons is the formation of heterochromatin near areas of retrotransposon activity [37]. The presence of heterochromatin makes it difficult for the retrotransposon proteins to access the cell DNA, suppressing replication [38]. To escape this silencing, LTR retrotransposons may possess another region called the chromodomain, which encodes a protein that helps the retrotransposon escape silencing by manipulating this heterochromatin. Chromodomains are found upstream of the 3′ end of the genetic sequence in retrotransposons [39].

1.3 Mechanism of action

Retrotransposons insert and reinsert themselves in a host genome through transcription followed by reverse transcription of the resulting RNA intermediate. This transcript is the template that is used to generate new copies of the retrotransposon [40]. The reverse transcription of retrotransposons is a complex procedure. In LTR retrotransposons, the process is helped by the long terminal repeats at each end of their structure, which act as start sites for replicating the internal region. The replication of this internal region occurs in opposite directions to produce two DNA strands. At the 3′ end, tRNA binds to the initiation site of the left LTR and primes the replication of one of the two DNA strands. At the right LTR, a polypurine tract, which acts as a primer, binds immediately upstream of this region and primes the replication of the second of the two DNA strands [41].

The mRNA template is synthesized first in the replication of retrotransposons. This mRNA template is then translated into the proteins used in the process. The mRNA template has a U region and a short repeat sequence at each end. tRNA acts as a primer and binds to a primer binding site on the mRNA. This initiates the production of minus (−) strand DNA, catalyzed by reverse transcriptase. The synthesized DNA reaches the U5 region at the 5′ end of the template and pairs with the repeat sequence at the 3′ end of the genomic RNA. Once synthesis of this first DNA strand is complete, the enzyme RNase H degrades the genomic RNA template, leaving only fragments. These fragments then prime the synthesis of the second DNA strand. As with the first strand, reverse transcriptase synthesizes another DNA strand but uses the first DNA strand as a template. At the end of this process, a linear double-stranded DNA is made with an LTR region (comprising the repeat sequence, U5, and U3 regions) at each end. The enzyme integrase then inserts this new retrotransposon DNA into the host chromosomal DNA by using the 3′-OH of each strand to integrate at target sites a few base pairs apart in the genome [42].
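The template relationship between the RNA and the two DNA strands can be illustrated with base pairing alone. The sketch below is a deliberate simplification under stated assumptions: it ignores priming, strand transfers, and LTR formation, and only shows that the minus strand is complementary to the RNA while the plus strand is complementary to the minus strand. The function names and the toy sequence are hypothetical.

```python
# Base-pairing rules used when copying an RNA or a DNA template into DNA.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}


def minus_strand_cdna(rna):
    """First strand: DNA complementary to the RNA template, written 5'->3'."""
    return "".join(RNA_TO_DNA[base] for base in reversed(rna.upper()))


def plus_strand_dna(minus_strand):
    """Second strand: DNA complementary to the first DNA strand, written 5'->3'."""
    return "".join(DNA_TO_DNA[base] for base in reversed(minus_strand.upper()))


rna = "AUGGCUUAA"  # toy template, not a real retrotransposon transcript
minus = minus_strand_cdna(rna)
plus = plus_strand_dna(minus)
print(minus)
print(plus)
```

The plus strand carries the same sequence as the original RNA with U replaced by T, which is why the inserted double-stranded DNA is a faithful copy of the transcribed element.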

1.4 Role of retrotransposons

Retrotransposons are known to be major drivers of genomic diversity and homogeneity during the development of eukaryotic genomes. Presently, their activity in plant genomes is regulated by different mechanisms, but they are still capable of bursts of activity when reactivated by mutations, adjacent gene expression, or environmental factors [43]. Grandbastien [44] has noted that all the retrotransposons known to be active in plant genomes are usually dormant during their host's development but become active in response to environmental stressors. This could be linked to retrotransposons being proliferators of genomic diversity, since their activation by stresses induces survival genes to turn on. The study by Hilbricht et al. [45] on dehydration in Craterostigma plantagineum led to the isolation and identification of a retroelement gene, the Craterostigma desiccation-tolerant (CDT-1) gene, that is turned on by dehydration and imparts drought-resistant properties to the plant. This is also in line with Zhao et al. [46], who found a potential link between the OAR1 gene and tolerance of osmotic and alkaline stresses in Arabidopsis thaliana. Though often characterized by their propensity to initiate mutagenesis, retrotransposons have been shown to affect the expression of genes they are adjacent to in the genome and even to help regulate the structure of centromeres [47], as noted in an investigation of maize species by Gao et al. [48]. Analysis of tomato plants demonstrated that differences in volatile esters between differently colored fruits of different species are linked to the placement of retrotransposons near a family of esterases that exhibits a high level of enzyme activity. This placement results in higher expression of the esterase and, in turn, reduced levels of multiple esters [49]. Retrotransposons have also been linked to disease resistance in plants. One study showed that activation of Athila LTR retrotransposons led to genome expansion in Capsicum baccatum by increasing the copy number of a disease-resistance gene family [50]. Analysis of Phaeodactylum tricornutum cells showed that LTR retrotransposon activity is part of the response to a decrease in nitrate and to exposure to reactive aldehydes that stress diatoms and lead to cell death [51]. Analysis of retrotransposon families in sorghum species shows that their activity influences genomic adaptation and diversity [52]. This finding suggests that retrotransposons play vital roles in regulating genes that encode functional proteins [53]. In a study of thale cress and adzuki bean seedlings, treatment with the DNA methylation inhibitor zebularine increased the activity and accumulation of the retrotransposon ONSEN relative to control seedlings [54]. These studies point to the pivotal role of retrotransposons in plants’ adaptation to their environment and their contribution to genomic diversity.

This study compared, characterized, identified shared patterns, and determined the relationships of different retrotransposons across diverse plant taxa.


2. Materials and methods

To assemble the plant retrotransposon library, we collected genomic DNA sequences deposited in the National Center for Biotechnology Information (NCBI) nucleotide database. These were then filtered to include only sequences 300 to 800 base pairs in length. In total, 134 retrotransposon sequences were selected and analyzed in this study. Of these, 54 were from angiosperms, 46 from gymnosperms, 11 from pteridophytes, three from liverworts, and 20 from bryophytes. The sequences were downloaded in FASTA format and saved in a text document for further analyses. To study the characteristics of the plant retrotransposon sequences and identify homology, the ClustalW multiple sequence alignment program was used. The parameters of the ClustalW analysis were defined as follows: Pairwise Alignment was set to slow and accurate for DNA sequences only. The Gap Open Penalty was set to 15 and the Gap Extension Penalty to 6.66. The Weight Matrix used was the International Union of Biochemistry (IUB) matrix for DNA sequences. These same parameters were used for the multiple sequence analysis, with hydrophilic gaps included in the computation.
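The length filter described above can be sketched in a few lines. This is an illustrative example only, assuming the downloaded records are available as plain FASTA text; the function names and the toy records are hypothetical and do not reproduce the actual library used in this study.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def filter_by_length(records, min_len=300, max_len=800):
    """Keep only sequences whose length falls within [min_len, max_len] bp."""
    return {name: seq for name, seq in records.items()
            if min_len <= len(seq) <= max_len}


# Toy input: one sequence too short, one in range (400 bp), one too long.
example = (">short_seq\nACGT\n"
           ">kept_seq\n" + "ACGT" * 100 + "\n"
           ">long_seq\n" + "A" * 900 + "\n")
kept = filter_by_length(read_fasta(example))
print(sorted(kept))
```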

Motif analyses were performed on the plant retrotransposon sequences to identify motifs, protein domains, and conserved domains. The nucleotide sequences were translated into their corresponding amino acid (aa) sequences with the EMBOSS Transeq tool developed by the European Bioinformatics Institute. The algorithm was set to translate the nucleotide sequences in three reading frames using the standard codon table. The translated aa sequences were then analyzed for protein domains, families, and functional sites using the PROSITE tool developed by the Swiss Institute of Bioinformatics [55] and the MOTIF Finder program of the Kyoto University Bioinformatics Center [56]. All three reading frames were analyzed to ensure that the proper frame would be used for motif identification. The aligned retrotransposon sequences were analyzed using MEGA X. The software was used to construct a maximum likelihood phylogenetic tree, with the Tamura-Nei method used to account for the substitution rate differences between nucleotides and the inequality of nucleotide frequencies. Nearest-Neighbor-Interchange was used as the heuristic method to improve the likelihood of the tree. The phylogenetic tree generated by MEGA X was then modified in the MEGA X Tree Topology Editor to produce a circular phylogenetic diagram for better data visualization.
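The translation step can be illustrated with a short stand-alone sketch. It is not the EMBOSS Transeq implementation; it assumes the three frames are the forward frames, uses the standard codon table, and writes X for any codon containing an ambiguous base. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1, or 2)."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are reported as X.
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


def three_frame_translation(seq):
    """Return the translations of forward frames 1, 2, and 3."""
    return {frame + 1: translate(seq, frame) for frame in range(3)}


print(three_frame_translation("ATGGCCTAA"))
```

Each frame would then be scanned separately for motifs, since only one of them is expected to encode the retrotransposon protein without premature stop codons (shown as `*`).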

3. Results and discussion

3.1 Multiple sequence alignment

Figure 1 shows the alignment scores produced by the multiple sequence alignment analysis performed in the ClustalW program. These scores represent the pairwise alignment between each pair of retrotransposon sequences. The cutoff alignment score was set at 50 percent identity between two aligned sequences.

Figure 1.

Significant pairwise alignment scores of 134 plant retrotransposon sequences.

In total, there were 870 pairwise alignments with an alignment score of 50 to 100 percent. Fifty-five percent (476) of the alignments had a percent identity in the range of 50 to 59. Thirty-two percent (281) had a percent identity in the range of 60 to 69. Seven percent (65) had a percent identity in the range of 70 to 79, 4% (35) had a percent identity in the range of 80 to 89, and 2% (13) had a percent identity in the range of 90 to 100. Multiple sequence alignment scores of 40% and higher are considered significant, whereas an alignment score below 40% is considered too divergent [57]. The cutoff for this multiple sequence analysis was set to 50% to include only highly significant alignments.
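The distribution above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the bin labels and the helper `identity_bin` are illustrative names, and the shares are printed to one decimal place so that the rounding to whole percentages is visible.

```python
# Number of pairwise alignments per percent-identity range, as reported above.
bin_counts = {"50-59": 476, "60-69": 281, "70-79": 65, "80-89": 35, "90-100": 13}

total = sum(bin_counts.values())
shares = {label: round(100 * count / total, 1)
          for label, count in bin_counts.items()}


def identity_bin(score):
    """Assign a percent-identity score (50 to 100) to one of the ranges above."""
    if score < 50 or score > 100:
        return None  # below the 50% cutoff or not a valid percentage
    if score >= 90:
        return "90-100"
    lower = int(score // 10) * 10
    return f"{lower}-{lower + 9}"


print(total)
print(shares)
print(identity_bin(87))
```

The counts sum to the 870 alignments reported, and the shares (54.7, 32.3, 7.5, 4.0, and 1.5 percent) correspond to the whole-number percentages given in the text.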

3.2 Identification of homologous sequences

Table 1 contains the aligned sequences with the highest alignment scores. The relationships among these sequences are diverse. T. pellucida 1 and T. pellucida 2 are of the same species but are different clones. The plants in each of the pairs A. concolor to A. veitchii, L. gmelinii to L. czekanowskii, A. sativa to A. sterilis, A. ipaensis to A. hypogaea, and V. dubyana to F. antipyretica belong to the same genus. A. araucana and A. brownii belong to the same family. The sequences aligned in the pairs L. saxicola to P. schreberi and D. polysetum to L. glaucum belong to the same order, while those in the pairs S. cooperi to D. truncatula and P. cuspidatum to R. canescens belong to the same class. Sequences that share only the same division can also be closely related, as in the case of P. patens to M. polymorpha, with 99% identity, and N. tetragona to M. grandiflora, with 100% identity. Pairs of sequences within the same genus were the most numerous among the aligned pairs.

Sequences aligned | Alignment score
A. concolor: A. veitchii | 90
L. saxicola: P. schreberi | 91
S. cooperi: D. truncatula | 91
D. polysetum: L. glaucum | 93
P. cuspidatum: R. canescens | 93
L. gmelinii: L. czekanowskii | 94
A. araucana: A. brownii | 94
A. sativa: A. sterilis | 94
A. ipaensis: A. hypogaea | 95
P. patens: M. polymorpha 1 | 99
T. pellucida 1: T. pellucida 2 | 100
V. dubyana: F. antipyretica | 100
N. tetragona: M. grandiflora 2 | 100

Table 1.

Aligned sequences with an alignment score of 90 to 100.

The results above confirm the highly conserved nature of retrotransposons. This is supported by a study of retrotransposons in mammals [58]. Despite their enormous number and diversity, it has been noted that similar retrotransposons tend to cluster together in similar genomes of hosts belonging to the same order, family, or class [59]. Specific types of retrotransposons belonging to the same family or lineage can be conserved across a particular kingdom or division [60]. The presence of homologs can be inferred from these aligned sequences, considering their high percent identity and their distribution across different species [61]. An alignment score of 90 or higher was used as the cutoff value for homolog identification [62].
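Applying the homolog cutoff amounts to a simple filter on the pairwise scores. The following sketch uses a handful of pairs and scores reported in this chapter purely as sample input; the variable names and the constant `HOMOLOG_CUTOFF` are illustrative.

```python
HOMOLOG_CUTOFF = 90  # alignment score at or above which a pair is called homologous

# (sequence A, sequence B, alignment score); a small sample of reported pairs.
pairs = [
    ("A. concolor", "A. veitchii", 90),
    ("A. concolor", "A. balsamea", 85),
    ("P. patens", "M. polymorpha 1", 99),
    ("L. kaempferi", "L. sibirica", 89),
]

homologs = [(a, b, score) for a, b, score in pairs if score >= HOMOLOG_CUTOFF]
conserved_only = [(a, b, score) for a, b, score in pairs
                  if 80 <= score < HOMOLOG_CUTOFF]

for a, b, score in homologs:
    print(f"{a} : {b} ({score})")
```

Pairs scoring 80 to 89 fall below the homolog cutoff but are still treated as evidence of conservation in the next section.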

3.3 Conservation of retrotransposons

Table 2 is a summary of retrotransposon sequences with an alignment score of 80 to 89. This is the pairwise alignment score between pairs of sequences.

Sequences aligned | Alignment score | Sequences aligned | Alignment score
L. saxicola: P. polyantha | 80 | P. contorta: L. sibirica | 86
L. saxicola: D. polysetum | 80 | S. obtusum: A. rupestris | 87
D. polysetum: R. canescens | 80 | L. occidentalis: L. kaempferi | 87
L. glaucum: P. cuspidatum | 80 | L. occidentalis: P. schrenkiana | 87
P. polyantha: L. glaucum | 81 | P. contorta: P. schrenkiana | 87
P. polyantha: P. cuspidatum | 81 | L. kaempferi: P. rubens | 87
P. polyantha: R. canescens | 81 | P. rubens: P. schrenkiana | 87
P. polyantha: D. polysetum | 82 | J. communis: T. baccata | 87
P. polyantha: H. ciliata | 82 | S. cooperi: N. exaltata | 88
D. polysetum: P. cuspidatum | 82 | D. truncatula: N. exaltata | 88
G. biloba2: P. rubens | 82 | P. contorta: L. kaempferi | 88
P. schreberi: P. polyantha | 83 | P. contorta: P. rubens | 88
G. biloba2: P. contorta | 83 | A. veitchii: A. balsamea | 88
L. occidentalis: L. sibirica | 83 | L. occidentalis: P. rubens | 89
L. sibirica: P. rubens | 84 | L. kaempferi: L. sibirica | 89
L. sibirica: P. schrenkiana | 84 | L. kaempferi: P. schrenkiana | 89
G. biloba2: P. schrenkiana | 85 | |
A. concolor: A. balsamea | 85 | |
L. occidentalis: P. contorta | 86 | |

Table 2.

Aligned sequences with an alignment score of 80 to 89.

Aligned sequence pairs in the same genus were: L. occidentalis to L. sibirica, A. concolor to A. balsamea, L. occidentalis to L. kaempferi, P. rubens to P. schrenkiana, A. veitchii to A. balsamea, and L. kaempferi to L. sibirica. The aligned sequence pairs in the same family were: L. sibirica to P. rubens, L. sibirica to P. schrenkiana, L. occidentalis to P. contorta, P. contorta to L. sibirica, L. occidentalis to P. schrenkiana, P. contorta to P. schrenkiana, L. kaempferi to P. rubens, P. contorta to L. kaempferi, P. contorta to P. rubens, L. occidentalis to P. rubens, and L. kaempferi to P. schrenkiana. At the order level, the aligned sequence pairs were: L. saxicola to P. polyantha, J. communis to T. baccata, and D. truncatula to N. exaltata. Aligned sequence pairs in the same class were: L. saxicola to D. polysetum, D. polysetum to R. canescens, L. glaucum to P. cuspidatum, P. polyantha to L. glaucum, P. polyantha to P. cuspidatum, P. polyantha to R. canescens, P. polyantha to D. polysetum, P. polyantha to H. ciliata, D. polysetum to P. cuspidatum, P. schreberi to P. polyantha, and S. cooperi to N. exaltata. Likewise, the aligned sequence pairs in the same division were: G. biloba to P. schrenkiana, G. biloba to P. contorta, G. biloba to P. rubens, and S. obtusum to A. rupestris.

3.4 Motifs and domains

Molecular characterization is important in understanding the nature of any genetic element and its insertion origin in a genome. Molecular characterization provides a detailed description of the structure of a genetic sequence, changes that it induces in a genome, and how it affects genetic expression [63]. Characterization is an important feature in the study of retrotransposons. It is also used for classifying retrotransposons [64], uncovering their associations in a genome [65, 66], and discovering new types of retrotransposons (Table 3) [66].

Reverse transcriptase (RNA-dependent DNA polymerase) | Simian taste bud-specific gene product family
Tsi6 | BAFF-R, TALL-1 binding
RNase H-like domain found in reverse transcriptase | Zinc knuckle
Tc5 transposase DNA-binding domain | GAG-polyprotein viral zinc-finger
Peptidase propeptide and YPEB domain | Mis6
Integrase zinc-binding domain | Protein prenyltransferase alpha subunit repeat
Integrase core domain | Chromatin remodeling factor Mit1 C-terminal Zn finger 2
H2C2 zinc finger | 5′-3′ exonuclease, N-terminal resolvase-like domain
gag-polypeptide of LTR copia-type | Retrotransposon gag protein
Aspartyl protease | C2H2 zinc-finger
gag-polyprotein putative aspartyl protease | GAG-pre-integrase domain
Retroviral aspartyl protease | Eukaryotic translation initiation factor 3 subunit G
Domain of unknown function | 3′ exoribonuclease family, domain 2
Putative peptidase (DUF1758) | HicA toxin of bacterial toxin-antitoxin
Fimbrial assembly protein (PilN) | BRK domain

Table 3.

Motifs and domains identified by the MOTIF finder.

The identification of the reverse transcriptase motif in these retrotransposon sequences is significant because reverse transcriptase is not only integral to the replication process of retrotransposons but is also one of the most significant parts of their structure [67]. The reverse transcriptase type identified in these sequences is found only in LTR retrotransposons and retroviruses. The presence of this reverse transcriptase type usually indicates that the sequence is a retrotransposon mobile element or a retrovirus [68]. Because of this high specificity, reverse transcriptase gene identification could be used to identify retrotransposon sequences. Reverse transcriptases are known to be multidomain enzymes, with notable domains being the catalytic domain and the RNase H domain [69]. The Tc5 transposase DNA-binding domain is a structural motif found in many proteins that regulate gene expression. The RNase H-like domain found in these retrotransposon sequences belongs to a reverse transcriptase subfamily that shares sequence similarity with reverse transcriptases from endogenous retroviruses of the zebrafish and from the Moloney murine leukemia virus [69, 70]. This finding supports the hypothesis that retrotransposons in eukaryotes have viral origins.

The presence of the zinc-binding domain indicates the presence of integrase, since it is one of the domains of the integrase enzyme. Integrase allows retroviruses and retroelements to insert their DNA into a host genome [71]. The integrase core domain that was also detected in these sequences is one of the three known domains of the integrase enzyme. It is the catalytic domain that catalyzes the transfer of retroviral or retrotransposon DNA made by reverse transcriptase to the site in the genome where it will be inserted [72]. The GAG-pre-integrase domain lies upstream of the integrase region in retroviral polyproteins. It is usually connected to elements that assist in retroviral insertion [73].

The Copia family of retrotransposons is a large retrotransposon family active in the genomes of plants. It is classified under the long terminal repeat retrotransposons along with the Gypsy family [74]. The gag-polypeptide of LTR copia-type domain is highly conserved and found only in Copia retrotransposons [75]. This domain was identified in seven species (G. biloba, L. occidentalis, P. contorta, L. kaempferi, L. sibirica, P. rubens, and P. schrenkiana), identifying their sequences as Copia family retrotransposons.

Some domains were identified that are not generally associated with retrotransposons. The HicA toxin functions as an mRNA interferase in bacteria and archaea species [76], Tsi6 is a bacterial immunity protein, and the fimbrial assembly protein functions in the production of bacterial fimbriae used for cellular attachment [77]. The Simian taste bud-specific gene is found in primates, and mutations of this gene have been linked to follicular lymphomas [78]. The Mis6 protein is integral for chromosome segregation during mitosis, and the protein prenyltransferase alpha subunit repeat functions in protein prenylation. The eukaryotic translation initiation factor 3 subunit G initiates protein synthesis [79], and BAFF-R is a polypeptide that binds TALL-1, a tumor necrosis factor ligand that initiates inflammation in humans [80]. Zinc finger proteins are a large family of proteins noted for their role as transcription factors and their ability to bind Zn ions. Several of these protein types were identified from the plant retrotransposon sequences, including the H2C2 zinc finger, zinc knuckle, GAG-polyprotein viral zinc-finger, chromatin remodeling factor Mit1 C-terminal Zn finger 2, and C2H2 zinc-finger. Recent studies revealed that they are highly involved in regulating plant responses to abiotic stressors in their environment [81]. The peptidase propeptide and YPEB domain, putative peptidase (DUF1758), 5′-3′ exonuclease N-terminal resolvase-like domain, and BRK domain are all hypothetical proteins whose activity is presently little known or unknown [82].

3.5 Patterns and profiles

The PROSITE database has an extensive collection of protein families, subfamilies, domains, and motifs managed by the Swiss Institute of Bioinformatics [83]. The database is organized into unique protein profiles and patterns to identify functional sites, domains, and protein families [84].

Table 4 contains the four PROSITE patterns found in the retrotransposon sequences. IPNS_1 was found in E. arvense, ASP_PROTEASE in G. biloba, ZINC_PROTEASE in P. contorta, and TONB_DEPENDENT_REC_1 in T. aestivum. Isopenicillin N synthase is an enzyme found in bacterial and fungal species that is instrumental in the production of cephalosporin and penicillin [85]. TonB-dependent receptor proteins are found in E. coli and are involved in the active transport of substrates into the periplasmic space [86]. The presence of these bacterial domains in plant retrotransposons supports their role as genetic reservoirs. Because of their transposable nature, they can “jump” from bacterial plasmids onto chromosomes, carrying genes with them [87].

Found Motif | Description
IPNS_1 | PS00185, Isopenicillin N synthase signature 1
ASP_PROTEASE | PS00141, Eukaryotic and viral aspartyl proteases active site
ZINC_PROTEASE | PS00142, Neutral zinc metallopeptidases, zinc-binding region signature
TONB_DEPENDENT_REC_1 | PS00430, TonB-dependent receptor proteins signature 1

Table 4.

Patterns identified from plant retrotransposons.

Aspartyl proteases are a family of enzymes that hydrolyze peptide bonds [88]. They are very diverse and can be found in humans, retroviruses, plants, and fungi. In retroviruses, they are usually encoded in the pol gene as part of a polypeptide [89]. The zinc protease uses zinc in its catalytic function to break down polyproteins. Retrotransposon polyproteins are very important elements of the replication mechanism, and these proteases hydrolyze the larger proteins into smaller functional polypeptides [90]. The Pol polyproteins and proteases are needed during retrotransposon replication, both to form the mRNA and to package it for transposition [91].
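A PROSITE pattern such as those in Table 4 is written in a compact syntax that maps closely onto a regular expression. The Python sketch below shows one possible way to scan a translated retrotransposon sequence for such a pattern. It is not the Motif Finder implementation used in this study. The function names, the example pattern (a generic cysteine-histidine spacing, not the text of any PROSITE entry), and the peptide are hypothetical and serve only as an illustration.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handled syntax: '-' separators, 'x' (any residue), [ABC] (allowed),
    {ABC} (forbidden), (n) and (n,m) repeats, '<' and '>' terminus anchors.
    """
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat suffix.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            token = "."                      # any amino acid
        elif core.startswith("[") and core.endswith("]"):
            token = core                     # same meaning in regex syntax
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"  # forbidden residues
        else:
            token = re.escape(core)          # a single literal residue
        if low and high:
            token += "{%s,%s}" % (low, high)
        elif low:
            token += "{%s}" % low
        parts.append(("^" if anchor_start else "") + token
                     + ("$" if anchor_end else ""))
    return "".join(parts)


def find_pattern(protein: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(protein)]


if __name__ == "__main__":
    # Hypothetical pattern and peptide, for illustration only.
    toy_pattern = "C-x(2)-C-x(4)-H-x(4)-C"
    toy_peptide = "MKCYNCGKEGHIARDCRA"
    print(prosite_to_regex(toy_pattern))
    print(find_pattern(toy_peptide, toy_pattern))
```

Because a pattern either matches or does not, a single substitution at a required position removes the hit entirely, which is one reason pattern matches are usually read together with the profile matches reported in Table 5.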

Table 5 contains the four PROSITE profiles identified in the retrotransposon sequences. The reverse transcriptase catalytic domain profile was detected in 25 different species, the integrase catalytic domain profile in four species, and the zinc finger CCHC-type and zinc finger SWIM-type profiles in one species each. Reverse transcriptase is a multidomain enzyme consisting of two domains: the catalytic domain and the RNase H binding domain. These two domains perform the three enzymatic actions of reverse transcriptase [92]. The catalytic domain carries out the DNA-dependent and RNA-dependent polymerase activities. The RNase H domain is responsible for the ribonuclease activity [93]. Together, these two reverse transcriptase domains enable the “copy” part of the retrotransposon replication mechanism.

Found Motif | Description
RT_POL | PS50878, Reverse transcriptase (RT) catalytic domain profile
INTEGRASE | PS50994, Integrase catalytic domain profile
ZF_CCHC | PS50158, Zinc finger CCHC-type profile
ZF_SWIM | PS50966, Zinc finger SWIM-type profile

Table 5.

Profiles identified from plant retrotransposons.
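Unlike a pattern, a profile assigns a weight to each residue at each position, so a diverged domain can still score above a threshold. The Python sketch below illustrates this idea with a gap-free position-specific scoring matrix. It is a deliberate simplification: real PROSITE profiles also model insertions and deletions, and the four-column matrix, its weights, the default penalty, and the threshold used here are invented for illustration and are not taken from any PROSITE entry.

```python
from typing import Dict, List, Tuple

# One dictionary per profile column: residue -> weight.
Profile = List[Dict[str, float]]

# Hypothetical toy profile; the weights are made up for this example.
TOY_PROFILE: Profile = [
    {"Y": 2.0, "F": 1.0},
    {"V": 1.0, "I": 1.0, "L": 1.0},
    {"D": 3.0},
    {"D": 3.0},
]


def score_window(window: str, profile: Profile, default: float = -1.0) -> float:
    """Sum the column weights of a window; unlisted residues get `default`."""
    return sum(column.get(residue, default)
               for residue, column in zip(window, profile))


def scan(sequence: str, profile: Profile, threshold: float) -> List[Tuple[int, float]]:
    """Return (0-based start, score) for every window scoring >= threshold."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(sequence[start:start + width], profile)
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Both the best-scoring window and a substituted variant pass the cutoff.
    print(scan("AAYVDDLL", TOY_PROFILE, threshold=8.0))
    print(scan("AAFIDDLL", TOY_PROFILE, threshold=8.0))
```

In this toy setting the substituted window still passes because its weights sum to the threshold, whereas a strict pattern would need every accepted variant written out explicitly.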

The integrase is also a multidomain enzyme (Table 5). Its structure consists of three domains integral to its function: an N-terminal zinc finger domain, a C-terminal DNA binding domain, and the integrase core domain between them [94]. These integrase domains are responsible for the “paste” part of retrotransposon replication, allowing retrotransposons to transpose themselves into other sites of their host genome [95]. The CCHC zinc finger is associated with retroviruses. It is found in the capsid protein and aids the virus in host infection [96]. The presence of this protein confirms the relationship between retroviruses and retrotransposons. Retrotransposons have developed from retroviruses and still retain proteins for the viral capsids and envelopes [97]. These proteins have been repurposed from aiding in viral infection to assisting in DNA and RNA binding [98].

The SWIM-type zinc finger was isolated from a retrotransposon sequence of Manihot esculenta (Table 5). The SWIM zinc finger is found in all major eukaryotic groups. It has a strong association with the plant MuDR family of transposases. These enzymes belong to the MuDR transposon, a member of the Mu family, one of the largest families of transposons in plants [99]. They are known mutagens, which is in line with the characterization of transposable elements as instigators of mutagenesis in their host genomes [100].

3.6 Phylogenetic analysis

Phylogenetic analysis uses characters such as nucleotide or amino acid sequences to construct a tree that shows the relationships among different taxa at the molecular level. This analysis can also investigate domain relationships within an individual taxon [101], and it has become an essential tool for comparing genetic data between different species and groups [102].

The evolutionary history of these retrotransposons was inferred using the Maximum Likelihood method and the Tamura-Nei model [103]. The initial tree and guide tree for the heuristic search were obtained by applying the Neighbor-Joining method to a matrix of pairwise distances estimated using the Tamura-Nei model. The codon positions included were 1st+2nd+3rd+noncoding translated proteins. The final dataset consisted of 892 positions. The analyses were carried out in the MEGA X program [104]. The neighbor-joining tree was tested with 1000 bootstrap replicates [105], and the resulting bootstrap values are displayed above the tree’s nodes. The cutoff value for the tree branches was set at 70% [106] to identify lineage clusters. The largest of the clusters with values above the cutoff is group “C,” which contained well-supported branches of retrotransposon lineages. All the plant sequences in this group were from bryophytes. The other well-supported groups were group “B” (M. grandiflora 1 and M. polymorpha 2), group “E” (A. sativa and A. sterilis), group “F” (S. cooperi and D. truncatula), group “G” (M. esculenta and F. virosa), and group “I” (N. tetragona and M. grandiflora 2) (Figure 2). The moderately supported groups (Figure 3) were group “A” (M. polymorpha 3 and M. notabilis), group “D” (V. speciosa 2 and B. papyrifera), and group “H” (P. patens 2 and L. lagopus 2) [107].
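The bootstrap test described above can be sketched in a few lines of Python: alignment columns are resampled with replacement, the grouping is re-evaluated on each replicate, and the percentage of replicates that recover the grouping is compared with the 70% cutoff. This is a minimal sketch, not the MEGA X procedure. To stay short it replaces tree building with a much simpler criterion (two sequences count as grouped when they are mutual nearest neighbors under the uncorrected p-distance, not the Tamura-Nei model), and the function names and sequence labels are hypothetical.

```python
import random
from typing import Dict


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def resample_columns(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Draw alignment columns with replacement (one bootstrap replicate)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def are_mutual_nearest(alignment: Dict[str, str], a: str, b: str) -> bool:
    """Simplified stand-in for 'a and b form a clade' (assumption, see lead-in)."""
    def nearest(name: str) -> str:
        others = [n for n in alignment if n != name]
        # Ties are broken by dictionary order, which is arbitrary.
        return min(others, key=lambda n: p_distance(alignment[name], alignment[n]))
    return nearest(a) == b and nearest(b) == a


def bootstrap_support(alignment: Dict[str, str], a: str, b: str,
                      replicates: int = 1000, seed: int = 0) -> float:
    """Percentage of bootstrap replicates in which a and b group together."""
    rng = random.Random(seed)
    hits = sum(are_mutual_nearest(resample_columns(alignment, rng), a, b)
               for _ in range(replicates))
    return 100.0 * hits / replicates


def passes_cutoff(support: float, cutoff: float = 70.0) -> bool:
    """Apply the 70% branch cutoff used to identify lineage clusters."""
    return support >= cutoff


if __name__ == "__main__":
    # Hypothetical toy alignment; real input would be the aligned sequences.
    toy = {"seq_a": "ACGTACGT", "seq_b": "ACGTACGT",
           "seq_c": "CATGCATG", "seq_d": "GTACGTAC"}
    support = bootstrap_support(toy, "seq_a", "seq_b")
    print(support, passes_cutoff(support))
```

A full analysis would rebuild the tree for every replicate and count how often each clade reappears, which is the role MEGA X plays in this study.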

Figure 2.

Well-supported bootstrap branches based on the phylogenetic analysis.

Figure 3.

Moderately supported bootstrap branches based on the phylogenetic analysis.

Figure 4 shows the circular ideogram of diverse retrotransposons across a wide range of orders and families of the kingdom Plantae. This ideogram was constructed to give a holistic and efficient visualization of the large amount of genomic information.

Figure 4.

Circular ideogram of retrotransposons across diverse plant genomes.

The “red” group on the upper right is a cluster of retrotransposons from gymnosperms, while the “blue” group has retrotransposons originating from angiosperms. The “green” group has two novel retrotransposons, Silava and Romani, that are distinct to gymnosperms. The “yellow” group comprises Gypsy family retrotransposons from angiosperms, except for those of M. polymorpha and P. massoniana, a liverwort and a gymnosperm, respectively. The “orange” group is the largest cluster and is composed of Gypsy family retrotransposons from the bryophytes. The “purple” group is a clade of two gymnosperm retrotransposons from the Gypsy and Copia families, the “brown” group is a clade of two gymnosperm Copia retrotransposons, and the “pacific blue” group is a clade of non-LTR retrotransposons from two eudicots. The “ruby” group is a cluster of Copia family retrotransposons, and the “Davidson orange” group comprises mostly Gypsy retrotransposons with some notable novel-type families (Cereba, N1, Osr30, and Silava). Osr30 is distinct to O. sativa, Cereba to cereal plants, and Silava to gymnosperms. The “pink” group is a cluster of angiosperm Gypsy retrotransposons, the “medium green” group is a cluster of giant fern Cassandra retrotransposons, and the “tyrian purple” group is a cluster of Poaceae family retrotransposons. The “lochmara blue” group is a cluster of Copia-like retrotransposons, and the “deep red” group is a cluster of angiosperm retrotransposons [108]. The “verdigris” group is a cluster of Gypsy retrotransposons that includes Silava retrotransposons. It was noted that Silava retrotransposons tend to cluster with Gypsy retrotransposons. The “saddle brown” group is a cluster of Copia retrotransposons with two novel Copia-like retrotransposons (RTE & Conagree). All black clusters formed the mixed groups.

Retrotransposons of the Gypsy family tend to cluster together (Figure 4). The Gypsy family is the largest group, forming a large cluster of bryophyte and eudicot sequences, with a few liverwort and gymnosperm sequences forming outgroups of these clades. Gypsy retrotransposons are very diversified and more widespread in plant genomes than Copia retrotransposons [109]. Retrotransposons of the Copia family tend to be grouped by the plant group they belong to. These retrotransposons are interspersed with novel families of retrotransposons that are Copia-like in structure. Copia-like retrotransposons are common in plant genomes and are identified by their reverse transcriptase, which is similar in structure to that of the Copia family retrotransposons [110]. Gymnosperm retrotransposons are grouped together regardless of family, and they are associated with monocot retrotransposons. This pattern could be the result of retrotransposal duplication events in these genomes [111]. Notably, retrotransposons are more active in the Poaceae family [112], leading to the genesis of more unique and novel retrotransposon families.

4. Conclusions

Retrotransposons are such a significant part of plant genomes that they warrant more studies to understand them better. Retrotransposons were conserved in nature, tended to cluster in different plant families and classes, and revealed significant genome relationships between different families within a plant division. Retrotransposons were characterized by certain motifs and domains that are useful in classifying them and in understanding their role in plant genomes. Plant retrotransposons exhibited much diversification while also conserving certain parts of their structures. Retrotransposons in plant genomes retained genes from other domains of life. Just as they harbor harmful genes, they can also keep useful genes that help their hosts survive adverse conditions. The PROSITE amino acid patterns and profiles showed that some of these plant retrotransposons contain viral, bacterial, fungal, and mammalian genes. The high specificity of retrotransposal reverse transcriptase could be used as an important tool in identifying retrotransposons. Moreover, phylogenetic analysis revealed the relationships of the retrotransposons and unveiled their diversification into several lineages. This study provided valuable information on the characteristics, patterns, profiles, diversity, and phylogenetic relationships of retrotransposons across a wide range of plant orders and families, information that is necessary for understanding the functions, complexity, and dynamics of plant genomes.


We would like to thank the faculty members of the Department of Biology, College of Science and Technology, Adventist University of the Philippines, and the reviewers for their valuable comments. We also thank the National Center for Biotechnology Information, Bethesda, MD, USA, for the DNA sequences. We are grateful to Sir Owen E. Pitakia, Dr. Edwin Balila, and Dr. Lorcelie Taclan for their indispensable counsel and support.


Conflict of interest

The authors declare no conflict of interest.


© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Aloysius Brown, Orlex B. Yllano, Leilani D. Arce, Ephraim A. Evangelista, Ferdinand A. Esplana, Lester Harris R. Catolico and Merbeth Christine L. Pedro (July 28th 2021). Characterization, Comparative, and Phylogenetic Analyses of Retrotransposons in Diverse Plant Genomes [Online First], IntechOpen, DOI: 10.5772/intechopen.99074. Available from:
