Aligned sequences with an alignment score of 90 to 100.
Retrotransposons are transposable elements that copy and paste themselves into a genome through an RNA intermediate, which reverse transcriptase converts back into DNA. Retrotransposons are ubiquitous in the genomes of eukaryotic organisms. This study analyzed the structures and determined the comparative distributions and relatedness of retrotransposons across diverse orders (34) and families (58) of kingdom Plantae. In silico analyses were conducted on 134 plant retrotransposon sequences using ClustalW, EMBOSS Transeq, Motif Finder, and MEGA X. The analysis showed significant genomic relationships between bryophytes and angiosperms (216), bryophytes and gymnosperms (75), pteridophytes and angiosperms (35), pteridophytes and gymnosperms (28), and gymnosperms and angiosperms (70). Thirteen homologous plant retrotransposons, 30 conserved domains and motifs (including reverse transcriptase, integrase, and gag domains), and nine significant phylogenetic lineages were identified. This study provides comprehensive information on the structures, motifs, domains, and phylogenetic relationships of retrotransposons across diverse orders and families of kingdom Plantae. The ubiquity of retrotransposons across diverse taxa makes them excellent molecular markers for better understanding the complexity and dynamics of plant genomes.
- transposable elements
- genetic polymorphism
- phylogenetic analysis
Retrotransposons can move within genomes due to their highly effective transposition mechanism. Because of this high level of transposition, their presence is a significant feature of the genomes of plants and other eukaryotic organisms. Since the discovery of transposable elements (TEs) by Barbara McClintock more than seven decades ago, studying the structures of retrotransposons has been challenging because of their repetitive structure, their diversity in form, their large number in a genome, and their ability to replicate so frequently. Even studying closely related genomes does not overcome this problem, since retrotransposons also tend to be highly species-specific, a trait that makes them difficult to classify. Research has shown that they are not merely transient components of a genome but are instrumental in genomic development and adaptation, influencing genomes in ways that range from how chromosomes are structured to helping activate certain genes under certain conditions. The interaction of retrotransposons with a host genome is not a simple one. Evidence shows that they have helped shape genomes over an extended period. In some cases, this has imparted important genetic traits to their host organisms. In others, they have been linked to mutagenesis and disease, prompting their hosts to develop regulatory safeguards to suppress and limit their activities.
Recent advances in sequencing technologies have come a long way in helping unravel the structure of plant genomes. Plant genomes are some of the most complex and diverse among known eukaryotic kingdoms  and vary widely in size across kingdom Plantae, with the smallest genomes sequenced so far being from green algae species  and the largest being
1.1 Classes and types of transposable elements
The complexity and diversity of transposable elements, coupled with the availability of recent genomic sequences in the genebanks, have generated various groupings of TEs. However, concerted efforts have been made to arrive at a generally accepted and unified nomenclature. The replication process employed by transposable elements is used to classify them into two large groups. Retrotransposons, or Class I transposable elements, use the enzyme reverse transcriptase to copy and paste themselves in the genome and are the most abundant type in plant genomes. DNA transposons, or Class II transposable elements, use other enzymes, including DNA polymerase and transposase, to copy and insert themselves into genomes. This copy-and-paste mechanism is responsible for the significant number of transposable elements in eukaryotic genomes.
Class I transposable elements, or retrotransposons, consist of the long terminal repeat (LTR) retrotransposons and the non-long terminal repeat (non-LTR) retrotransposons. Both LTR and non-LTR retrotransposons are further subdivided based on their dynamics in the genome. Autonomous retrotransposons can be independently mobile, while nonautonomous retrotransposons require the presence of other TEs for their movement. The LTR retrotransposons in eukaryotes include the Gypsy, Copia, BEL, DIRS, ERV1, ERV2, and ERV3 superfamilies. In contrast, the superfamilies of non-LTR retrotransposons include SINE1, SINE2, SINE3, LINEs, CR1, CRE, I, RTE, TX1, Jockey, Penelope, R2, R4, RandI, Rex1, L1, and NeSL [12, 13].
A less well-studied class of retrotransposons in plant genomes is the non-LTR retrotransposons. These are the LINEs (long interspersed nuclear elements) and the SINEs (short interspersed nuclear elements). They do not exhibit much activity in plant genomes, but they constitute around 33.5%, or about one third, of the human genome. Moreover, they contribute to new insertions in the human genome and have been linked to mutagenesis and human diseases.
LINEs are considered the oldest class of retrotransposons in plant genomes. Evidence suggests that they are highly regulated or inactive, since their transcription is rarely observed in plant genomes. In contrast, studies have shown that the ancient activity of SINEs helped shape the genomic diversification of some monocot species and the heterogeneity of many eukaryotic genomes, but apart from this, little is known so far of their activity in plant genomes. There is therefore a need to study and characterize the diverse retrotransposons and to understand how and to what extent they influence changes in a host genome.
1.2 Characterization of retrotransposons
The presence of transposable elements in an organism has many implications for its genomic activity. Depending on the region of the chromosome they are located on, they may affect which genes are expressed in the genome and the functions of these genes. Gypsy retrotransposons occupy widespread and more diverse positions on the chromosomes of plant genomes, while Copia retrotransposons tend to cluster in the proximal regions of the chromosomes they are located on. However, it is worth pointing out that LTR retrotransposons tend to group in different chromosomal regions regardless of their lineages. Research into plant genomic structures has yielded valuable insight into the characterization of retrotransposons because of their ubiquitous presence in plant genomes. Non-LTR retrotransposons are subclassified into LINEs and SINEs. The LTR retrotransposons are further classified into "superfamilies" based on their genetic sequences, namely, the Copia, Gypsy, Bel-Pao, retrovirus, and endogenous retrovirus superfamilies. Of these, the most widespread in plant genomes and the best studied are the Gypsy and Copia superfamilies. Gypsy retrotransposons are differentiated from Copia retrotransposons by the position of the integrase protein in their genetic sequence. In Gypsy retrotransposons, integrase is situated after the reverse transcriptase in the genetic sequence, whereas in Copia retrotransposons it is situated before the reverse transcriptase. Phylogenetic analyses and time of divergence are used to further divide these superfamilies into different lineages. The Copia superfamily comprises the TORK, Bianca, Ale, and Maximus lineages, and the Gypsy superfamily comprises the Attila, CRM, Del, and Galadriel lineages. LTR retrotransposons show such variety in number, position, and distribution in their host genome because of their ability to act autonomously and replicate themselves numerous times on chromosomes.
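The integrase-position rule that separates the two superfamilies can be sketched in a few lines. This is a minimal illustration, not the classification procedure used in this study: the function name `classify_ltr_superfamily`, the domain labels (`GAG`, `PR`, `RT`, `RH`, `INT`), and the example elements are all hypothetical, and real classification also relies on sequence similarity and phylogeny.

```python
def classify_ltr_superfamily(domain_order):
    """Classify an LTR retrotransposon from its 5'->3' domain order.

    domain_order: list of domain labels (hypothetical labels, e.g. "RT", "INT").
    Returns "Gypsy" if integrase follows reverse transcriptase, "Copia" if it
    precedes it, and "unclassified" if either domain is missing.
    """
    try:
        rt_index = domain_order.index("RT")
        int_index = domain_order.index("INT")
    except ValueError:
        # One of the two diagnostic domains was not annotated.
        return "unclassified"
    return "Gypsy" if int_index > rt_index else "Copia"


# Hypothetical elements used only to exercise the rule.
examples = {
    "element_a": ["GAG", "PR", "RT", "RH", "INT"],  # integrase after RT
    "element_b": ["GAG", "PR", "INT", "RT", "RH"],  # integrase before RT
    "element_c": ["GAG", "PR", "RT"],               # no integrase annotated
}

for name, order in examples.items():
    print(name, classify_ltr_superfamily(order))
```

Returning "unclassified" for partial elements is a deliberate choice here, since short sequence fragments often lack one of the two diagnostic domains.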
A key feature of LTR retrotransposons and the structure that gives them their name is the presence of two homologous structures called long terminal repeats at both ends of their genetic sequence. These DNA sequences can vary in size from a hundred bps to thousands of bps . These LTRs are non-coding regions that bracket the internal coding regions and are also a component of retroviral sequences . LTR retrotransposons vary widely in size and functional characteristics. In plants, they have been documented as short as four kbp in
1.3 Mechanism of action
Retrotransposons insert and reinsert themselves in a host genome through transcription followed by reverse transcription of the resulting RNA intermediate. This transcript is the template used to generate new copies of the retrotransposon. The reverse transcription of retrotransposons is a complex procedure. In LTR retrotransposons, the process is aided by the long terminal repeats at each end of the element, which act as start sites for replicating the internal region. Replication of this internal region proceeds in opposite directions to produce the two DNA strands. Immediately downstream of the left (5′) LTR, a tRNA binds to the primer binding site and primes the synthesis of one of the two DNA strands. Immediately upstream of the right (3′) LTR, a polypurine tract acts as the primer for the second of the two DNA strands.
The mRNA template is synthesized first in the replication of retrotransposons. This mRNA template is then translated into the proteins used in the process. The mRNA template has a short repeat sequence at each end, adjoined by a unique region (U5 at the 5′ end and U3 at the 3′ end). A tRNA acts as a primer and binds to a primer binding site on the mRNA. This initiates the production of minus (−) strand DNA, catalyzed by reverse transcriptase. The synthesized DNA extends through the U5 region to the 5′ end of the template and then pairs with the repeat sequence at the 3′ end of the genomic RNA. Once synthesis of this first DNA strand is complete, the enzyme RNase H degrades the genomic RNA template, leaving only fragments. These fragments then prime the synthesis of the second DNA strand. As with the first strand, reverse transcriptase synthesizes the second DNA strand, but it uses the first DNA strand as a template. At the end of this process, a linear double-stranded DNA is produced with an LTR (comprising the U3, repeat, and U5 regions) at each end. The enzyme integrase then inserts this new retrotransposon DNA into the host chromosomal DNA by using the 3′ OH of each strand to integrate at target sites a few base pairs apart in the genome.
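The net effect of reverse transcription on the element's layout can be shown with region labels alone. The sketch below is illustrative only: the function name `reverse_transcribe_layout` and the internal labels (`PBS`, `gag`, `pol`, `PPT`) are assumptions for the example, and it models only the start and end layouts, not the strand transfers in between. It assumes the standard arrangement in which the RNA runs R-U5 ... U3-R and each LTR in the DNA copy reads U3-R-U5.

```python
def reverse_transcribe_layout(rna_regions):
    """Return the region layout of the double-stranded DNA copy.

    rna_regions: region labels of the RNA template in 5'->3' order, assumed
    to start with ["R", "U5"] and end with ["U3", "R"].
    """
    if rna_regions[:2] != ["R", "U5"] or rna_regions[-2:] != ["U3", "R"]:
        raise ValueError("expected an RNA layout of R, U5, ..., U3, R")
    internal = rna_regions[2:-2]
    # Each end of the DNA carries a complete LTR: U3, R, U5.
    ltr = ["U3", "R", "U5"]
    return ltr + internal + ltr


# Hypothetical internal regions, for illustration only.
rna = ["R", "U5", "PBS", "gag", "pol", "PPT", "U3", "R"]
print("RNA:", "-".join(rna))
print("DNA:", "-".join(reverse_transcribe_layout(rna)))
```

The DNA layout is longer than the RNA layout because U3 is duplicated at the 5′ end and U5 at the 3′ end, which is what produces two identical LTRs.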
1.4 Role of retrotransposons
Retrotransposons are known to be major drivers of genomic diversity and homogeneity during the development of eukaryotic genomes. Presently, their activity in plant genomes is regulated by different mechanisms, but they are still capable of bursts of activity when reactivated by mutations, adjacent gene expression, or environmental factors . Grandbastien  has noted that all the retrotransposons that are known to be active in plant genomes are usually dormant during their host development but become active in response to environmental stressors. This could be linked to retrotransposons being proliferators of genomic diversity since their activation by stresses induces survival genes to turn on. The study by Hilbricht et al.  on
This study compared and characterized different retrotransposons, identified shared patterns among them, and determined their relationships across diverse plant taxa.
2. Materials and methods
To assemble the plant retrotransposon library, we collected genomic DNA sequences deposited in the National Center for Biotechnology Information (NCBI) nucleotide database. These were then sorted to include only sequences 300 to 800 base pairs in length. In total, 134 retrotransposon sequences were selected and analyzed in this study. Of these, 54 were from angiosperms, 46 from gymnosperms, 11 from pteridophytes, three from liverworts, and 20 from bryophytes. The sequences were downloaded in FASTA format and saved in a text document for further analyses. To study the characteristics of the plant retrotransposon sequences and identify homology, the multiple sequence alignment program ClustalW was used. The parameters of the ClustalW analysis were defined as follows: Pairwise Alignment was set to slow and accurate for DNA sequences only. The Gap Open Penalty was set to 15 and the Gap Extension Penalty to 6.66. The Weight Matrix used was the International Union of Biochemistry (IUB) matrix for DNA sequences. The same parameters were used for the multiple sequence alignment, with hydrophilic gaps included in the computation.
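The length filter described above can be sketched with the standard library alone. This is a minimal illustration rather than the procedure actually used to build the library: the helper names `parse_fasta` and `filter_by_length` and the toy records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def filter_by_length(records, min_len=300, max_len=800):
    """Keep only sequences whose length is within [min_len, max_len] bp."""
    return {
        name: seq for name, seq in records.items()
        if min_len <= len(seq) <= max_len
    }


# Toy records (hypothetical names) of 250, 300, and 801 bp.
toy_fasta = (
    ">too_short\n" + "A" * 250 + "\n"
    ">kept\n" + "ACGT" * 75 + "\n"
    ">too_long\n" + "G" * 801 + "\n"
)
selected = filter_by_length(parse_fasta(toy_fasta))
print(sorted(selected))
```

Both bounds are treated as inclusive here, which is one reading of "300 to 800 base pairs"; the source does not state how boundary lengths were handled.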
Motif analyses were performed on the plant retrotransposon sequences to identify motifs, protein domains, and conserved domains. The nucleotide sequences were translated into their corresponding amino acid (aa) sequences with the EMBOSS Transeq tool developed by the European Bioinformatics Institute. The algorithm was set to translate the nucleotide sequences in the three forward reading frames using the standard codon table. The translated aa sequences were then analyzed for protein domains, families, and functional sites using the PROSITE tool developed by the Swiss Institute of Bioinformatics and the MOTIF Finder program of the Kyoto University Bioinformatics Center. All three reading frames were analyzed to ensure that the proper frame would be used for motif identification. The aligned retrotransposon sequences were analyzed using MEGA X. The software was used to construct a maximum likelihood phylogenetic tree, with the Tamura-Nei model used to account for the substitution rate differences between nucleotides and the inequality of nucleotide frequencies. Nearest-Neighbor-Interchange was used as the heuristic method to improve the likelihood of the tree. The phylogenetic tree generated by MEGA X was then modified in the MEGA X Tree Topology Editor to produce a circular phylogenetic diagram for better data visualization.
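As an illustration of the translation step, the sketch below translates a nucleotide sequence in the three forward reading frames with the standard codon table. It is a simplified stand-in for what a tool such as Transeq does, not its implementation: the function name `translate_three_frames` is hypothetical, incomplete trailing codons are simply dropped, and any codon containing an ambiguous base is reported as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_three_frames(dna):
    """Translate a DNA sequence in forward frames 1, 2, and 3.

    Returns a dict {frame_number: protein_string}; '*' marks a stop codon
    and 'X' marks a codon that is not in the standard table.
    """
    dna = dna.upper().replace("U", "T")
    frames = {}
    for offset in range(3):
        protein = []
        for i in range(offset, len(dna) - 2, 3):
            protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
        frames[offset + 1] = "".join(protein)
    return frames


for frame, protein in translate_three_frames("ATGGCCTAA").items():
    print(frame, protein)
```

Scanning all three frames matters for fragments like those analyzed here, because a 300 to 800 bp sequence gives no guarantee that frame 1 is the coding frame.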
3. Results and discussion
3.1 Multiple sequence alignment
Figure 1 shows the alignment scores of the sequences produced by the multiple sequence alignment analysis performed in ClustalW. These scores represent the pairwise alignment between each pair of retrotransposon sequences. The cutoff alignment score was set at 50 percent identity between two aligned sequences.
In total, there were 870 pairwise alignments with an alignment score of 50 to 100 percent. Fifty-five percent (476) of the alignments had a percent identity in the range of 50 to 59. Thirty-two percent (281) had a percent identity in the range of 60 to 69. Seven percent (65) had a percent identity in the range of 70 to 79, 4% (35) had a percent identity in the range of 80 to 89, and about 1.5% (13) had a percent identity in the range of 90 to 100. Multiple sequence alignment scores of 40% and higher are considered significant, whereas alignments below 40% are considered too divergent. The cutoff for this multiple sequence analysis was set to 50% to include only highly significant alignments.
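The binning of pairwise scores into identity ranges, and the share of each range, can be reproduced with a short script. This is a sketch under stated assumptions: the function names `bin_scores` and `percent_shares` are hypothetical, scores are assumed to lie between 0 and 100, and the counts passed in at the end are the ones reported in this section.

```python
def bin_scores(scores, cutoff=50):
    """Count pairwise alignment scores per identity range, ignoring scores
    below the cutoff."""
    bins = {"50-59": 0, "60-69": 0, "70-79": 0, "80-89": 0, "90-100": 0}
    for score in scores:
        if score < cutoff:
            continue
        if score >= 90:
            label = "90-100"
        else:
            low = int(score // 10) * 10
            label = f"{low}-{low + 9}"
        bins[label] += 1
    return bins


def percent_shares(counts):
    """Convert per-range counts into percentages of the total."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


# Counts reported in this section (870 alignments in total).
reported = {"50-59": 476, "60-69": 281, "70-79": 65, "80-89": 35, "90-100": 13}
for label, share in percent_shares(reported).items():
    print(f"{label}: {reported[label]} alignments, {share:.1f}%")
```

Applying `bin_scores` to the raw ClustalW score table would regenerate the counts themselves; only the reported totals are available here, so the example starts from those.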
3.2 Identification of homologous sequences
Table 1 contains the aligned sequences with the highest alignment score. There is a diversity in the relationship of these sequences.
|Sequences Aligned||Aligned Score|
The results above confirm the highly conserved nature of retrotransposons. This is supported by the study of retrotransposons in mammals. Despite their enormous number and diversity, it has been noted that similar retrotransposons tend to cluster together in similar genomes of hosts belonging to the same order, family, or class. Specific types of retrotransposons belonging to the same family or lineage can be conserved across a particular kingdom or division. The presence of homologs can be inferred from these aligned sequences, considering their high percent identity and their distribution across different species. An alignment score of 90 or higher was used as the cutoff value for homolog identification.
3.3 Conservation of retrotransposons
Table 2 summarizes the retrotransposon sequence pairs with an alignment score of 80 to 89. This is the pairwise alignment score between the two sequences in each pair.
|Sequences Aligned||Aligned Score||Sequences Aligned||Aligned Score|
Aligned sequence pairs in the same genus were:
3.4 Motifs and domains
Molecular characterization is important in understanding the nature of any genetic element and its insertion origin in a genome. Molecular characterization provides a detailed description of the structure of a genetic sequence, changes that it induces in a genome, and how it affects genetic expression . Characterization is an important feature in the study of retrotransposons. It is also used for classifying retrotransposons , uncovering their associations in a genome [65, 66], and discovering new types of retrotransposons (Table 3) .
|Reverse transcriptase (RNA-dependent DNA polymerase)||Simian taste bud-specific gene product family|
|Tsi6||BAFF-R, TALL-1 binding|
|RNase H-like domain found in reverse transcriptase||Zinc knuckle|
|Tc5 transposase DNA-binding domain||GAG-polyprotein viral zinc-finger|
|Peptidase propeptide and YPEB domain||Mis6|
|Integrase zinc-binding domain||Protein prenyltransferase alpha subunit repeat|
|Integrase core domain||Chromatin remodeling factor Mit1 C-terminal Zn finger 2|
|H2C2 zinc finger||5′-3′ exonuclease, N-terminal resolvase-like domain|
|gag-polypeptide of LTR copia-type||Retrotransposon gag protein|
|Aspartyl protease||C2H2 zinc-finger|
|gag-polyprotein putative aspartyl protease||GAG-pre-integrase domain|
|Retroviral aspartyl protease||Eukaryotic translation initiation factor 3 subunit G|
|Domain of unknown function||3′ exoribonuclease family, domain 2|
|Putative peptidase (DUF1758)||HicA toxin of bacterial toxin-antitoxin|
|Fimbrial assembly protein (PilN)||BRK domain|
The identification of the reverse transcriptase motif in these retrotransposon sequences is significant because reverse transcriptase is not only integral to the replication process of retrotransposons but is also one of the most significant parts of their structure. The reverse transcriptase type identified in these sequences is found only in LTR retrotransposons and retroviruses. The presence of this reverse transcriptase type usually indicates that the sequence is a retrotransposon mobile element or a retrovirus. Because of this high specificity, reverse transcriptase gene identification could be used to identify retrotransposon sequences. Reverse transcriptases are known to be multidomain enzymes, with notable domains being the catalytic domain and the RNase H domain. The Tc5 transposase DNA-binding domain is a structural motif found in many proteins that regulate gene expression. The RNase H-like domain found in these retrotransposon sequences belongs to a reverse transcriptase subfamily that shares sequence similarity with reverse transcriptases from endogenous retroviruses of the zebrafish and from the Moloney murine leukemia virus [69, 70]. This finding strengthens the case for the viral origins of retrotransposons in eukaryotes.
The presence of the zinc-binding domain indicates the presence of integrase, since it is one of the domains of the integrase enzyme. Integrase allows retroviruses and retroelements to insert their DNA into a host genome. The integrase core domain that was also detected in these sequences is one of the three known domains of the integrase enzyme. It is the catalytic domain that catalyzes the transfer of the retroviral or retrotransposon DNA made by reverse transcriptase to the site in the genome where it will be inserted. The GAG-pre-integrase domain lies upstream of the integrase region in retroviral polyproteins. It is usually connected to elements that assist in retroviral insertion.
The Copia family of retrotransposons is a large retrotransposon family active in the genomes of plants. It is classified under the long terminal repeats retrotransposons along with the Gypsy family . The GAG Polypeptide of the LTR-Copia type domain is highly conserved and found only in Copia retrotransposons . This domain was identified in seven species:
Some domains were identified that are not generally associated with retrotransposons. The HicA toxin functions as an mRNA interferase in bacterial and archaeal species, Tsi6 is a bacterial immunity protein, and the fimbrial assembly protein functions in the production of the bacterial fimbriae used for cellular attachment. The simian taste bud-specific gene is found in primates, and mutations of this gene have been linked to follicular lymphomas. The Mis6 protein is integral to chromosome segregation during mitosis, and the protein prenyltransferase alpha subunit repeat functions in protein prenylation. The eukaryotic translation initiation factor 3 subunit G takes part in initiating protein synthesis, and BAFF-R is a polypeptide that binds TALL-1, a tumor necrosis factor family ligand that initiates inflammation in humans. Zinc finger proteins are a large family of proteins noted for their role as transcription factors and their ability to bind Zn ions. Several of these protein types were identified in the plant retrotransposon sequences, including the H2C2 zinc finger, zinc knuckle, GAG-polyprotein viral zinc-finger, chromatin remodeling factor Mit1 C-terminal Zn finger 2, and C2H2 zinc-finger. Recent studies have revealed that zinc finger proteins are highly involved in regulating plant responses to abiotic stressors in their environment. The peptidase propeptide and YPEB domain, the putative peptidase (DUF1758), the 5′-3′ exonuclease N-terminal resolvase-like domain, and the BRK domain are all hypothetical proteins whose activity is presently little known.
3.5 Patterns and profiles
The PROSITE database has an extensive collection of protein families, subfamilies, domains, and motifs managed by the Swiss Institute of Bioinformatics . The database is organized into unique protein profiles and patterns to identify functional sites, domains, and protein families .
Table 4 contains the PROSITE patterns of four motifs found in the PROSITE database. IPNS_1 was found in
|IPNS_1||PS00185, Isopenicillin N synthase signature 1|
|ASP_PROTEASE||PS00141, Eukaryotic and viral aspartyl proteases active site|
|ZINC_PROTEASE||PS00142, Neutral zinc metallopeptidases, zinc-binding region signature|
|TONB_DEPENDENT_REC_1||PS00430, TonB-dependent receptor proteins signature 1|
Aspartyl proteases are a family of enzymes that hydrolyze peptide bonds. They are very diverse and can be found in species including humans, retroviruses, plants, and fungi. In retroviruses, they are usually encoded in the pol gene as part of a polyprotein. The zinc protease uses zinc in its catalytic function to break down polyproteins. The polyproteins of retrotransposons are very important elements of their replication mechanism, and these proteases hydrolyze the larger proteins into smaller functional polypeptides. The Pol polyproteins and proteases are needed in retrotransposon replication for the formation and packaging of mRNA during transposition.
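PROSITE patterns such as those in Table 4 are written in a compact syntax that maps closely onto regular expressions. The sketch below converts a pattern into a Python regex and scans a protein sequence with it. It is an illustration only: the function name `prosite_to_regex` is hypothetical, it covers only the basic syntax elements (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors), and the demonstration pattern is a simplified, made-up pattern loosely modeled on the conserved Asp-Thr/Ser-Gly motif of aspartyl proteases. It is not the actual PS00141 pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    prefix = "^" if pattern.startswith("<") else ""
    suffix = "$" if pattern.endswith(">") else ""
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        # Separate the residue specification from an optional (n) or (n,m).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            token = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            token = core                      # literal residue or [allowed set]
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return prefix + "".join(parts) + suffix


# Simplified, hypothetical pattern (NOT the real PS00141 entry).
toy_pattern = "D-[ST]-G-x(2)-{P}"
toy_regex = prosite_to_regex(toy_pattern)
hit = re.search(toy_regex, "MKLDTGASHA")
print(toy_regex, hit.group(), hit.start())
```

For real analyses, the pattern strings should be taken from the PROSITE entries themselves, and a hit should be treated as a candidate site, since short patterns can match by chance.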
Table 5 contains the four PROSITE profiles identified in the retrotransposon sequences. The reverse transcriptase catalytic domain profile was detected in 25 different species, the integrase catalytic domain profile in four species, and the zinc finger CCHC-type and zinc finger SWIM-type profiles in one species each. Reverse transcriptase is a multidomain enzyme consisting of two domains: the catalytic domain and the RNase H domain. These two domains perform the three enzymatic actions of reverse transcriptase. The catalytic domain carries out the polymerase activities, acting as both a DNA-dependent and an RNA-dependent DNA polymerase. The RNase H domain is responsible for the ribonuclease activity. Together, these two reverse transcriptase domains enable the "copy" part of the retrotransposon replication mechanism.
|RT_POL||PS50878, Reverse transcriptase (RT) catalytic domain profile|
|INTEGRASE||PS50994, Integrase catalytic domain profile|
|ZF_CCHC||PS50158, Zinc finger CCHC-type profile|
|ZF_SWIM||PS50966, Zinc finger SWIM-type profile|
Integrase is also a multidomain enzyme (Table 5). Its structure consists of three domains integral to its function: an N-terminal zinc finger domain, a C-terminal DNA-binding domain, and the integrase core domain between them. These integrase domains are responsible for the "paste" part of retrotransposon replication, allowing retrotransposons to transpose themselves into other sites of their host genome. The CCHC zinc finger is associated with retroviruses. It is found in the capsid protein and aids the virus in host infection. The presence of this protein supports the relationship between retroviruses and retrotransposons. Retrotransposons have developed from retroviruses and still retain proteins for the viral capsids and envelopes. These proteins have been repurposed from aiding in viral infection to assisting in DNA and RNA binding.
The SWIM-type zinc finger was isolated from a retrotransposon sequence of
3.6 Phylogenetic analysis
Phylogenetic analysis uses characters such as nucleotide or amino acid sequences to construct a tree that shows the relationships among different taxa at the molecular level. This analysis can also investigate domain relationships within an individual taxon, and it has become an essential tool for comparing genetic data between different species and groups.
The evolutionary history of these retrotransposons was inferred using the Maximum Likelihood method and the Tamura-Nei model. The initial tree for the heuristic search was obtained by applying the Neighbor-Joining method to a matrix of pairwise distances estimated using the Tamura-Nei model. The codon positions included were 1st+2nd+3rd+noncoding. The final dataset consisted of 892 positions. The MEGA X program was used for the relationship analyses. The neighbor-joining tree algorithm was tested with 1000 bootstrap replicates, and the resulting bootstrap values are displayed above the tree’s nodes. The cutoff value for the tree branches was set at 70% to identify lineage clusters. The largest of the clusters with values above the cutoff is group “C,” which contained well-supported branches of retrotransposon lineages. All the plant sequences in this group were from bryophytes. Well-supported groups were group “B” (
Figure 4 shows the circular ideogram of diverse retrotransposons across a wide range of orders and families of kingdom Plantae. The ideogram was constructed to give a holistic and efficient view of the large volume of genomic information.
The “red” group on the upper right was represented by a cluster of retrotransposons from gymnosperms, while the “blue” group had retrotransposons originating from angiosperm. The “green” group had two novel retrotransposons, namely, Silava and Romani, distinct for gymnosperms. The “yellow” group comprises Gypsy family retrotransposons from angiosperms except for M. polymorpha and
Retrotransposons of the Gypsy family tend to cluster together (Figure 4). The Gypsy family is the largest group, forming a large cluster of bryophyte and eudicot sequences, with a few liverwort and gymnosperm sequences forming outgroups of these clades. Gypsy retrotransposons are highly diversified and more widespread in plant genomes than Copia retrotransposons. Retrotransposons of the Copia family tend to be grouped according to the plant group they belong to. These retrotransposons are interspersed with novel families of retrotransposons that are Copia-like in structure. Copia-like retrotransposons are common in plant genomes and are identified by their reverse transcriptase, which is similar in structure to that of the Copia family retrotransposons. Gymnosperm retrotransposons are grouped together regardless of family, and they are associated with monocot retrotransposons. This attribute could be the result of retrotransposon duplication events in these genomes. Notably, retrotransposons are more active in the Poaceae family, leading to the genesis of more unique and novel retrotransposon families.
Retrotransposons are such a significant part of plant genomes that they warrant further study. In this study, retrotransposons were conserved in nature, tended to cluster in different plant families and classes, and revealed significant genome relationships between different families within a plant division. They were characterized by certain motifs and domains that are useful in classifying them and in understanding their role in plant genomes. Plant retrotransposons exhibited much diversification while also retaining conserved parts of their structures. Retrotransposons in plant genomes retained genes from other domains of life, including potentially harmful ones, but they can also retain useful genes that help their hosts survive adverse conditions. The PROSITE amino acid patterns and profiles showed that some of these plant retrotransposons contain viral, bacterial, fungal, and mammalian genes. The high specificity of retrotransposon reverse transcriptase could make it an important tool for identifying retrotransposons. Moreover, phylogenetic analysis revealed the relationships of the retrotransposons and their diversification into several lineages. This study provided valuable information on the characteristics, patterns, profiles, diversity, and phylogenetic relationships of retrotransposons across a wide range of plant orders and families, information that is necessary for understanding the functions, complexity, and dynamics of plant genomes.
We would like to thank the faculty members of the Department of Biology, College of Science and Technology, Adventist University of the Philippines, and the reviewers for their valuable comments. We thank the National Center for Biotechnology Information, Bethesda, MD, USA, for the DNA sequences. We are grateful to Sir Owen E. Pitakia, Dr. Edwin Balila, and Dr. Lorcelie Taclan for their indispensable counsel and support.
Conflict of interest
The authors declare no conflict of interest.