The complex genome of Trypanosoma cruzi reflects its complex life cycle. These parasites are able to invade almost any kind of cell, to circulate freely in blood or the extracellular matrix, to pass through the digestive tract of their insect vector, and to survive after being eliminated in feces. This stressful lifestyle requires fine regulation of gene expression, which in turn is reflected in the organization of the genome. Although the focus of this chapter is the nuclear genome (hereinafter called generically “genome”), it is worth mentioning that trypanosomatids have another genome, contained in their single mitochondrion, called kinetoplast DNA. It exhibits unique architectural and functional features: it consists of a dense network of two types of circular DNA molecules called maxicircles and minicircles. Maxicircles, several kb in length, are equivalent to the regular mtDNA of other eukaryotes, whereas minicircles are much shorter (seldom longer than 2 kb) and encode guide RNAs (gRNAs). These are short RNA molecules responsible for guiding RNA editing, a process of posttranscriptional modification that consists of the addition and deletion of uridines. Although editing is not exclusive to trypanosomatids, only in this group does it involve massive changes in several mitochondrially encoded genes.
Trypanosomes have peculiarities in transcription and genome organization that differentiate them from the majority of eukaryotes. Protein-coding genes are organized in clusters located on the same DNA strand, separated by relatively short intergenic regions, and, with a few exceptions, do not contain introns. Clusters are transcribed as long nuclear polycistronic units, and maturation involves 3′ polyadenylation, characteristic of eukaryotes, and trans-splicing, a peculiar mechanism of mRNA maturation. Trans-splicing is the process by which two RNAs encoded in different genome locations (in trans) react to form a single transcript, in which the 5′ moiety contains the spliced leader sequence (~40 nt) and the rest contains the transcribed gene [2, 3]. The spliced leader (SL) is transcribed from a tandem array as a precursor of ~140 nt; its 3′ end is removed and the SL is joined to an AG splice-acceptor site on a pre-mRNA molecule through a molecular mechanism that resembles cis-splicing [4, 5, 6]. The AG splice acceptor is usually preceded by polypyrimidine-rich motifs. Since the SL RNA is the target of capping, trans-splicing is responsible for the addition of the 7-methylguanosine cap-like structure (cap4) to mRNAs. It was described decades ago that this process is coupled, co-transcriptionally, to the polyadenylation of the 3′ end of the upstream gene. As a consequence, a molecule of mature mRNA (capped, trans-spliced and polyadenylated) is released from the polycistron and exported to the cytoplasm, where it can be translated. Unlike in other organisms where trans-splicing also occurs, in trypanosomatids it affects almost all genes. Therefore, in trypanosomes the 5′ UTR is the sequence segment located between the SL and the start codon, whereas the 3′ UTR is defined in the same way as in other eukaryotes.
With the exception of tandemly repeated genes, polycistronic units do not contain functionally related genes, and individual genes from the same transcription unit can show markedly different expression patterns along the life cycle [1, 3]. Gene expression in trypanosomes is regulated mainly at the posttranscriptional level, and numerous studies have shown the relevance of 3′ UTRs in regulation, through effects on mRNA stability or translation, and hence in differential expression [3, 8]. Different elements in the 3′ UTRs, together with the presence of a high number of RNA-binding proteins, could explain differential expression at least in part [9, 10, 11], although the exact mechanisms that confer gene specificity are still unknown.
An important issue that is still unclear is whether T. cruzi constitutes a single species or a complex of species. Initially, two groups of T. cruzi (I and II) were described based on biological and biochemical criteria as well as molecular techniques. The first study using molecular phylogeny (sequences of coding genes) clearly showed that at least three major lineages (A, B and C) were present in this parasite, and that the distances between these groups are equivalent to the distance between different species of Leishmania. Currently, six groups or discrete typing units (DTUs), named TcI to TcVI, have been proposed, and T. cruzi isolates from bats have been included as a seventh DTU [15, 16]; TcV and TcVI are hybrid lineages derived from haplotypes TcII and TcIII. However, the high biological and genetic diversity of T. cruzi strains, even at the intra-DTU level, indicates that DTUs constitute a useful working definition but not a definitive classification. The new era of genomic studies based on next generation sequencing (NGS) is providing new insights into the above-mentioned unsolved questions.
2. Genome organization
In T. cruzi, mitosis occurs without complete disruption of the nuclear envelope. In addition, although nucleosomes are present, chromatin does not condense into visible chromosomes, so they cannot be observed by microscopy. This feature has made classic cytogenetic studies unsuitable for these parasites. Instead, the T. cruzi karyotype has been determined by molecular biology techniques, mainly pulsed field gel electrophoresis (PFGE) in combination with Southern blot [17, 18, 19]. Early studies revealed complex chromosomal patterns, evidenced by different PFGE profiles among strains, and allowed the inference that T. cruzi was at minimum diploid. The size of chromosomal bands ranges from 0.45 to 4 Mb, without minichromosomes, and the number of chromosomes was estimated mainly through probes used as genetic markers. Depending on the probes and PFGE conditions, estimates ranged from 19 to 40 chromosomes per haploid genome, showing that T. cruzi is mainly diploid, although the sizes of homologous chromosomes can differ significantly [17, 18, 19, 21, 22, 23].
A milestone was achieved in 2005, when the draft genomes of L. major, T. brucei and T. cruzi, together referred to as the “TriTryps”, were published simultaneously [24, 25, 26]. This opened a new era in biological research on these parasites. A distinctive feature of T. cruzi was the already known highly repetitive nature of its genome (50%): 5–10% of the genome is composed of the 195 bp satellite, and the rest of the repetitive DNA is composed of multigene families, tandem repeats and retrotransposable elements. This feature gave rise to a highly fragmented assembly, so that neither chromosome number and structure nor even large contigs could be obtained. Attempts to recover full-length chromosome sequences used a combined strategy based on synteny maps with T. brucei chromosomes and BAC-end sequencing. By this means, 41 virtual chromosomes were obtained for the hybrid CL-Brener strain. Although this strategy represented a substantial improvement over previous versions of the genome, assembly fragmentation remained a limitation for analyses that require high precision. A recent milestone in the area was the first publication of long-read sequencing of two T. cruzi genomes (Dm28c and TCC strains), which circumvented the high fragmentation imposed by the Sanger method as well as by short-read NGS methods. Using this approach, also described for the Bug strain, contigs of more than 1 Mb were obtained, probably covering whole chromosomes, but fragmentation still persists in some regions of the genomes. The exact number of chromosomes and their organization will ultimately be established through the combination of long-read sequencing methods, optical mapping techniques and polymer-based modeling, a field that has undergone a dramatic acceleration in the last decade.
Although PFGE and fluorescence cytophotometry were useful methods to depict the complex variability of T. cruzi karyotypes, it was not until the advent of next generation sequencing (NGS) technologies that ploidy, or chromosomal copy number variation (CCNV), could be studied in more detail. Aneuploidy, the gain or loss of chromosomal copies, is of particular importance since it gives clues about the relevance of genome plasticity in the context of parasite fitness. This phenomenon has been studied in detail in Leishmania spp., whose “mosaic” aneuploidies (ploidy variations among isolates of a strain and even among individual cells of a population) have been related to drug resistance, regulation of gene expression and host adaptation [31, 32, 33]. In contrast, PFGE, fluorescence cytophotometry and high-throughput sequencing data analyses agree on the ploidy stability of T. brucei and its subspecies T. b. brucei, T. b. gambiense and T. b. rhodesiense [34, 35, 36]. Remarkably, a triploid field isolate of T. congolense has been reported, suggesting that species of the Salivarian evolutionary lineage, such as T. brucei and T. congolense, can sustain euploidies but not massive aneuploidies.
In T. cruzi, CCNV analysis was extremely difficult to implement because it depends heavily on high-quality, chromosome-level reference assemblies. In spite of this limitation, some analyses were carried out using the CL-Brener genome as reference. Given the poorly assembled reference genome available at that time and the repetitive nature of the T. cruzi genome, only reads with high mapping quality were used in CCNV estimations. The methodology used by the authors was single-copy gene ploidy estimation (SCoPE), in which chromosomal somy is estimated from the ratio between the mean coverage of all single-copy genes (unique genomic sequences) in a given chromosome and the genome coverage. After including several T. cruzi strains from different DTUs, the authors proposed that, as observed in Leishmania, the aneuploidy pattern varies among and within T. cruzi lineages. In addition, as observed with PFGE, CCNV is frequent among T. cruzi strains, including those within the same DTU. The authors proposed that TcI appears to be more stable whereas TcII shows large differences between strains, suggesting that this mechanism is widely used by the parasite to expand groups of genes. Nevertheless, unlike in L. donovani, CCNV in T. cruzi seems to be stable within a parasite population, at least for the TcII Y strain and its derived clones.
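The coverage ratio at the core of SCoPE can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the toy coverage values and the scaling of the ratio by a baseline ploidy of 2 (so that a ratio of 1.0 reads as disomic) are assumptions made for illustration only.

```python
from statistics import mean


def estimate_somy(single_copy_cov, genome_cov, baseline_ploidy=2):
    """Estimate somy per chromosome from single-copy gene coverage.

    single_copy_cov: dict mapping chromosome name -> list of mean read
        depths, one value per single-copy gene on that chromosome.
    genome_cov: genome-wide coverage used as the denominator.
    baseline_ploidy: assumed somy of a chromosome whose ratio is 1.0
        (an illustrative assumption, not taken from the text).
    """
    if genome_cov <= 0:
        raise ValueError("genome coverage must be positive")
    somy = {}
    for chrom, depths in single_copy_cov.items():
        if not depths:
            continue  # no single-copy genes: no estimate for this chromosome
        ratio = mean(depths) / genome_cov
        somy[chrom] = baseline_ploidy * ratio
    return somy


# Toy data (hypothetical depths, not real measurements)
coverage = {
    "chrA": [48.0, 52.0, 50.0],  # ratio 1.0
    "chrB": [74.0, 76.0, 75.0],  # ratio 1.5
    "chrC": [],                  # skipped: nothing to average
}
somy = estimate_somy(coverage, genome_cov=50.0)
```

In a real analysis the depths would come from reads filtered by mapping quality, as described above, and the estimates would be noisy rather than exact multiples.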
2.3 Genome size
The genome size of T. cruzi has been estimated by different methodologies such as flow cytometry, renaturation kinetic analysis, microfluorometry, chemical analysis, molecular karyotyping and genome sequencing. All approaches agree that the T. cruzi genome size is variable. Polymorphism has been shown between DTUs, between strains within the same DTU, and even between isolates from the same strain [40, 41, 42, 43, 44, 45, 46, 47]. A broad genome size analysis including more than fifty strains from DTUs TcI to TcVI found that: (i) the maximum difference observed between strains was 47.5%; (ii) TcI had the smallest genome; (iii) TcV and TcVI were the least variable; and (iv) the mean genome sizes of the parental genomes were 88.4 Mb for TcI, 106.5 Mb for TcII and 119.2 Mb for TcIII. Similar results on the reduced size of TcI, with few exceptions, were observed later.
As mentioned before, genome size estimation by bioinformatic analysis of NGS data is hampered by the massive presence of repetitive regions, which account for up to 50% of the genome [25, 48]. Repeats cause assembly fragmentation and the collapse of genes and repetitive sequences, which leads to copy number underestimation and makes correct genome size estimation a challenge. In fact, as shown in Table 1, assembly sizes are far below the estimates obtained by DNA measurement methods. Only genomes sequenced with third-generation technologies appear to yield more accurate figures [28, 29].
| Strain | DTU | Size (Mb) | Contigs | N50 (bp) | L50 | GC% | Genes | Proteins | Sequencing platform | References |
|---|---|---|---|---|---|---|---|---|---|---|
| Dm28c | TcI | 53.3 | 636 | 317638 | 47 | 51.6 | 18759 | 15319 | PacBio + Illumina | |
| TCC | TcVI (hybrid) | 87.1 | 1236 | 264196 | 92 | 51.7 | 29109 | 24191 | PacBio + Illumina | |
| CL Brener | TcVI (hybrid) | 89.9 | 29495 | 88624 | 212 | 51.7 | 23696 | 19607 | Sanger | |
| Sylvio X10/1 | TcI | 38.6 | 27019 | 2307 | 2599 | 51.2 | 10861 | 10847 | 454 | [53, 84] |
| Trypanosoma cruzi marinkellei | | | | | | | | | | |
| B7 | -- | 34.2 | 23154 | 2846 | 2511 | 50.9 | 10117 | 10104 | 454 + Illumina | |

Table 1. Genomes of Trypanosoma cruzi.
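The N50 and L50 columns of Table 1 summarize assembly contiguity: N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the assembly size, and L50 is the number of contigs needed to get there. The following minimal sketch shows the calculation; the function name and the toy contig lengths are hypothetical and unrelated to the assemblies in the table.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50 contig; 'count' contigs were needed (L50)
            return length, count


# Toy assembly: total 300 bp, so half of the assembly is 150 bp
toy_contigs = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(toy_contigs)
```

A higher N50 together with a lower L50, as seen for the long-read assemblies, indicates a less fragmented assembly.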
3. Genome architecture and composition
The publication of the first T. cruzi genome in 2005 was a cornerstone in the study of its genome complexity. Although the sequenced strain, CL-Brener, turned out to be a hybrid, which made the analyses more arduous, it was corroborated at that time that more than 50% of the T. cruzi genome corresponds to repetitive sequences, mainly retrotransposons, multigene families and tandem repeats. The work also led to the discovery of a new gene family, the mucin-associated surface proteins (MASP). Around 12,500 genes could be identified, but the assembly was fragmented into more than 5400 scaffolds (ordered contigs, usually joined by unknown sequence filled with “N”), and the complete sequence of the genome was not obtained; the total genome size was about 67 Mb (half of it corresponding to each haplotype). Later on, based on the scaffolds already defined, BAC-end sequencing and synteny maps with T. brucei, it was possible to recover full-length pseudo-chromosomes, although thousands of sequences remained as “unassigned contigs.” Since these initial publications, several T. cruzi genomes have been sequenced by NGS, although massive sequencing could not improve the low resolution in complex and highly fragmented regions (Table 1 and references therein).
The advent of long-read sequencing technologies helped to tackle part of the assembly fragmentation issue and to better understand T. cruzi genome complexity. In 2018 the genomes of two T. cruzi strains (Dm28c and TCC, belonging to TcI and TcVI respectively) were sequenced using PacBio technology, showing substantial improvements: the assemblies of Dm28c and TCC were 53.2 and 86.7 Mb, distributed in 599 and 1142 contigs respectively, which implied a large reduction in fragmentation (see N50 values, Table 1). These assemblies achieved completeness: in the case of Dm28c the whole haploid genome was recovered, totaling 53.3 Mb. This size is consistent with the most precise estimates made with fluorescent nucleic acid dyes. For the hybrid strain TCC, composed of two relatively divergent parental lineages, the assembly would be expected to recover the diploid size that includes both parental haplotypes, i.e., 106–122 Mb [46, 47]; the 86.7 Mb actually obtained indicates that haplotypes cannot be separated in regions of high identity. The ability to separate haplotypes opens new possibilities for the study of the evolutionary processes that occurred in T. cruzi and can provide insights into how hybrids were generated and evolved. Moreover, recombination events can be identified and studied. The hybrid strain Bug 2148 (TcV) was recently sequenced with long reads and assembled into 934 contigs, also resolving fragmentation to a large degree; although the expected genome size is 106–135 Mb [46, 47], the total assembly size is 55.2 Mb, and it is striking that there is no evidence of the haplotype separation that would be expected for a hybrid strain.
Even with this new technology, the assemblies still show some fragmentation, mainly due to the size of the tandem repeats. In particular, the well-characterized 195 bp satellite, which can form clusters of 50 kb, is a major contributor to assembly fragmentation and prevents its complete resolution [50, 51, 52]. In fact, these genomes contain several contigs entirely composed of this repeat, which together encompass more than 5% of the genome (see below).
3.1 Genome compartments and gene composition
Since genomic annotation is arduous, especially in T. cruzi, and annotation is often beyond the goal of a sequencing project, it has not been performed for all genomes. For the T. cruzi genome projects that include annotation (see Table 1), quite similar numbers of coding genes per haplotype were determined, from a minimum of ~10,800 for Sylvio to a maximum of ~15,300 for Dm28c. These genes can be divided into two large groups: well-conserved core genes, and genes coding for the multigene surface families, several of which are unique to T. cruzi (see below). In fact, the improvements in the assemblies allowed us to determine that the T. cruzi genome is composed of two compartments. These compartments, called “core” and “disruptive”, differ in gene content and nucleotide composition. The “core compartment” is composed of conserved genes and conserved hypothetical genes, has a lower GC content (48%) and exhibits synteny conservation with T. brucei and L. major, whereas the “disruptive compartment” is mainly composed of the surface multigene families trans-sialidase, MASP and mucins, and exhibits a higher GC content (53%).
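The difference in nucleotide composition between compartments is expressed as GC content. As a minimal sketch of how that percentage could be computed for any sequence assigned to a compartment, the function below counts G and C over all unambiguous bases; the function name and the toy sequences are hypothetical, and the assignment of contigs to compartments is not addressed here.

```python
def gc_content(sequence):
    """Return the GC percentage of a DNA sequence.

    Ambiguous symbols such as N are ignored, so that gaps in an
    assembly do not distort the estimate.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / (gc + at)


# Toy sequences (hypothetical, far shorter than real contigs)
example_a = gc_content("ATGCGCNNAT")  # 4 GC out of 8 unambiguous bases
example_b = gc_content("ggcc")        # lower case is accepted
```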
3.2 Gene organization
As mentioned, genes in trypanosomatids are organized into non-overlapping clusters of genes with unrelated predicted functions, located on the same DNA strand. Genes are transcribed as polycistrons and subsequently trans-spliced and polyadenylated. In T. cruzi, gene clusters range from ~30 to 500 kb and are separated by divergent or convergent strand-switch regions (SSRs). Although no shared consensus motifs or patterns have been found among them, SSRs are functionally active: transcription initiation and termination take place there [2, 55, 56], and they are also involved in the origin of DNA replication and in centromeric function [58, 59]. SSRs differ in composition from the rest of the genome and show higher intrinsic curvature [60, 61], properties associated in turn with transcriptional regulation. Moreover, SSRs from the disruptive compartment are longer than those from the core compartment (mean length ~4.5 kb and ~1.5 kb respectively).
4. Trypanosoma cruzi repetitive genome
One of the outstanding features of the T. cruzi genome is its repetitive nature. Three types of sequences contribute to this characteristic: multigene families, retrotransposons and satellite DNA (tandem repeat sequences).
4.1 Multigene families
A main characteristic of the T. cruzi genome is the large number of multigene families, many of them with hundreds of members. The largest families in the T. cruzi genome are shown in Table 2. TS, mucins and MASP are located in the disruptive compartment of the genome, whereas GP63, DGF-1 and RHS are distributed in both compartments. We will focus on the families from the disruptive compartment (MASP, mucins and TS), and on GP63 as an example of a highly expanded family in T. cruzi. It is noteworthy that these families code for proteins directly involved in the interaction with the host, both at the cellular level (adhesion, invasion, infection) and in the modulation of immune responses, mainly because most TS, mucin, MASP and GP63 proteins are GPI-anchored, i.e., they are constitutive components of the functionally relevant cell surface of T. cruzi.
Table 2. Gene family groups in T. cruzi.
Trans-sialidases and trans-sialidase-like proteins (TS) constitute a large and polymorphic superfamily [25, 28, 29] whose name comes from the ability to transfer sialic acid from host glycoconjugates to the parasite’s mucins [70, 62]. This activity is highly relevant since T. cruzi is unable to synthesize sialic acid de novo, and sialic acid-containing glycoproteins have been shown to be relevant for infection [70, 62, 63]. However, only very few members of the TS family are predicted to be enzymatically active, whereas the rest have other relevant roles such as binding to host molecules, immunomodulation, apoptosis or invasion. Renaming this family would be worthwhile, since its current denomination leads to confusion. The hallmark of the family is the presence of the canonical amino acid motif VTVXNVXLYNR, although some members have a degenerate version of it. TS proteins can be secreted or membrane-anchored, in which case they exhibit an N-terminal signal peptide and a GPI signal sequence at the C-terminal region of the protein. Genomic analysis of the TS gene family in CL-Brener revealed that the family clusters into eight groups, classified by the presence or absence of additional motifs such as FRIP, the Asp box and SAPA [65, 66]. In this classification, Group I comprises the sequences with predicted enzymatic activity, which correspond to 4% of the total TS genes. Long-read sequencing allowed a more precise gene copy number to be determined in the TCC and Dm28c strains: 1734 and 1491 TS genes respectively; the classification should be updated with these new protein sequences. It is noteworthy that both strains exhibit a substantially high percentage of pseudogenes, 41.6% in TCC and 38% in Dm28c, which suggests that they may not constitute “inert material.” This point deserves further study to determine whether pseudogenes are expressed and/or constitute a source of variability, among other possible functions.
Most TS genes are overexpressed in trypomastigotes, but a small percentage are upregulated in amastigotes or epimastigotes at the transcriptional level.
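The canonical TS motif VTVXNVXLYNR, where X stands for any amino acid, can be written as a simple pattern. The sketch below is only an illustration of a strict motif search: the function name and the toy protein are hypothetical, and members carrying a degenerate version of the motif would not be detected by it.

```python
import re

# X positions of VTVXNVXLYNR are written as [A-Z] (any one-letter residue code)
TS_MOTIF = re.compile(r"VTV[A-Z]NV[A-Z]LYNR")


def find_ts_motif(protein_seq):
    """Return the 0-based start positions of the canonical TS motif."""
    return [m.start() for m in TS_MOTIF.finditer(protein_seq.upper())]


toy_protein = "MKLSAVTVKNVFLYNRPLSE"  # hypothetical sequence
positions = find_ts_motif(toy_protein)
```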
Mucins and mucin-like glycoproteins are the main acceptors of sialic acid through TS activity, and participate in adhesion, protection against lysis, invasion and immune evasion. The predicted protein of the first mucin-like gene cloned exhibited an internal tandem repeat with the canonical sequence T8KP2, flanked by an N-terminal signal peptide and a C-terminal GPI anchor signal sequence. Further studies revealed a complex family, with genes coding for proteins with similar N and C termini but with non-repetitive, variable, serine- and threonine-rich central domains, also classified as mucins. The groups with and without repetitive domains were designated TcMUCI and TcMUCII respectively, and the presence of mosaic sequences between both groups led to the proposal of a common ancestor followed by diversification. Another group of smaller mucin genes, TcSMUG, is expressed in the insect stages and was subclassified into large and small TcSMUG (L and S). Due to the complexity of this family, manual curation is needed to annotate these genes. Our group used the following criteria: genes exhibiting an N-terminal signal peptide, a C-terminal GPI anchor signal, and T-rich sequences such as T8KP2, T6-8KAP or T6-8QAP. With these criteria, 1018 and 574 mucin genes were found in TCC and Dm28c respectively, and around 20% were classified as pseudogenes in both strains. Regarding the expression of TcMUC and TcSMUG across the life stages of Trypanosoma cruzi, trypomastigotes showed the highest expression levels of both TcMUC groups and, in contrast with biochemical reports, the most highly expressed mucins in amastigotes belong to TcMUCII instead of TcMUCI.
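Of the three annotation criteria, only the T-rich sequences lend themselves to a simple pattern search. The sketch below assumes that the shorthand T8KP2 means eight threonines followed by KPP, and that T6-8KAP and T6-8QAP mean six to eight threonines followed by KAP or QAP; this reading, the function name and the toy sequences are assumptions. Signal peptide and GPI anchor prediction require dedicated tools and are not attempted here, and manual curation remains necessary.

```python
import re

# Assumed expansion of the shorthand used in the text:
#   T8KP2   -> TTTTTTTTKPP
#   T6-8KAP -> six to eight T followed by KAP
#   T6-8QAP -> six to eight T followed by QAP
T_RICH = re.compile(r"T{8}KPP|T{6,8}[KQ]AP")


def has_t_rich_motif(protein_seq):
    """Return True if the sequence contains one of the T-rich motifs."""
    return T_RICH.search(protein_seq.upper()) is not None


# Toy sequences (hypothetical)
with_t8kp2 = has_t_rich_motif("MA" + "T" * 8 + "KPPSS")
with_t6qap = has_t_rich_motif("MA" + "T" * 6 + "QAPSS")
too_short = has_t_rich_motif("MA" + "T" * 3 + "KAPSS")
```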
One of the most surprising results after the assembly and annotation of the first T. cruzi genome was the discovery of a new gene family composed of approximately 1300 genes, named mucin-associated surface proteins (MASP) because of their location close to, or clustered with, mucin genes. The MASP family is characterized by a conserved N-terminal signal peptide, a conserved C-terminal domain containing a GPI anchor addition site, and a variable central region. One of the proposed roles of this gene family is immune evasion during the acute phase of Chagas disease. The CL-Brener clone contains 1377 masp genes, of which 771 appear to be intact genes and 433 (31%) are pseudogenes, and analyses in Dm28c and TCC yielded similar results: 1045 and 1332 genes, of which 36% and 33% respectively are pseudogenes. Regarding the expression of this gene family, 97% of masp genes are upregulated in trypomastigotes, and a small number of genes are expressed specifically in amastigotes or epimastigotes.
GP63 proteins are GPI-anchored metalloproteases present in the TriTryps. However, whereas L. major contains six gp63 genes and T. brucei has thirteen copies, in T. cruzi this family is widely expanded: 400 genes or pseudogenes were identified in CL-Brener and Dm28c, and more than 700 in TCC. Strikingly, more than 60% of these genes in the three strains are annotated as pseudogenes. Although the role of this family in innate immune evasion and invasion has been extensively studied in Leishmania [73, 74], little is known about its role in T. cruzi, and the reason for its expansion in the T. cruzi genome remains to be elucidated. Transcriptomic analysis revealed that most members are highly expressed in trypomastigotes, whereas a few genes are expressed almost exclusively in amastigotes. Interestingly, phylogenetic analysis using 3′ UTR sequences of gp63 genes showed three clearly distinguished groups of sequences: one associated with genes highly expressed in trypomastigotes, another with genes highly expressed in amastigotes, and a third with genes showing almost no expression in any stage of the parasite. This result strongly supports a major role of the 3′ UTR in the posttranscriptional regulation of this family, which deserves further study.
4.2 Transposable elements
Transposable elements (TEs) are repeated DNA sequences that have the ability to move from one locus to another in the genome. For this reason they have been referred to as “junk” DNA, selfish sequences or genomic parasites. However, growing evidence indicates that TEs play an important role in the evolution of genes and genomes in a wide range of organisms, including trypanosomatids [75, 76]. The T. cruzi genome lacks class II elements (DNA transposons) and bears only class I retroelements. Among them, according to Wicker's TE classification, T. cruzi has three autonomous families: VIPER, a tyrosine recombinase (YR) element that belongs to the DIRS order; L1Tc, a non-LTR element of the ingi clade; and CZAR, also a non-LTR element, from the CRE clade, which is site-specific and inserts only in the SL gene [25, 76, 78]. Non-autonomous elements have also been identified. SIRE has similarity with the 5′ and 3′ ends of VIPER, resembling what are nowadays called solo-LTRs. NARTc is the non-autonomous partner of L1Tc, as classically described for LINE/SINE-like pairs. Finally, TcTREZO has been described as another site-specific retroelement, inserted within masp genes. Although it has been characterized as a non-LTR retroelement due to the presence of a poly-A tail and a secondary structure that would promote its retrotranscription, no conserved domains have been detected in this element. Hence, TcTREZO could be an ancient non-autonomous retroelement. All VIPER, CZAR and TcTREZO copies are defective (no complete domains were found), whereas L1Tc is the only family with putatively active copies.
4.3 Tandem repeats
Although NGS platforms brought enormous progress to our knowledge of genome composition and evolution, tandem repeats did not benefit to the same extent. Tandem repeats are commonly classified as micro-, mini- and macrosatellites, depending on their monomer or cluster length. Microsatellites have monomers of 2 to 5 bp, minisatellites of 15 to 100 bp, and macrosatellites, or simply satellites, involve repeats longer than 100 bp. Even with very deep genome coverage, short read lengths cause problems for de novo assemblies, especially in tandem repeat-rich regions. Because of this, tandem repeats can be considered neglected sequences in the majority of genome analyses. Despite great efforts, fragmentation of genome assemblies frequently occurs where repeated sequences are located. In fact, the major 195 bp satellite (named TcSAT1 in Repbase), described for the first time by Sloof et al., represents a huge challenge for contig assembly. Although PacBio reads enable improved assembly and characterization of tandem repeats, the size of some clusters exceeds that of the reads. In fact, some small contigs (50 kb) are composed entirely of the 195 bp satellite sequence.
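The monomer-length thresholds given above translate directly into a small classifier. The following sketch is illustrative only: the function name is hypothetical, and because the stated ranges leave some lengths uncovered (1 bp and 6 to 14 bp), those are reported as "unclassified" rather than assigned to a class by guesswork.

```python
def classify_tandem_repeat(monomer_bp):
    """Classify a tandem repeat by monomer length, using the ranges in the text."""
    if monomer_bp <= 0:
        raise ValueError("monomer length must be positive")
    if 2 <= monomer_bp <= 5:
        return "microsatellite"
    if 15 <= monomer_bp <= 100:
        return "minisatellite"
    if monomer_bp > 100:
        return "satellite"
    return "unclassified"  # lengths not covered by the stated ranges


# The 195 bp repeat falls in the satellite class
class_195 = classify_tandem_repeat(195)
```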
In summary, genomic studies are essential for understanding Trypanosoma cruzi biology, and the new technologies will help to answer still open questions: Which molecular mechanisms allow the regulation of specific genes in the absence of consensus sequences? Is Trypanosoma cruzi a single species? How many chromosomes does it have? How are the chromosomes organized? What role does the highly repetitive genome play in its plasticity? The list could go on, reinforcing the idea that much remains to be done.
This work was supported by Institut Pasteur de Montevideo (S.P. postdoctoral fellowship) from UK Research and Innovation via the Global Challenges Research Fund under grant agreement ‘A Global Network for Neglected Tropical Diseases’ grant number MR/P027989/1. LB, APT, FAV and CR are members of the Sistema Nacional de Investigadores (SNI-ANII, Uruguay).