Summary statistics of the sequence assembly generated from Cichorium intybus.
Radicchio (Cichorium intybus subsp. intybus var. foliosum L.) is one of the most important leaf chicories, used mainly as a component for fresh salads. Recently, we sequenced and annotated the first draft of the leaf chicory genome, as we believe it will have an extraordinary impact from both scientific and economic points of view. Indeed, the availability of the first genome sequence for this plant species will provide a powerful tool to be exploited in the identification of markers associated with or genes responsible for relevant agronomic traits, influencing crop productivity and product quality. The plant material used for the sequencing of the leaf chicory genome belongs to the Radicchio of Chioggia type. Genomic DNA was used for library preparation with the TruSeq DNA Sample Preparation chemistry (Illumina). Sequencing reactions were performed with the Illumina platforms HiSeq and MiSeq, and sequence reads were then assembled and annotated. We are confident that our efforts will extend the current knowledge of the genome organization and gene composition of leaf chicory, which is crucial for developing new tools and diagnostic markers useful for our breeding strategies in Radicchio.
- Genome draft
- marker-assisted breeding
- gene prediction
- SSR markers
- SNP calling
1. Introduction
The common Italian name of Radicchio was adopted in recent years by all the most internationally used languages and indicates a highly differentiated group of chicories, with red or variegated leaves. Radicchio (Cichorium intybus subsp. intybus var. foliosum L.) is currently one of the most important leaf chicories, used mainly as a component for fresh salads but also very often cooked and prepared differently according to local traditions and dietary habits. This plant species belongs to the Asteraceae family and includes several cultivar groups whose commercial food products are the leaves, namely Witloof, Pain de sucre, and Catalogne, as well as several types of Radicchio.
From the reproductive point of view, Radicchio is prevalently allogamous, due to an efficient sporophytic self-incompatibility system, proterandry and gametophytic competition favoring allo-pollen grains and tubes. Probably known by the Egyptians and used as food and/or medicinal plants by the ancient Greeks and Romans, this species gradually underwent a process of naturalization and domestication in Europe during the past few centuries. This plant has become part of both natural and agricultural environments of Italy. Currently, among the different biotypes of leaf chicories, the so-called Radicchio of Chioggia, native to and very extensively grown in northeastern Italy, is the Radicchio cultivar acquiring more and more commercial interest worldwide. In Italy, the Radicchio of Chioggia is cultivated on a total area of approximately 16,000–18,000 ha, half of which is in the Veneto region, with a total production of approximately 270,000 tons (more than 60% obtained using professional seeds), reaching an overall turnover of approximately €10,000,000 per year.
Grown plant materials are usually represented by landraces or their directly derived synthetics that are known to possess a high variation and adaptation to the natural and anthropological environment where they originated and are still cultivated. These populations are characterized by high-quality traits and have been maintained or even improved over the years by local farmers through phenotypical selection according to their own criteria and more recently by seed companies through genotypical selection following intercross or polycross schemes combined with progeny tests to obtain populations showing superior DUS scores for both agronomic and commercial traits. The breeding programs currently underway by local firms and regional institutions exploit the best landraces and aim to isolate individuals amenable for use as parents for the constitution of narrow genetic base synthetic varieties and/or to select inbred lines suitable for the production of heterotic F1 hybrids. In recent years, phenotypic evaluation trials have been increasingly assisted by genotypic selection procedures through the use of molecular markers scattered throughout the genome. In fact, marker-assisted breeding allows the identification of the parental individuals or the inbred lines showing the best general or specific combining ability in order to breed synthetics and hybrids, respectively.
Radicchio, like the other leaf chicories, is diploid (2n=2x=18) and is characterized by an estimated haploid genome size of approximately 1.3 Gb. In recent years, three distinct saturated molecular linkage maps were constructed for leaf chicories, covering approximately 1,200 cM [3-5]. Their linkage groups were mainly based on neutral SSR markers, but many EST-derived SNP markers were also mapped. A method for genotyping elite breeding stocks of Radicchio, both local and modern varieties, assaying mapped SSR marker loci possibly linked to EST-rich regions and scoring PIC>0.5, was recently developed using multiplex PCRs. Here, we are dealing with a research and development project aimed at sequencing and annotating the first draft of the leaf chicory genome as we believe it will have an extraordinary impact from both scientific and economic points of view. Indeed, the availability of the first genome sequence for this plant species will provide a powerful tool to be exploited in the identification of markers associated with or genes responsible for relevant agronomic traits, influencing crop productivity and product quality. As an example, data and knowhow produced in this research project will be useful for detailed studies of the genetic control of male-sterility and self-incompatibility in this species.
The plant material that we used for the sequencing of the leaf chicory genome belongs to the Radicchio of Chioggia type, specifically to the male fertile inbred line named SEG111. This type was chosen as the most suitable accession based on the following criteria: i) the commercial relevance of the variety of origin; ii) the availability of clonal materials; iii) robust phenotypic and genotypic characterization; iv) a high degree of homozygosity (80%); and v) high breeding value as pollen parent of F1 hybrids. Sequencing reactions of the genomic DNA library were performed with Illumina HiSeq and MiSeq platforms to combine the high number of reads originated by the former with the longer sequences produced by the latter. Here, we report original data from the bioinformatic assembly of the first genome draft of Radicchio, along with the most relevant findings that emerged from an extensive de novo gene prediction and in silico functional annotation of more than 18,000 unigenes. Analyses were performed according to established computational biology protocols by taking advantage of the publicly available reference transcriptome data for Cichorium intybus. The main preliminary findings on the genome organization and gene composition of Radicchio are presented, and the potentials of newly annotated expressed sequences and diagnostic microsatellite markers in breeding programs are critically discussed.
2. Materials and methods
2.1. Plant materials
Plant materials used for the sequencing belong to a variety of commercial relevance of the Radicchio of Chioggia type. The clone chosen derives from the inbred line SEG111 and shows a degree of homozygosity equal to 80%. In particular, this clone was obtained by several cycles of selfing from plants yearly selected on the basis of a robust phenotypic and genotypic characterization, being also characterized by high-quality agronomic traits on farm and the ability to be easily cloned in vitro.
2.2. DNA isolation and sequencing
DNA was isolated from 150 mg of fresh leaf tissue using a CTAB-based protocol. Possible RNA contamination was removed with an RNase A (Sigma-Aldrich) treatment. DNA samples were eluted in 80–100 μL of 0.1× TE buffer (100 mM Tris-HCl 1, 0.1 mM EDTA, pH=8). The integrity of the extracted DNA samples was estimated through electrophoresis in 0.8% agarose/1× TAE gels containing 1× SYBR Safe DNA Gel Stain (Life Technologies, USA). The purity and quantity of the DNA extracts were assessed with a NanoDrop spectrophotometer (Thermo Scientific, USA). Then, 1 μg of high-quality DNA was used for library preparation with the TruSeq DNA Sample Preparation chemistry (Illumina). Sequencing reactions were performed with the Illumina platforms: HiSeq (1 lane, 2 × 100 bp) and MiSeq (1 lane, 2 × 300 bp).
2.3. De novo assembly and annotation
All high-quality reads generated from the two sequencing reactions were assembled in a single reference genome. Assemblies were attempted with three pieces of software: i) Velvet; ii) SPAdes; and iii) CLC Genomics Workbench 6.5 (Qiagen). The average coverage was estimated for the HiSeq run by calculating the frequency distribution of 25-mers.
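The 25-mer frequency distribution mentioned above can be illustrated with a minimal Python sketch. It is not the tool used for the published analysis: the function names are hypothetical, canonical (reverse-complement-aware) counting is omitted for brevity, and the toy example uses k = 4 so that the counts can be checked by hand. Note that the peak of such a distribution is a k-mer coverage, which is somewhat lower than the per-base coverage for reads of finite length.

```python
from collections import Counter


def kmer_counts(reads, k=25):
    """Count every k-mer (k = 25 in the text) across a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def frequency_distribution(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def coverage_peak(distribution, min_multiplicity=2):
    """Return the multiplicity shared by the most distinct k-mers.

    Bins below min_multiplicity are ignored because, in real data, they are
    usually dominated by sequencing errors (an assumption of this sketch).
    """
    candidates = {m: n for m, n in distribution.items() if m >= min_multiplicity}
    return max(candidates, key=candidates.get)


# Toy example with k = 4 instead of 25.
toy_reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTTTGGGG"]
toy_counts = kmer_counts(toy_reads, k=4)
toy_distribution = frequency_distribution(toy_counts)
toy_peak = coverage_peak(toy_distribution)
```

In a real run, k-mers occurring far more often than the main peak would correspond to the high-frequency tail that is commonly interpreted as repeated sequence.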
To annotate all assembled contigs, a BLASTX-based approach was used to compare the C. intybus sequences to a subset of the NR protein collection that was made by focusing on the clade pentapetalae. Moreover, the GI identifiers of the best BLASTX hits, having E-value ≤1.0E-15 and similarity ≥70%, were mapped to the UniprotKB protein database to extract Gene Ontology annotations and KEGG terms for functional annotations. Further enrichment of enzyme annotations was made with the BLAST2GO software v1.3.3 using the function “direct GO to Enzyme annotation”. The BLAST2GO software v1.3.3 [16, 17] was used to reduce the complexity of the data and perform basic statistics on ontological annotations, as reported by Galla et al.
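The hit-selection step (E-value ≤1.0E-15 and similarity ≥70%) can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Hit` record and its field names are hypothetical, the hits are assumed to have been parsed already from the BLASTX output, and "best" is taken to mean the highest bit score among the hits that pass both thresholds, which the text does not specify.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one parsed BLASTX alignment."""
    query: str         # contig identifier
    subject: str       # protein identifier
    similarity: float  # percentage of positive-scoring positions
    evalue: float
    bitscore: float


def best_reliable_hits(hits, max_evalue=1.0e-15, min_similarity=70.0):
    """Keep, for each query, the best-scoring hit that passes both thresholds."""
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue or hit.similarity < min_similarity:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


example_hits = [
    Hit("contig_1", "protA", 85.0, 1e-40, 160.0),
    Hit("contig_1", "protB", 92.0, 1e-60, 230.0),
    Hit("contig_2", "protC", 65.0, 1e-30, 120.0),  # similarity too low
    Hit("contig_3", "protD", 88.0, 1e-10, 60.0),   # E-value too high
]
reliable = best_reliable_hits(example_hits)
```

Only `contig_1` survives in this toy input, and its retained subject is the higher-scoring `protB`; the identifiers of such hits would then be mapped to the protein database for GO and KEGG retrieval.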
SSRs were detected among the 522,301 contigs via MISA. The parameters were adjusted to identify perfect and complex mono-, di-, tri-, tetra-, penta-, and hexanucleotide motifs with a minimum of 49, 13, 9, 8, 8, and 8 repeats, respectively. Repeated elements were detected with a BLASTN-based approach using a PGSB Repeat Element Database in all BLAST searches. The parameters set for the identification of Transposable Elements (TEs) were: reward 1, penalty 1, gap_open 2, gap_extend 2, word_size 9, dust no. An E-value cutoff of 1.0E-9 was adopted to filter the BLAST results.
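To make the repeat thresholds concrete, the following Python sketch searches a sequence for perfect SSRs using the minimum repeat numbers listed above. It is not MISA and does not reproduce its behavior in detail: complex (compound) microsatellites are not handled, and the function names are hypothetical.

```python
import re

# Minimum repeat numbers per motif length, as listed in the text.
MIN_REPEATS = {1: 49, 2: 13, 3: 9, 4: 8, 5: 8, 6: 8}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )


def find_perfect_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    found = []
    for size, minimum in sorted(min_repeats.items()):
        # A motif of `size` bases followed by at least (minimum - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, minimum - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip e.g. "AA" or "ACAC", which belong to shorter motif classes.
            if is_primitive(motif):
                found.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(found)


toy_sequence = "GG" + "AC" * 13 + "TT" + "AGC" * 9 + "G"
toy_ssrs = find_perfect_ssrs(toy_sequence)
```

In the toy sequence, the (AC)13 and (AGC)9 runs are reported, whereas a run of only twelve AC units would fall below the dinucleotide threshold and be ignored.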
Two public C. intybus transcriptomes, CHI-2418 and CHI-Witloof, originally developed from plant seedlings and corresponding to a wild accession of leaf chicory and a cultivated variety of witloof, respectively, were mapped to the reference genome using the CLC Genomics Workbench V7.02 (Qiagen). Mappings were performed with default mapping parameters, including mismatch cost: 2; insertion cost: 3; deletion cost: 3; length fraction: 0.5; and similarity fraction: 0.8. Non-specific matches were ignored and not included in the annotation tracks. For nucleotide variant analysis, the appropriate reference masking options were used to map transcriptome reads selectively over the sequences annotated as CDS or TEs. The variant detection analysis was done by using the Basic Variant Detection tool of the CLC Genomics Workbench V7.02 (Qiagen) with default parameters. As general filters, positions with coverage above 100,000 were not considered. Base quality filters were turned on and set to default parameters. All variants included in homopolymer regions with minimum length of 3 nt, and with frequency below 0.8 were also removed from the dataset. As coverage and count filters, all variants with a minimum count lower than 20 were discarded.
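The coverage, count, and homopolymer filters can be summarized with a short Python sketch. It does not reproduce the CLC Genomics Workbench implementation. The `Variant` record is hypothetical, the homopolymer rule is read as a single filter (a variant is removed only if it lies in a homopolymer run of at least 3 nt and its frequency is below the threshold), and the value 0.8 is treated as a fraction of the covering reads; both readings are assumptions, since the text does not state them explicitly.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record for one called variant."""
    contig: str
    position: int
    coverage: int         # reads covering the position
    count: int            # reads supporting the variant
    in_homopolymer: bool  # lies in a homopolymer run of at least 3 nt

    @property
    def frequency(self):
        return self.count / self.coverage


def passes_filters(v, max_coverage=100_000, min_count=20,
                   homopolymer_min_frequency=0.8):
    """Apply the three filters described in the text to one variant."""
    if v.coverage > max_coverage:   # general coverage filter
        return False
    if v.count < min_count:         # count filter
        return False
    # Assumed conjunction: homopolymer context AND low frequency.
    if v.in_homopolymer and v.frequency < homopolymer_min_frequency:
        return False
    return True


variants = [
    Variant("c1", 10, 50, 45, False),        # kept
    Variant("c1", 20, 50, 10, False),        # too few supporting reads
    Variant("c2", 5, 200_000, 500, False),   # coverage too high
    Variant("c3", 7, 100, 30, True),         # low frequency in homopolymer
    Variant("c3", 9, 100, 90, True),         # kept
]
kept_positions = [v.position for v in variants if passes_filters(v)]
```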
3. Results
3.1. Genome assembly statistics
To obtain the first genome draft of leaf chicory, a single genomic library produced from the inbred line SEG111 was sequenced using the Illumina MiSeq and HiSeq platforms. Here, we report the genome assembly results derived from the CLC Genomics Workbench assembly output. Figure 1 describes the frequency distribution of 25-mers in the HiSeq data.
The data shown suggest that the average coverage in the HiSeq run is approximately 21×. Additionally, the curve indicates that a certain number of sequences are present with a relatively high frequency within the genome. This might indicate that repeated elements are relatively abundant within the genome. As a consequence, the estimated size of the assembled genome draft is 760 Mb.
We obtained 58,392,530 and 389,385,400 raw reads through the MiSeq and HiSeq platforms, respectively. The de novo assembly of the two datasets in a unique reference genome draft assembled 724,009,424 nucleotides into 522,301 contigs. The maximum contig length was equal to 379,698 bp, whereas the minimum contig length was set to 200 bp, with an average contig length of 1,386 bp. Overall statistics are summarized in Table 1.
|Total number of contigs||522,301|
|Total No. of assembled nucleotides (nt)||724,009,424|
|Average contig length (bp)||1,386|
|Minimum contig length (bp)||200|
|Maximum contig length (bp)||379,698|
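The values in Table 1 are simple summary statistics over the contig lengths. A minimal Python sketch, with a hypothetical function name and a toy list of lengths, shows how they are obtained and checks that the reported average is consistent with the reported totals.

```python
def assembly_stats(contig_lengths):
    """Return the summary values of Table 1 for a list of contig lengths (bp)."""
    total = sum(contig_lengths)
    return {
        "contigs": len(contig_lengths),
        "assembled_nt": total,
        "average_length": total / len(contig_lengths),
        "min_length": min(contig_lengths),
        "max_length": max(contig_lengths),
    }


# Consistency check on the published totals: about 1,386 bp per contig.
reported_average = 724_009_424 / 522_301

# Toy input, not real data.
toy_stats = assembly_stats([200, 450, 1_000, 3_350])
```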
The length distribution of the contig size, expressed in base pairs, is reported in Figure 2.
As many as 68.9% of the recovered sequences are between 200 nt and 999 nt long. Contigs between 1,000 nt and 2,999 nt account for 19.7% of the assembly, whereas contigs of 3,000 nt or longer account for 11.5%.
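The three length classes can be computed from a list of contig lengths as in the following Python sketch. The function name and the toy lengths are hypothetical; only the class boundaries come from the text.

```python
def length_class_percentages(contig_lengths):
    """Percentage of contigs falling in each of the three length classes."""
    classes = {"200-999": 0, "1,000-2,999": 0, ">=3,000": 0}
    for length in contig_lengths:
        if length < 1_000:
            classes["200-999"] += 1
        elif length < 3_000:
            classes["1,000-2,999"] += 1
        else:
            classes[">=3,000"] += 1
    n = len(contig_lengths)
    return {name: 100 * count / n for name, count in classes.items()}


# Toy input, not real data.
toy_lengths = [200, 350, 999, 1_000, 2_999, 3_000, 12_000, 500]
toy_classes = length_class_percentages(toy_lengths)
```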
We searched the genome sequence assembly for TEs and estimated their abundance using a BLASTN strategy. The proportion of base pairs annotated as TEs out of the total amount of assembled nucleotides was equal to 6.3% (Table 2).
|Key||Classification||Number||Abundance (%)||Length (bp)||Percentage over the assembled genome|
|02.01||Class I retroelement||273||0.19||85,241||0.012%|
|02.05||Class II: DNA Transposon||1,976||1.36||713,119||0.098%|
|02||Unclassified mobile element||861||0.59||199,301||0.028%|
|10 / 90 / 99||High Copy Number Genes and additional attributes||283||0.19||51,577||0.007%|
The retroelements were the most abundant elements (>97% of the total). Within the major class of retroelements, Long Terminal Repeat (LTR) retrotransposons proved to be the dominant class (56.55%) in the leaf chicory genome. Moreover, the Copia-type (24.61%) and the Gypsy-type (16.26%) appeared to be the most abundant LTR retrotransposons. A total of 273 (0.2%) elements were annotated as retroelements, but they lacked the assignation to a specific class based on sequence similarity and conservation. Non-LTR retrotransposons were detected to a very low extent (0.24%). Less than 2% of the total repeat elements were annotated as DNA transposons.
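The last column of Table 2 is the cumulative length of each class divided by the total number of assembled nucleotides. As a hedged check using only the rows that appear in Table 2, the arithmetic can be reproduced in Python (the variable and function names are illustrative):

```python
ASSEMBLED_NT = 724_009_424  # total assembled nucleotides (Table 1)

# Cumulative lengths (bp) of the classes listed in Table 2.
te_lengths = {
    "Class I retroelement": 85_241,
    "Class II: DNA Transposon": 713_119,
    "Unclassified mobile element": 199_301,
    "High Copy Number Genes and additional attributes": 51_577,
}


def share_of_assembly(length_bp, assembled_nt=ASSEMBLED_NT):
    """Percentage of the assembled genome covered by length_bp."""
    return 100 * length_bp / assembled_nt


shares = {name: round(share_of_assembly(bp), 3) for name, bp in te_lengths.items()}
```

The rounded shares agree with the percentages reported in the table (0.012%, 0.098%, 0.028%, and 0.007%).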
3.2. Discovery of SSR loci
Overall, we identified 66,785 SSR-containing regions. As many as 52,186 sequences proved to contain a single microsatellite, and 11,501 contained more than one. These numbers included 1,226 mononucleotide SSR motifs, which were not taken into account for further computations. The total number of di- or multinucleotide SSR motifs was therefore 65,559.
The most common SSR elements were those showing a dinucleotide motif (89.0%), followed by trinucleotide (7.1%) and tetranucleotide (3.0%) ones. Microsatellites revealing a pentanucleotide and hexanucleotide motif were less than 1.0% of the total. Overall data are summarized in Table 3.
|Type of motif||Range of repeat numbers||Total No.||Percentage (%)|
3.3. Functional annotation of contig sequences
Functional annotation of the assembled contigs was performed with a BLASTX approach, according to which all contig sequences were used to query different public protein databases (Table 4).
|Public database||No. of Hits (gene models)||No. of C. intybus contigs|
The database enclosing all public protein sequences belonging to the pentapetalae clade of the eudicots, which includes the sub-clades of rosids and asterids to which leaf chicory belongs, provided a total of 38,782 hits. The proteome of Arabidopsis thaliana alone scored 16,689 hits when an E-value cutoff of 1.0E-15 was applied for the screening of the most reliable BLASTX hits.
Two public C. intybus transcriptomes originally developed from plant seedlings and provided by UC Davis, the Compositae Genome Project (CHI-2418 and CHI-Witloof), were mapped to the reference genome using the appropriate mapping function of the CLC Genomics Workbench.
By doing so, we were able to map 76.5% and 78.0% of the sequences, respectively. Data derived from the mapping of two C. intybus transcriptomes were used to integrate the annotation of the assembled contigs. BLAST and mapping data integration increased the BLAST-based annotation with an additional set of 1,995 contigs.
Arabidopsis matches were used to retrieve both GO and KEGG annotations from public databases. We could finally assign one or multiple GO terms to 45,381 leaf chicory genome contigs. The analysis performed against the GO database identified 14,073 genes annotated with terms belonging to one or multiple vocabularies. Of these, 24,634 contigs were annotated for their putative biological process, 39,118 contigs were related to a molecular function, and 37,561 contigs were associated with a specific cellular component. Figure 3 shows the fine distribution of the 14,073 hits caught by our Radicchio contigs from the TAIR database according to the aforementioned three GO categories.
Among all the terms underlined by the GO vocabulary for the biological process, our investigations were focused on terms related to the response to biotic and abiotic stresses (Figure 4), hormonal responses (Figure 5), and flower and seed development (Figure 6). Of the 15 most interesting processes for molecular breeding in leaf chicory, 7 and 8 were linked to biotic and abiotic stresses, respectively (see Figure 4). The ontological terms were assigned to 2,388 and 3,844 genome contigs, respectively.
The computational analysis for the identification of SSR elements within these contigs unveiled 495 motifs linked to biotic stresses and 841 motifs associated with abiotic stresses. Among the biotic stresses, the most abundant gene ontology (GO) term was GO:0042742, which corresponds to the “defense response to bacterium” and shows a match with 667 genome contigs containing 135 microsatellites. Concerning the abiotic stresses, the GO term assigned with the higher frequency was GO:0009651, which accounts for processes related to “response to salt stress” and matches 1,028 genome contigs containing 249 microsatellites.
Data of hormonal responses and processes of flower and seed development are reported in Figures 5 and 6. The analysis for hormonal responses noted nine different GO terms, for a total of 3,344 genome contigs, and 833 SSR elements linked to these sequences and terms. In particular, the term “response to jasmonic acid stimulus” (GO:0009753) was the most represented, with 478 matches with different genome contigs, including 118 SSR motifs (Figure 5).
Results of the GO term annotation of genome contigs according to the flower and seed developmental processes are reported in Figure 6.
The flower development process was embraced by selecting nine ontological terms, whereas three terms were assigned to seed development and seed germination. A total of 2,162 contigs were annotated with GO terms related to flower development; 496 of these were also annotated for the presence of one or multiple SSRs. In particular, the term “pollen development” (GO:0009555) was the most abundant, with 655 contigs containing 153 SSR motifs.
As far as the seed development process is concerned, we annotated 1,182 contigs linked to this GO term, 273 of which co-localized with one or multiple SSRs. Among these, the most abundant ontological term was “embryo development ending in seed dormancy” (GO:0009793) as it is assigned to 771 contigs, co-localizing with 171 SSR elements.
Using the Kyoto Encyclopedia of Genes and Genomes database (http://www.genome.jp/kegg/), a total of 22,273 contigs enabled the mapping of 795 enzymes to 157 metabolic pathways. Among the metabolic pathways with the highest number of matched gene models, we found fructose and mannose metabolism (418 gene models matched), phenylpropanoid biosynthesis (415 gene models matched) and tryptophan metabolism (380 gene models matched). The flavonoid biosynthesis pathway, described in map00941, is relevant as the biosynthesis of flavonoids is directly connected to the synthesis of anthocyanins (Figure 7), whose accumulation contributes to the pigmentation of leaf chicories. This map includes 236 gene models that were assigned to 14 unique enzymes, including CHS (CHALCONE SYNTHASE), CHI (CHALCONE ISOMERASE), and ANS (ANTHOCYANIDIN SYNTHASE), among others.
KEGG data related to a number of selected metabolic pathways were exploited to find SSR regions potentially associated with highly valuable phenotypes in this plant species. The number of SSRs putatively linked to the most interesting phenotypic traits with breeding values in leaf chicory is displayed in Table 5.
|KEGG map ID||Metabolic pathway||Characteristic||No. of SSRs|
|map00909||Sesquiterpenoid and triterpenoid biosynthesis||Bitter taste||107|
|map00053||Ascorbate and aldarate metabolism||Vitamin C content||172|
|map00940||Phenylpropanoid biosynthesis||Leaf color||281|
|map00941||Flavonoid biosynthesis||Leaf color||173|
|map00942||Anthocyanin biosynthesis||Leaf color||180|
|map00943||Isoflavonoid biosynthesis||Leaf color||5|
|map00944||Flavone and flavonol biosynthesis||Leaf color||128|
|map00040||Pentose and glucuronate interconversions||Response to cold||96|
|map00051||Fructose and mannose metabolism||Response to cold||259|
|map00052||Galactose metabolism||Response to cold||31|
|map00061||Fatty acid biosynthesis||Response to cold||39|
|map00260||Glycine, serine and threonine metabolism||Response to cold||60|
|map00290||Valine, leucine and isoleucine biosynthesis||Response to cold||13|
|map00330||Arginine and proline metabolism||Response to cold||55|
|map00410||beta-Alanine metabolism||Response to cold||16|
|map00480||Glutathione metabolism||Response to cold||48|
|map00500||Starch and sucrose metabolism||Response to cold||164|
|map00561||Glycerolipid metabolism||Response to cold||159|
|map00564||Glycerophospholipid metabolism||Response to cold||124|
|map00592||alpha-Linolenic acid metabolism||Response to cold||66|
|map00710||Calvin cycle||Response to cold||28|
|map00780||Biotin metabolism||Response to cold||18|
|map00960||Tropane, piperidine and pyridine alkaloid biosynthesis||Response to cold||97|
Considering the overall grouping of selected metabolic pathways, we identified many microsatellite sequences putatively linked to important traits, according to their potential effect on plant characteristics. For instance, 107 SSRs were linked to bitter taste, 172 SSRs were associated with vitamin C biosynthesis and metabolism, and 767 SSRs were located in sequence contigs encoding enzymes of the flavonoid and anthocyanin biosynthetic pathways and are thus potentially associated with leaf color. The most represented characteristic is the response to cold. For this trait, we analyzed 16 different metabolic pathways that altogether led to the selection of 1,273 microsatellites potentially associated with one or multiple genes actively involved in the plant response to cold, possibly, but not exclusively, through the accumulation of sugar.
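The per-trait totals quoted above follow directly from Table 5. The following Python sketch groups the table rows by characteristic and sums the SSR counts; the data are copied from Table 5, while the variable names are illustrative.

```python
from collections import defaultdict

# (KEGG map ID, characteristic, number of SSRs), copied from Table 5.
table5 = [
    ("map00909", "Bitter taste", 107),
    ("map00053", "Vitamin C content", 172),
    ("map00940", "Leaf color", 281),
    ("map00941", "Leaf color", 173),
    ("map00942", "Leaf color", 180),
    ("map00943", "Leaf color", 5),
    ("map00944", "Leaf color", 128),
    ("map00040", "Response to cold", 96),
    ("map00051", "Response to cold", 259),
    ("map00052", "Response to cold", 31),
    ("map00061", "Response to cold", 39),
    ("map00260", "Response to cold", 60),
    ("map00290", "Response to cold", 13),
    ("map00330", "Response to cold", 55),
    ("map00410", "Response to cold", 16),
    ("map00480", "Response to cold", 48),
    ("map00500", "Response to cold", 164),
    ("map00561", "Response to cold", 159),
    ("map00564", "Response to cold", 124),
    ("map00592", "Response to cold", 66),
    ("map00710", "Response to cold", 28),
    ("map00780", "Response to cold", 18),
    ("map00960", "Response to cold", 97),
]

ssr_totals = defaultdict(int)      # SSRs per characteristic
pathway_counts = defaultdict(int)  # pathways per characteristic
for _map_id, trait, n_ssrs in table5:
    ssr_totals[trait] += n_ssrs
    pathway_counts[trait] += 1
```

The sums reproduce the 767 SSRs for leaf color and the 1,273 SSRs across 16 pathways for the response to cold. An SSR located on a contig assigned to more than one pathway would be counted once per pathway in this simple sum.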
We also performed the calling of nucleotide variants. Stringent quality criteria were used for discriminating sequence variations from sequencing errors and mutations introduced during cDNA synthesis. Only sequence variations with mapping quality scores over the established thresholds were annotated, leading to the identification of 123,943 and 121,086 variants that were present only in the leaf chicory transcriptome CHI-2418 (wild type) or the Witloof transcriptome CHI-Witloof (cultivated type), respectively. A total of 119,729 variants were shared by both C. intybus transcriptomes. The average number of variants per contig ranged from 9.5 to 10.5 in the two assemblies (Table 6), yielding one single variation per 100 bp in both cases.
|Radicchio CDS – 29,175 contigs||Radicchio TEs – 122,745 contigs|
|No. variants/100 bp||0.99 (1.14)||0.98 (1.14)||1.14 (2.05)||3.26 (3.64)||3.16 (3.50)||5.42 (8.88)|
The vast majority of variants were Single Nucleotide Variants (SNVs), whereas Multi Nucleotide Variants (MNVs), Insertions, and Deletions were found to a considerably lower extent (Table 6). On average, the proportion of SNVs and MNVs was comparable in the CDS and TE contigs and equal to about 90% and 5%, respectively.
Among all contigs annotated as TEs, those characterized by the presence of one or multiple variants were 10,662 and 10,651 for the two transcriptomes (Table 6). The average number of variants per contig was equal to 5.3 and 5.5. Despite the relatively low abundance of polymorphic residues in these regions, the average number of variants per 100 bp was equal to 3.3 and 3.2. Single Nucleotide Polymorphisms (SNPs) were by far the most abundant type of variants in TEs as well as in CDS regions (Table 6). In particular, transversions and transitions were on average 37% (ranging from 35.6% to 37.8%) and 63% (ranging from 62.2% to 64.4%) of the point mutations, respectively. The total number of nonsynonymous SNPs calculated with the reference transcriptomes was equal to 13,559 (10.9%) and 11,197 (9.2%) for wild-type leaf chicory and cultivated Witloof accessions, respectively.
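For readers less familiar with the distinction, a transition exchanges two purines (A, G) or two pyrimidines (C, T), whereas a transversion exchanges a purine for a pyrimidine or vice versa. The following Python sketch classifies single nucleotide variants accordingly and computes the transition fraction; the function names and the toy input are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snv(ref, alt):
    """Classify a single nucleotide change as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or ref not in BASES or alt not in BASES:
        raise ValueError(f"not a valid single nucleotide variant: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def transition_fraction(snvs):
    """Fraction of (ref, alt) pairs that are transitions."""
    kinds = [classify_snv(ref, alt) for ref, alt in snvs]
    return kinds.count("transition") / len(kinds)


# Toy input, not real data: three transitions and two transversions.
toy_snvs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("T", "C")]
toy_fraction = transition_fraction(toy_snvs)
```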
4. Discussion
Here, we report the uncovering of the first draft of the Radicchio genome. This highly relevant discovery was achieved by combining the recent advancement of next-generation sequencing technologies on the public side with the significant investment of financial resources in research and development on the private side.
Currently, conventional agronomic-based selection methods are supported by molecular marker-assisted breeding schemes. In recent years, we have demonstrated that the constitution of F1 hybrids is not only feasible in a small experimental scheme but also realizable and profitable on a large commercial scale (e.g., registered CPVO varieties TT4070/F1, TT5010/F1, TT5070/F1, and TT4010/F1 in progress). F1 hybrids are varieties manifesting heterosis, or hybrid vigor, which refers to the phenomenon in which highly heterozygous progeny plants obtained by crossing genetically divergent inbred or pure lines exhibit greater biomass, faster speed of development, higher resistance to pests and better adaptation to environmental stresses than the two homozygous parents. Critical steps of an applicative breeding program are the production of parental inbreds. Two highly relevant factors in this context are the selection of self-compatible genotypes, to be used as pollen donors, and the identification of male-sterile genotypes, to be used as seed parents in large-scale crosses [21, 22].
It is worth mentioning that there are several reasons why the constitution of F1 hybrids is a strategic choice for a seed company. First, the crop yield of modern F1 hybrid varieties is usually much higher than that of traditional OP or synthetic varieties. Second, the uniformity of F1 populations and the way to legally protect their parental lines allow a seed company to adopt plant breeder’s rights, promoting genetic research and development programs that are very expensive and require many years. Finally, the need for breeding hybrid varieties also promotes the preservation of local varieties because the selection of appropriate inbred or pure lines as parents in pairwise cross-combinations requires the exploration and exploitation of germplasm resources. Our expectation is that F1 hybrid varieties will be bred and adopted with increasing frequency in Radicchio. Consequently, we invested in the sequencing and annotation of the first draft of the leaf chicory genome as it will have an extraordinary impact from both scientific and economic points of view. Indeed, the availability of the first genome sequence for this plant species will provide a powerful tool to be exploited in the identification of markers associated with or genes responsible for relevant agronomic traits, influencing crop productivity and product quality. As an example, data and knowhow produced in this research project will be capitalized on in subsequent years to plan and develop basic studies and applied research on male-sterility and self-incompatibility in this species.
The availability of high-quality sequencing platforms (i.e., Illumina) on the one hand, and specific and high-performing software for genome data assembly and gene set analysis on the other, made this project feasible. High-quality genomic DNA libraries were used for sequencing reactions performed with the Illumina platforms HiSeq and MiSeq, yielding a total of 197 million (mln) short reads and 29 mln longer sequences passing quality filters, respectively, which were then bioinformatically assembled to obtain the first genome draft. On the basis of this strategy, the genome draft of leaf chicory is composed of approximately 500,000 contigs, forming approximately 720 Mb. Based on the distribution of 25-mer frequencies, we estimated that the genome coverage is close to 25X. The same distribution also indicates that a significant part of the genome might be composed of highly repeated elements, as indicated by the number of k-mers that appears to be present with high frequency.
Nucleotide variant calling for the Radicchio genome showed a comparable number of polymorphisms in the pairwise comparisons with the two publicly available transcriptomes, originally developed from seedlings of two leaf chicory accessions (i.e., wild and cultivated types). The total number of variants discovered in the CDS regions was shown to be approximately 10 times higher than that found in the TEs. This result might be a consequence of low expression, or silencing, of numerous transposable elements at the level of plant seedlings, as indicated by the finding that the mapping of the two transcriptomes to the reference genome failed to align sequences to about 98% of the contigs annotated as TEs. Notably, the number of variations per 100 base pairs was significantly higher in the TEs than in the CDS sequences. This result might be explained by the accumulation of mutations in noncoding sequences, as most of the TEs are.
Overall, Single Nucleotide Variants (SNVs) were the most common variants compared with In/Del mutations. Since SNP mutations very often result in silent mutations, their high proportion in the CDS regions was an expected result. In/Del mutations that usually occur in silenced or functionally disrupted genes, along with noncoding regions, were found at a low rate in CDS regions.
TEs were found to occur, in at least one copy, in 23.50% of the 522,301 contigs that constitute our chicory genome draft assembly. Retrotransposons proved to be the most abundant elements in the Radicchio genome. This finding is in agreement with data from previous studies [23-26]. It is worth mentioning that Copia-type elements were more abundant than Gypsy-type elements, forming the predominant subclass of LTR retrotransposons.
Although the proportion of TEs in the assembled sequences was much lower than that reported for other species, the class ratio of the TE types corresponds to that found in previous studies [23-26]. Our estimate of TEs in leaf chicory is equal to 6.28% of the total contig length, which is much lower than amounts reported for soybean (59%), pigeonpea (52%), alfalfa (27%), trefoil (34%), and chickpea (40%) [25, 27-30]. One of the reasons could be that the BLAST strategy chosen to find repeated elements in the genome was less efficient than specific software (e.g., RepeatScout and RepeatMasker [31, 32]). Another reason could be the lack of TEs in the assembled portion of the Radicchio genome due to the low complexity of these repeated DNA regions.
The BLAST strategy with the nonredundant (NR) pentapetalae protein database produced the best output in terms of similarity with our contigs. This is undoubtedly due to the availability of large collections of sequences from species taxonomically related to leaf chicory, such as Beta vulgaris, Helianthus annuus, and Lactuca sativa, among others. Unfortunately, the depth of annotation of these recently sequenced genomes is frequently not comparable to that of the long-studied Arabidopsis thaliana. Although BLAST results obtained by querying the NR database proved to be highly informative in terms of the number of hits producing alignments with significant e-values, the annotation of the leaf chicory assembled contigs was more successful when the A. thaliana database was used alone. Therefore, a possible way to enrich the current annotation in the future would be to use software (e.g., Blast2GO) that can extract the annotation codes from multiple BLAST hits, apply an appropriate specificity cutoff, and assign the mapped GO terms to the original query.
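The idea of transferring GO terms from several BLAST hits to one query can be sketched in a few lines. This is a deliberately simplified stand-in and not Blast2GO's actual annotation rule: it keeps every hit below an e-value cutoff and assigns the union of the GO terms attached to those subjects. The cutoff of 1e-5, the function name, and the contig and subject identifiers are assumptions made for the example; the two GO terms are those discussed in this section.

```python
from collections import defaultdict


def assign_go_terms(hits, go_by_subject, evalue_cutoff=1e-5):
    """Map each query to the GO terms of its significant BLAST subjects.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    go_by_subject: dict mapping subject_id to a set of GO term IDs.
    evalue_cutoff: assumed threshold; hits above it are ignored.
    """
    assigned = defaultdict(set)
    for query_id, subject_id, evalue in hits:
        if evalue <= evalue_cutoff:
            assigned[query_id].update(go_by_subject.get(subject_id, ()))
    # Drop queries whose significant hits carry no GO annotation at all.
    return {q: sorted(terms) for q, terms in assigned.items() if terms}


# Hypothetical identifiers for illustration only.
hits = [
    ("contig_1", "subjA", 1e-30),
    ("contig_1", "subjB", 1e-2),   # not significant, ignored
    ("contig_2", "subjB", 1e-10),
    ("contig_3", "subjC", 1e-50),  # significant, but subject has no GO terms
]
go_by_subject = {
    "subjA": {"GO:0009555"},  # pollen development
    "subjB": {"GO:0009561"},  # megagametogenesis
}

annotation = assign_go_terms(hits, go_by_subject)
print(annotation)
```

The case of contig_3 mirrors the limitation described above: a strong hit to a poorly annotated subject contributes nothing to the annotation of the query.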
Our choice to use the TAIR10 database to annotate our sequence contigs led to the annotation of a large number of assembled sequences and provided valuable information concerning the putative processes, and possibly the metabolic pathways, in which the genes are active.
The ability to annotate a given number of sequences is dictated not only by the length and quality of the query sequences but also by their match with orthologous sequences, which themselves need to be annotated in depth.
This is the case for annotations of metabolic pathways that are not actively studied, or not present, in A. thaliana and for processes whose study is hampered by biological or physical circumstances. This might explain some discrepancies in annotations for male and female gametogenesis (Figure 6). The graph shows a large discrepancy between the number of contigs assigned to the term “Megagametogenesis” (GO:0009561), just 107, and the number assigned to the term “Pollen development” (GO:0009555), cited in the results as the most prevalent (more than six times that of megagametogenesis). We suppose that this difference is not due to a real difference in the number of genes involved in these two reproductive processes but rather to the lower number of genes known to be involved in female sporogenesis and gametogenesis.
Similarly, enzymes involved in the biosynthesis of germacrene-type sesquiterpenoids, such as germacrene-A synthase (EC 4.2.3.23), which are responsible for the biosynthesis of lactones associated with bitter taste in leaf chicory, are not known or properly characterized in A. thaliana.
Another fundamental finding of our study is the large number of SSR markers that were found in the assembled contigs. The leaf chicory genome shows an unexpected number and distribution of repeated sequences, which we were able to reveal as potential SSR markers by submitting our Radicchio draft to the MISA software. It is also of interest that we were able to link a reasonably large number of microsatellites to each item presented here for both GO terms and KEGG maps. In the results, we presented only a small selection of important characteristics that could be utilized in marker-assisted selection and breeding programs in Radicchio. Together with SSRs, thousands of sequences that could be used in Single Nucleotide Polymorphism (SNP) analysis were associated with fundamental biosynthetic pathways or metabolic enzymes. This is a crucial starting point for modern breeding in leaf chicories.
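The principle behind microsatellite detection can be illustrated with a regular-expression scan for perfect tandem repeats. The sketch below is not MISA and does not reproduce its code or the settings used in this work: the minimum repeat counts per motif length are assumed values chosen for the example, and compound or interrupted repeats are not handled.

```python
import re

# Assumed minimum repeat counts per motif length (hypothetical thresholds).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_redundant_motif(motif):
    """True if the motif is itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_n in sorted(min_repeats.items()):
        # A k-mer followed by at least (min_n - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_redundant_motif(motif):
                continue  # e.g. 'AA' runs are already reported as 'A' repeats
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Toy sequence for illustration: an (AT)7 and an (A)12 microsatellite.
toy = "GGC" + "AT" * 7 + "CCG" + "A" * 12 + "TTGC"
for start, motif, count in find_ssrs(toy):
    print(f"pos {start}: ({motif}){count}")
```

Recording the contig and start position of each repeat is what makes it possible to link microsatellites to annotated GO terms or KEGG maps afterwards.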
Further studies must be conducted to determine whether and how these potential markers could be exploited in molecular breeding programs. As a final step, gene prediction and annotation were also performed according to established computational biology protocols by taking advantage of the reference transcriptome data publicly available for Cichorium intybus L. These sequences allowed us to learn the number, sequence, and role of the ~25,000 genes of the Radicchio genome. This finding represents an important achievement for Italian agricultural genetics as a whole and opens new perspectives in both basic and applied research programs in Radicchio. It will have great impact and offer clear advantages in terms of breeding methods and tools useful for the constitution and protection of new varieties. Information obtained by the sequencing of the genome will be exploitable to detect and dissect the chromosomal regions where the genetic factors that control the expression of important agronomic and qualitative traits are located in Radicchio.
Modern marker-assisted breeding (MAB) technology, which builds on traditional methods by using molecular markers such as SSRs and SNPs and is unrelated to genetic modification (GM) techniques, will now be planned and adopted for the breeding of vigorous and uniform F1 hybrids combining quality, uniformity, and productivity traits in the same genotypes.
In conclusion, our study will help to increase and reinforce the reliability of Italian seed firms and local activities of the Veneto region associated with the cultivation and commercialization of Radicchio plant varieties and food products; the seed market of this species will have the chance to become highly professional and more competitive at the national and international levels. To uncover the sequence of a given genome means to gain a robust scientific background and technological know-how, which in a short time can play a crucial role in addressing and solving issues related to the cultivation and protection of modern Radicchio varieties. In fact, we are confident that our efforts will extend the current knowledge of the genome organization and gene composition of leaf chicories, which is crucial in the development of new tools and diagnostic markers useful for our breeding strategies, and will allow researchers to conduct more focused studies on chromosome regions controlling relevant agronomic traits of Radicchio. In addition, novel research programs for the preservation and valorization of the biodiversity still present in the Radicchio germplasm of the Veneto region are very important and can be accomplished through the genetic characterization of the most locally dominant and historically important landraces, using the sequenced genome information of Radicchio presented in this work.