Genetic Diversity of Coffea arabica L.: A Genomic Approach
DOI: http://dx.doi.org/10.5772/intechopen.96640

Coffea arabica L. produces a high-quality beverage with a pleasant aroma and flavor, but diseases, pests and abiotic stresses often reduce its yield. Improving important agronomic traits of this commercial species therefore remains a target for most coffee improvement programs. Advances in genomic and sequencing technologies make it feasible to understand the coffee genome and the molecular inheritance underlying coffee traits, which helps improve the efficiency of breeding programs. Thanks to the rapid development of genomic resources and the publication of the C. canephora reference genome, third-generation markers based on single-nucleotide polymorphisms (SNPs) have gradually been identified and assayed in Coffea, particularly in C. arabica. However, high-throughput genotyping assays are still needed to rapidly characterize coffee genetic diversity and to evaluate the introgression of different cultivars in a cost-effective way. The DArTseq™ platform, developed by Diversity Arrays Technology, is one such approach; it has attracted increasing interest worldwide because it can generate thousands of high-quality SNPs in a timely and cost-effective manner. These validated SNP markers will be useful for molecular genetics and for innovative approaches in coffee breeding.


Introduction
Coffee is an important crop and the second most traded commodity in the world (after petroleum), providing a living to more than 125 million people. Commercial coffee production relies on only two species of the genus Coffea: Coffea arabica L. (Arabica coffee) and Coffea canephora Pierre ex A. Froehner (Robusta coffee), which supplied 60% and 40% of world coffee production in 2018/19, respectively [1]. Although C. canephora does not have the cup quality of the more popular C. arabica, it continues to be widely grown, especially in regions where farming is of low intensity, because of its tolerance to diseases and pests as well as to abiotic stresses [2].
C. arabica produces a high-quality beverage with a pleasant aroma and flavor, but a range of biotic and abiotic stresses often reduces its yield [3]. Improving important agronomic traits of both commercial species therefore remains a target for most coffee breeding programs. Advances in genomic and sequencing technologies make it possible to understand the coffee genome and the molecular inheritance underlying coffee traits, which helps improve the efficiency of coffee breeding [3].
The development of new genomic tools can help us explore, more deeply and more precisely, genomic diversity at the intra- and inter-specific levels [4]. Two examples of high-throughput platforms are next-generation sequencing (NGS) [5] and DNA microarrays [6]. Compared with whole-genome sequencing, an SNP array provides a faster, cheaper and more straightforward genotyping technology for germplasm screening [7,8].
Thanks to the rapid development of genomic resources and the publication of the reference genome [9], third-generation markers based on single-nucleotide polymorphisms (SNPs) have gradually been identified and assayed in Coffea, particularly in C. arabica [10,11].

Genetic diversity of Coffea arabica L.
The Coffea genus belongs to the Rubiaceae family and includes around 124 species, most of which are diploid (2n = 2x = 22). The only allotetraploid is C. arabica L. (2n = 4x = 44) [12], which originated from a natural cross between Coffea eugenioides S. Moore and C. canephora Pierre ex A. Froehner [13]. C. arabica is also the only self-fertile species among the cultivated ones. This species is genetically less diverse than the diploid species [14,15], a situation that has been associated with its susceptibility to the common coffee diseases [16].
C. arabica is mainly native to the highlands of southwestern Ethiopia, South Sudan (Boma plateau), and northern Kenya (Mount Marsabit). The C. arabica cultivars grown around the world are derived from either the 'Typica' or the 'Bourbon' genetic base [17]. Studies report wide agronomic diversity among Arabica coffee accessions collected in these regions of Ethiopia with regard to leaf size, height, tolerance to biotic and abiotic stresses, and yield [18,19]. In addition, studies using molecular markers have indicated higher genetic variability in Ethiopian (ET) accessions than in cultivars, demonstrating the potential of these accessions for breeding purposes [10,20-23]. These accessions also showed great variability in the metabolite profiles of their coffee beans, which is of interest for cup quality improvement [10,24].
Assessing the population structure and genetic relationships of these ET accessions, both among themselves and in relation to traditional cultivars, is fundamental for the efficient use of their genetic diversity in Arabica coffee breeding programs [25]. However, selecting genetically diverse parental lines on the basis of morphological and agronomic traits is often difficult because of a high degree of morphological similarity [26].
During the past 30 years, molecular markers have been increasingly used in the assessment of germplasm diversity in various crops [27,28]. Molecular information gives insight into the genetic structure of individual genotypes and eventually helps in the accurate selection of superior genotypes to maximize selection gains [29].

C. arabica diversity assessment by molecular markers
Several studies have assessed Arabica genetic diversity, with differing results. In general, across different types of material (cultivars, accessions, hybrids, and spontaneous genotypes), practically all studies show very low genetic variation regardless of the marker system used [3]. Arabica genetic diversity has been evaluated with a range of molecular markers, such as Random Amplified Polymorphic DNA (RAPD) [30,31], Inter Simple Sequence Repeat (ISSR) [32], Simple Sequence Repeat (SSR) [23,29,33,34], and SSR together with Amplified Fragment Length Polymorphism (AFLP) [35,36].
In a recent study presented in the World Coffee Research annual report, a genetic diversity assessment of 800 Arabica accessions from the collection at CATIE, Costa Rica, showed that C. arabica has the lowest genetic diversity when compared with other major crops [37]. This study also found that coffee cultivars contain almost 45% of the genetic diversity found in those 800 accessions, which indicates how limited the variability available to breeding programs is [3]. It is therefore crucial to assess population structure and genetic diversity in the Coffea genus.
Of course, all the C. arabica germplasm available in ex situ collections may represent only a fraction of the total genetic diversity of the remaining wild and semi-wild forest coffees in southwestern Ethiopia [38]. However, Arabica breeders already have an idea of the potential and limits of ET germplasm, particularly with regard to host resistance to diseases and pests. For example, none of the modern Arabica cultivars with host resistance to coffee leaf rust (CLR) derives from this ET germplasm [39]. Likewise, cultivars resistant to coffee berry disease (CBD) outside Ethiopia do not have ET germplasm as progenitors [40], while the nematode resistance found in ET accessions provides only limited protection against the severe nematode problems in Central America [41].
In contrast, ET germplasm may be a good source of sensory quality traits in the cup. The cup quality profile of the new Arabica F1 hybrids developed for Central America is said to derive largely from one of the two progenitors, a selected ET accession of the FAO-1964 pool [42]. Silvarolla et al. found three nearly caffeine-free coffee plants in the offspring of ET germplasm [43]. Male sterility, a character useful for F1 hybrid seed production, has been detected in a few ET accessions [44].

Next generation sequencing techniques in C. arabica
NGS encompasses technologies that produce millions of short DNA sequences at low cost and in a short time. The most commonly used high-throughput platforms for genomic research, especially in non-model plant species, include the second-generation sequencing techniques (SGseqTs) Illumina/Solexa, 454/Roche, ABI/SOLiD, and Helicos, with reads mostly between 25 and 700 bp in length [45]. Results from such research indicate that NGS techniques (NGSTs) should not be restricted to the genomes of model organisms, since non-model plants have also provided useful resources for genomic studies [45].
In contrast to classical molecular markers, SNPs are the most abundant markers, particularly in the non-coding regions of the genome [46]. NGS used jointly with complexity reduction methods, such as genotyping by sequencing (GBS) and DArTseq™ (sequencing-based diversity array technology), enables large-scale discovery of SNPs in a wide variety of non-model organisms [47-49]. These techniques provide measures of genetic divergence and diversity within the major genetic clusters that make up crop germplasm [50].
Although significant, the number of reports on genomic resources in Coffea is still low, even for a species of commercial importance such as C. arabica. SNP genotyping profiles have already been identified and tested in C. arabica by Moncada et al. [57], Sousa et al. [29], Sant'Ana et al. [10] and Merot-L'anthoene et al. [4]. High-throughput genotyping assays are still needed to rapidly characterize coffee genetic diversity and to evaluate the introgression of different cultivars in a cost-effective way. Measures must also be taken to construct high-density genetic maps in Coffea [57,58]; however, SNP markers are still rarely used to generate denser maps.

DArTseq™: an effective tool for genome diversity in C. arabica
The DArTseq™ technology, developed by the DArT company (https://www.diversityarrays.com), is one of the methods that have received increasing interest worldwide, since it can generate thousands of high-quality SNPs in a timely and cost-effective manner [59,60]. The DArTseq™ method, a variation of GBS, implements complexity reduction methods that effectively target low-copy sequences of the genome [61]. In addition, the process is optimized for each organism and type of study by testing combinations of restriction enzymes (REs) and selecting the one most effective at reducing genome complexity [59].
The DArTseq™ technology has been applied to a range of crop species, such as rice (Oryza sativa; [62]), barley (Hordeum vulgare; [63]) and maize (Zea mays; [64]), and it is considered well suited to polyploid genomes because SNP detection relies on high-fidelity REs rather than on the annealing of primers to genomic targets in the presence of homologous annealing sequences [65].
In coffee, we have reported a genetic diversity study of 87 accessions of Coffea spp. These accessions were selected from the National Coffee Germplasm Bank, located at 19°10′27″ N, 96°57′50″ W and 1345 masl in Huatusco, Veracruz, Mexico. The accessions were characterized with the DArTseq™ method and SNP markers in Spinoso-Castillo et al. [66].
As a result, 16,995 SNP markers, derived from 34,000 unique sequences, were obtained by DArTseq™ from the 87 accessions of different Coffea spp. After removing markers with more than 10% missing data and markers with a minor allele frequency (MAF) below 5%, 1,739 polymorphic SNP markers remained for the analysis. After imputation and elimination of markers based on MAF, a heat map of the 87 accessions was obtained from the genomic relationship matrix G (Figure 1).
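The marker filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the 0/1/2 allele-count coding, the use of `None` for missing calls, the function names and the toy data are all hypothetical. Only the two thresholds (10% missing data, MAF of 5%) come from the text.

```python
from typing import Dict, List, Optional

# Assumed coding (not stated in the text): each call is the count of the
# alternate allele (0, 1 or 2), and None marks a missing call.
Calls = List[Optional[int]]


def missing_rate(calls: Calls) -> float:
    """Fraction of accessions with no genotype call for this marker."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: Calls) -> float:
    """MAF computed over the non-missing calls only."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def filter_markers(markers: Dict[str, Calls],
                   max_missing: float = 0.10,
                   min_maf: float = 0.05) -> Dict[str, Calls]:
    """Keep markers with at most max_missing missing data and MAF >= min_maf."""
    return {
        name: calls
        for name, calls in markers.items()
        if missing_rate(calls) <= max_missing
        and minor_allele_frequency(calls) >= min_maf
    }


# Hypothetical toy data: three markers scored on ten accessions.
toy_markers = {
    "snp_a": [0, 1, 2, 1, 0, 2, 1, 1, 0, 2],        # informative, complete
    "snp_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],        # monomorphic, MAF = 0
    "snp_c": [0, None, None, 2, 1, 1, 0, 2, 1, 1],  # 20% missing
}
kept = filter_markers(toy_markers)
print(sorted(kept))  # only snp_a passes both filters
```

In this toy set, `snp_b` is dropped by the MAF filter and `snp_c` by the missing-data filter, which mirrors how the 16,995 raw markers were reduced to a smaller polymorphic set.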
For the heat map, the genomic relationship matrix G can be calculated with the following expression:

$$G = \frac{ZZ^{\top}}{p}$$

where Z is the marker matrix with n = 87 rows (individuals) and p = 1,739 columns (markers), obtained by centering and standardizing the columns of the original marker matrix. The model-based Bayesian cluster analysis in STRUCTURE visualized the structure of the population under examination (Figure 2). Five distinct sub-populations were found across cultivars.
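As a rough illustration of how such a genomic relationship matrix can be computed, the sketch below standardizes each marker column and then forms G = ZZ'/p, a common form of this matrix that is assumed here. It uses only the Python standard library; the function names, the use of the sample standard deviation (n - 1 denominator) and the toy genotypes are assumptions made for the example, not details taken from the study.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]


def standardize_columns(M: Matrix) -> Matrix:
    """Center each marker column to mean 0 and scale it to unit variance."""
    n, p = len(M), len(M[0])
    Z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [M[i][j] for i in range(n)]
        mean = sum(col) / n
        # Assumption: sample standard deviation (n - 1 in the denominator).
        sd = sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        if sd == 0.0:
            raise ValueError(f"marker {j} is monomorphic; filter it out first")
        for i in range(n):
            Z[i][j] = (col[i] - mean) / sd
    return Z


def genomic_relationship_matrix(M: Matrix) -> Matrix:
    """Return the n x n matrix G = Z Z' / p for an n x p marker matrix M."""
    Z = standardize_columns(M)
    n, p = len(Z), len(Z[0])
    return [
        [sum(Z[i][k] * Z[j][k] for k in range(p)) / p for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: 4 individuals (rows) x 3 markers (columns),
# coded as alternate-allele counts.
toy_genotypes = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 2, 1],
]
G = genomic_relationship_matrix(toy_genotypes)
for row in G:
    print([round(value, 3) for value in row])
```

The diagonal of G holds each individual's relationship with itself and the off-diagonal entries hold pairwise relationships, which is the information a heat map such as Figure 1 displays. For the real data set, n = 87 and p = 1,739, and a numerical library would normally replace the nested loops.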
The results obtained from this Coffea spp. central collection are similar to those reported by Sant'Ana et al. [10], whose population structure analyses found two to three groups (K = 2 and K = 3), corresponding to the east and west sides of the Great Rift Valley plus an additional group formed by wild C. arabica accessions collected in the western forests. Sousa et al. [29] analyzed the population structure of coffee genotypes of interest for breeding studies using 11,187 SNP markers and obtained two groups (K = 2).

Advantages and disadvantages of NGS techniques in C. arabica genomics
High-quality reference genome assemblies accelerate plant breeding by enabling the selection of desirable genes linked to improved agronomic traits, including high yield, tolerance to various abiotic and biotic stresses, and resistance to pathogens [68]. However, draft genomes suffer from unknown sequences and ambiguous assemblies caused by homologous sequences, whereas high-quality genomes are required for the comparative genomics and functional annotation that support crop improvement [68,69].
These NGSTs are classified as second and third generation. Their success is mainly due to advances in nanofluidics and automated single-molecule imaging [69]. SGseqTs are the methods that require a PCR step for signal intensification prior to sequencing, whereas third-generation sequencing techniques (TGSeqTs) are those that can perform single-molecule sequencing (SMS) [70].
SGseqTs differ from one another in sequencing chemistry, cost, accuracy, speed and read length. As an advantage over first-generation sequencing methods, they produce thousands to billions of short reads (25-800 nucleotides) [69,70]. As a disadvantage, however, the accuracy of SGseqTs varies because they depend on several amplification steps during library preparation, and each manipulation introduces artifacts into the DNA measurements; in addition, the short reads produced by these procedures are not well suited to de novo genome assembly [69,70].
Novel technologies are therefore being designed to involve minimal or no manipulation of the native DNA molecule; TGSeqTs can analyze native DNA/RNA molecules without manipulation or amplification [70]. TGSeqTs have average read lengths longer than 10 kb, and the availability of such long reads constitutes a great advantage.
The first SMS technology was developed by Quake and commercialized in 2009 by Helicos BioSciences; it worked similarly to Illumina sequencers, but without any bridge amplification [70,71]. However, it was slow and expensive, and it produced relatively short reads of around 35 bp; two single-molecule approaches were therefore developed to overcome these disadvantages [72].
The first approach, Single Molecule Real-Time (SMRT) sequencing, was developed by Craighead, Korlach, Turner and Webb and was further refined and commercialized by Pacific Biosciences (PacBio) from 2011 onwards [73]. The second approach, nanopore sequencing, was first hypothesized in the 1990s and has been further developed and commercialized by Oxford Nanopore Technologies (ONT) since 2005. The advantages of SMRT sequencing over NGS have come at the price of higher per-base sequencing costs [70].
Finally, the DArTseq™ technique is based on genomic complexity reduction. The technique has benefited from developments in NGSTs, and DArTseq™ markers are now replaced by NGS-DArT markers. Sansaloni et al. [60] found that the combined use of DArTseq™ with NGS makes more markers available than the conventional DArT method. DArTseq™ markers, in combination with other molecular techniques, have been used to create denser genetic maps in C. arabica for association studies [4,74,75].

A future in genomic resources of C. arabica
Arabica cultivars and landraces are generally propagated by seed. The mating system is based primarily on self-fertilization, and this autogamy leads to high levels of inbreeding. In addition, an effective clonal propagation system is being adopted, but it is limited to F1 Arabica hybrids. Molecular analyses of genetic diversity are clearly needed to support breeding under this scenario [74,75].
The development of a new coffee variety takes about 25 years. Selection can be made more efficient when sequencing approaches are adopted in the variety development process [66,76]. In the 1990s, marker-assisted selection (MAS) was proposed, which enabled the selection of individuals with specific alleles. However, MAS has proven inefficient for polygenic and/or low-heritability traits [77]. Genome-wide selection (GS), a method of great potential and importance, was developed by Meuwissen et al. [78].
With the development of NGSTs, GS has become a reality for several economically important species. However, the procedure requires caution in polyploid species such as C. arabica, whose subgenomes contain duplicated or highly similar regions [77]. Despite the economic importance of C. arabica, GS studies in Arabica coffee are scarce. Coffee trees have so far been selected on the basis of biometric analyses of phenotypic data on yield and resistance to biotic and abiotic stresses. However, given the complexity and number of genes that control most agronomic traits in this species, GS studies are promising because they allow the effects of all loci that explain the genetic variation to be estimated, together with the genomic estimated breeding value (GEBV) [74,75,77].
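To make the idea of a GEBV concrete, the sketch below fits all marker effects jointly by ridge regression and sums them over a candidate's genotype. Ridge regression is used here only as one simple, widely known GS model; it is not presented as the model used in the cited studies. The diploid-style 0/1/2 coding, the penalty value, the function names and the toy data are all assumptions, and the caution about polyploid subgenomes raised above still applies.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def center_columns(M: Matrix) -> Tuple[Matrix, List[float]]:
    """Subtract each marker's mean; also return the means for later reuse."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[x - m for x, m in zip(row, means)] for row in M], means


def solve_linear_system(A: Matrix, b: List[float]) -> List[float]:
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; increase the ridge penalty")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def ridge_marker_effects(Z: Matrix, y: List[float], lam: float) -> List[float]:
    """Estimate all marker effects at once: (Z'Z + lam*I) beta = Z'(y - mean)."""
    n, p = len(Z), len(Z[0])
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    lhs = [
        [sum(Z[i][a] * Z[i][b] for i in range(n)) + (lam if a == b else 0.0)
         for b in range(p)]
        for a in range(p)
    ]
    rhs = [sum(Z[i][a] * yc[i] for i in range(n)) for a in range(p)]
    return solve_linear_system(lhs, rhs)


def gebv(genotypes: Matrix, means: List[float], effects: List[float]) -> List[float]:
    """GEBV of each candidate: sum of centered genotype times marker effect."""
    return [
        sum((x - m) * e for x, m, e in zip(row, means, effects))
        for row in genotypes
    ]


# Hypothetical toy data: 5 phenotyped trees x 2 markers, plus 1 candidate.
train_genotypes = [[0, 2], [1, 0], [2, 1], [1, 1], [1, 1]]
phenotypes = [7.0, 11.0, 12.0, 10.0, 10.0]
candidates = [[2, 0]]

Z_train, marker_means = center_columns(train_genotypes)
effects = ridge_marker_effects(Z_train, phenotypes, lam=1.0)
print(effects)
print(gebv(candidates, marker_means, effects))
```

The penalty `lam` shrinks every marker effect toward zero, which is what allows more markers than phenotyped individuals to be fitted in real GS data sets; in practice it would be chosen by cross-validation or derived from variance components rather than fixed by hand.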
Genome sequencing initiatives for Arabica accessions have been launched by several research groups (https://coffeegenome.ucdavis.edu/, among others), but an open-access genome assembly with a reliable sorting of homologous sequences is not yet available [77,79]. Decoding the allotetraploid genome of C. arabica is therefore required for accurate GS studies in this species.

Conclusions
DArTseq™ technology identifies thousands of high-quality polymorphic SNP markers in a timely and cost-effective manner. Our study confirmed that genotyping by DArTseq™ can be used successfully in studies of genetic diversity, especially in coffee. In addition, trait-associated SNPs identified by GWAS may help in developing strategies to improve the biochemical quality of coffee or other important traits. These SNP markers may be useful for marker-assisted selection (MAS) and genomic selection in Arabica coffee breeding programs.

Figure 1. Heat map for the 87 accessions of Coffea spp. from the National Bank of Coffee Germplasm in Mexico using DArTseq technology. Small red squares indicate an individual's genetic relatedness to itself, dark orange represents high kinship, and lighter colors (yellow) represent weaker relationships.

Figure 2. Bar plot from the STRUCTURE software used to study the diversity of the 87 coffee accessions with SNP marker data. The 87 genotypes are shown below the plot and were divided into five groups (K = 5).