Summary of the bioinformatic tools used in search for potential SNPs involved in the flavonoid biosynthetic pathways.
This chapter describes the computational approach used in analyzing rice transcriptomics and genomics data to identify and annotate potential single nucleotide polymorphism (SNPs) as potential biomarker in the production of flavonoid. SNPs play a role in the accumulation of nutritional components (e.g. antioxidants), and flavonoid is one of them. However, the number of identified SNPs associated with flavonoid nutritional trait is still limited. We develop a knowledge-based bioinformatic workflow to search for specific SNPs and integration analysis on the SNPs and their co-expressed genes to investigate their influence on the gain/loss of functional genes that are involved in the production of flavonoids. Raw files obtained from the functional genomics studies can be analyzed in details to obtain a useful biological insight. Different tools, algorithms and databases are available to analyze the ontology, metabolic and pathway at the molecular level in order to observe the effects of gene and protein expression. The usage of different tools, algorithms and databases allows the integration, interpretation and the inference of analysis to provide better understanding of the biological meaning of the resutls. This chapter illustrates how to select and bring together several software to develop a specific bioinformatic workflow that processes and analyses omics data. The implementation of this bioinformatic workflow revealed the identification of potential flavonoid biosynthetic genes that can be used as guided-gene to screen the single nucleotide polymorphisms (SNPs) in the flavonoid biosynthetic genes from genome and transcriptomics data.
- comparative genomics
- nutritional traits
- colored rice
- integrative analysis
- single nucleotide polymorphism
- bioinformatics workflow
In recent years, high-throughput omics technologies (i.e. genomics, proteomics, transcriptomics and/or metabolomics) provide unprecedented opportunities to discover potential genes, proteins, metabolites, pathways and molecular markers for various applications. The availability of omics data produced from high-throughput omics technologies has facilitated the molecular and genetic improvement of rice varieties with higher yield, quality, nutrient dan resistance to the biotic and abiotic stresses. However, past 15 years have seen the significant increase of omics data in volume and types. This event has challenged the researchers to extract and decipher invaluable information encoded in the data. This challenge can be addressed with the use of bioinformatics in analyzing, integrating and interpreting these massive omics data. There are various bioinformatic tools and algorithms that can be used by the researchers to process, interpret and integrate data in a more efficient and reproducible way. However, lack of connectivity between various tools and algorithms complicates the process of extracting and deciphering the data. Hence there is the need to find procedures to connect these tools to develop workflows that can connect and bring together different techniques that are able to exploit data at many levels.
Here we describe a computational workflow composed of different bioinformatic tools that exploits data from large-scale gene expression experiments and contextualize them at many biological levels. To illustrate the relevance of our workflow, we applied it to data from rice varieties datasets in search for potential SNPs that are associated with flavonoid biosynthetic genes. The workflow started with identification of known flavonoid biosynthetic genes from published articles and database search using several genome and pathway databases such as. KEGG, PlantReactome, RiceCyc) and similarity search analysis.
The potential flavonoid biosynthetic genes were used as a guide-gene to screen for single nucleotide polymorphisms (SNPs) in the flavonoid biosynthetic genes from the genomics and transcriptomics data. Integration of SNP and co-expressed genes was performed via network analysis. The transcriptomics data was used to construct the gene co-expression network followed by the mapping of SNPs onto it. A pathway-network analysis was performed to interpret the biological information related to the flavonoid pathway-network. All information generated from these computational analyses are stored in MyNutRiceBase (
2. Overview of flavonoids in omics and rice breeding improvement
Several types of rice breeding traits such as disease resistance, drought tolerance, salinity tolerance, grain quality, high nutritional content are currently pursue by the breeders and farmers in their effort to improve the traits of rice varieties. However, nowadays breeding for improved rice varieties with high nutritional content has attracted interest among breeders, geneticists and nutritionists. High protein , carotenoid , micronutrients  and antioxidant content  are among the preferred nutritional contents improved in rice varieties. Rice breeding for high nutritional content has been carried out extensively by bio-fortification [5, 6] and/or genetic engineering . The improvement of nutritional content in rice is essential due to their benefits for human health. Lack of specific nutrition can lead to several diseases and malnutrition.
Flavonoids are secondary metabolites commonly produced in flowers, fruits, vegetables and pigmented rice. Flavonoids are known as a potent antioxidant beneficial for human health. Consumption of foods with high antioxidant contents may lower the risk of cardiovascular disease, type II diabetes and colon cancer . Additionally, flavonoid has shown to improve plant resistance against abiotic and biotic stress .
The flavonoid biosynthetic pathways are found in several crops such as maize, tomato and rice . Genes encoding enzymes involved in the flavonoid biosynthetic pathways are categorized into general phenylpropanoid, early biosynthetic genes (EBGs), late biosynthetic genes (LBGs) and transcription factors . General phenylpropanoid contains three major genes such as phenylalanine ammonia-lyase (PAL), cinnamic acid 4-hydroxylase (C4H) and 4-coumarate CoA ligase (4CL). The EBGs include chalcone synthase (CHS), chalcone isomerase (CHI), flavanone 3-hydroxylase (F3H) and flavanone 3′-hydroxylase (F3’H) . Genes in the general phenylpropanoid category and EBGs are upstream genes that initiate the flavonoid biosynthetic pathways and responsible in the production of secondary metabolites . The LBGs such as dihydroflavonol reductase (DFR), leucoanthocyanidin reductase (LAR), UDP-glucose flavonoid 3-O-glucosyl transferase (UGT) and leucoanthocyanidin oxidase (LDOX) lead to the production of anthocyanin . Genes categorized in EBGs and LBGs are also recognized as structural genes .
Four transcription factors involved in the flavonoid biosynthetic pathways are Kala4 (Os04g0557500), Rc (Os07g0211500), R2R3-MYB (Os06g0205100) and WD40 (Os02g0682500). Kala4 encodes a basic-helix–loop–helix (bHLH) and it activates late biosynthetic genes to produce black pigmentation or anthocyanin in the seed pericarps . Rc gene encodes for bHLH that regulates the Rd (Os01g0633500) expression to produce red pigmentation . However, Rc expression without the participation of Rd has resulted to brown pigmentation . Besides the existence of R2R3-MYB and WD40 in the flavonoid biosynthesis, they also responsible to regulate the pigmentation in purple leaves [14, 15].
Anthocyanin and proanthocyanidins are two major flavonoid compounds . Pigmented rice (black and red) is enriched with antioxidant due to the presence of anthocyanin and proanthocyanidin . Understanding the genetic basis of pigmented rice varieties is essential to develop rice varieties with high antioxidant (flavonoid, anthocyanin, proanthocyanidin) content. A study by  has dissected the regulation of flavonoid biosynthesis in edible rice tissue using a metabolic engineering approach. Bioinformatics approach was used to identify key flavonoid structural genes in the rice genome by species comparison against maize and sorghum as sequence homologs . A total of six genes encoding enzymes (CHS, CHI, F3H, F3’H, DFR and ANS) in the flavonoid biosynthetic pathway were selected for tBLASTN analysis against Nipponbare rice genome sequence. At least 66% amino acid identities were found in the rice genome sequences. The expression patterns of six flavonoid genes were analyzed to investigate the accumulation of flavonoids level in rice seedlings.
Previous study showed the correlation between sequence polymorphism and metabolite profiling affects the flavonoid accumumation between indica and japonica rice sub-species . Different accumulation of flavonoids in these two rice sub-species might due to the variation in flavonoid biosynthetic genes [19, 20]. A quantitative trait loci analysis was performed to develop molecular markers associated with antioxidant content in rice . The potential molecular markers associated with antioxidant content can be applied in the marker-assisted breeding towards the improvement of high-level antioxidant content in rice variety.
Understanding gene-based-SNP underlying flavonoid biosynthesis process is crucial in developing rice cultivars with higher flavonoid contents. Integration of single nucleotide polymorphisms (SNPs) and co-expressed genes can be used to identify causal SNPs and to prioritize the functional SNPs involved with flavonoid biosynthetic genes.
However, the study on molecular and genetic improvement of antioxidant content in rice via integration of multi-omics data is still limited. Current data is still inadequate to be applied in rice breeding selection. Furthermore, this constraint limits detailed insight into the underlying mechanism and regulation of antioxidant content on the system level. The association of SNPs in the causative biosynthetic genes might be useful to uncover key alleles that influence the accumulation of flavonoid. Linking the SNPs with their co-expressed genes that involved in the flavonoid biosynthesis process could be a promising approach to prioritize the functional SNPs and causal genes to be used in the experimental validation towards the molecular and genetic improvement of rice variety enriched with flavonoid content.
3. Data analysis workflow
3.1 Data mining of potential genes and transcription factor related to flavonoid biosynthesis
All genes that are related to the flavonoid genes were identified using similarity search analysis from OmicsBox (
Bibliomic, database and similarity search are the first steps in identifying candidate genes related to the biosynthesis pathways. These genes that can be used in downstream analysis1 such as evolutionary study  and identification of SNPs in the biosynthetic genes . From these steps, a total of 95 flavonoid related genes (FRGs) and two transcription factors were successfully identified. Structural genes (i.e. F3H, CHI, CHS,DFR, LDOX) are found to be more conserved than transcription factors. These flavonoid related genes were used to screen the SNPs that reside in the sequences. Figure 1 shows the three steps analysis workflow to identify the potential flavonoid related genes.
3.2 Mining of SNPs from genome and transcriptome data
Single nucleotide polymorphism (SNP) has been widely used as a genetic marker tool in crop improvement. Previous studies used SNPs to investigate the evolutionary relationship , to facilitate cultivar identification  and to associate SNPs with agronomic traits . SNPs located in the intergenic and genic region . The genic SNPs can be classified into the coding region, untranslated region (UTR) and intron. SNPs in the coding region are divided into synonymous and non-synonymous. The non-synonymous SNPs has two effects such as deleterious and tolerance that might represent causal genetic variation which could lead to the phenotypic consequences .
SNPs in the coding regions are usually associated with functional SNP and they can influence the phenotypic expression of agronomic traits such as resistance to disease, resistance to drought and high nutritional content . The non-synonymous SNPs might affect the protein function through amino acid substitution  whilst synonymous SNP has the ability to affect gene function by regulating mRNA splicing , stability  and protein translation . However, synonymous SNP is not a preferred polymorphism to be further validated. Therefore, the utilization of non-synonymous SNPs is important to prioritize their involvement in the biosynthetic pathway and to determine the effect of non-synonymous SNPs to phenotypic expression.
SNPs in the intron region are able to shift the gene splicing or regulate the transcript level by changing the binding sites of miRNA . Lower number of SNPs in untranslated region (UTR) has demonstrated that mutation in this region can change local mRNA structure and will affect the translation process . SNPs in the UTR region are conserved and consist of binding sites for proteins or antisense RNAs that able to modulate transport, RNA stability, cellular localization, expression level and translation .
Functional effect of SNPs in the flavonoid related genes can affect the expression and regulation of structural genes and transcription factor during the flavonoid biosynthesis process. Previous study has demonstrated that genetic variation caused by SNP can influenced the variation and accumulation of metabolites in the biosynthetic pathway [19, 20]. By selecting SNPs with specific functional effect such as non-synonymous and deleterious, it can be effectively used as a molecular marker in the development of new and improved rice variety with agronomic traits of interest.
However, it is a challenge to depend only on SNPs to reveal the interaction and relationship that occur in the multiple levels of biological mechanisms , especially in the qualitative and complex traits (i.e. abiotic stress, quality, yield). Therefore, numerous studies have been performed to identify and evaluate the functional effects of SNPs through multi-omics data integration. For instance, the integration of whole genome and transcriptome data offers opportunities to highlight the expressed SNPs  and to better understand the biological processes and mechanisms underlying the pigmented rice varieties. Previous study has performed integration of SNPs with gene co-expression network to identify the causal genes and causal SNPs that might be responsible in the mechanism of blast disease resistance , salt tolerance  and amylose content . Throughout these integration processes, several bioinformatics tools have been applied to identify the putative SNPs and to annotate the SNPs into their functional effects.
The identification of SNPs in the flavonoid related genes started with mining of SNPs from the genome and transcriptome datasets of O. sativa indica cv. Bali, PH9, MRM16, MRQ100, MR297 and MRQ76 [43, 44]. Bali, PH9, MRM16 and MRQ100 are pigmented rice varieties that contain high antioxidant contents. MR297 and MRQ76 are white rice varieties with medium tolerance to diseases. Figure 1 shows the bioinformatic workflow to identify SNPs from the genome and transcriptome datasets. The first step in SNPs identification is to map the sequence reads onto selected reference genome sequences; in this case it was Nipponbare. It was chosen because of it was well-assembled and annotated. The genome and transcriptome of reads mapping were individually aligned using different mapping tools such as BWA  for genome reads mapping and TopHat2  for transcriptome reads mapping.
Then PICARD version 0.7.12  was used to add and replace the read groups, marking the duplicate reads and fixing mate information on the mapped reads in order to obtain the high-quality SNPs during SNPs calling. This process is known as post-processing and it is a standard step used to identify potential SNPs from the genome and transcriptome datasets. Once this process is completed, GATK version 3.6  was used in the SNPs calling process and this is a crucial step in SNPs discovery as it helps in obtaining high-quality SNPs as well as reducing false-positive SNPs. In our case, the parameters used are as follow:
mapping quality (MQ) more than and equal to 30 (MQ > = 30);
variant quality more than and equal to 50 (VQ > =50);
number of supporting reads for every base (DP) more than and equal to 10 (DP > = 10)
All SNPs obtained from the filtering process were annotated using several annotation tools such as SnpEff version 4.1 , Variant Effect Predictor (VEP) , CooVar  and Annovar  for their intergenic, genic, coding region, UTR, intron, synonymous and non-synonymous. Most of these tools can be locally installed to ease the users who perform large-number of SNPs annotation. Those SNPs that were annotated as genic were then filtered using R packages (i.e. dplyr, tidyr, sqldf) for the identification of SNP position, chromosome, allele, gene identifier and SNPs effect. This information was then used in matching the SNPs in genome and transcriptome datasets as well as screening all SNPs that reside in the flavonoid related genes.
Comparative SNPs analysis of six rice varieties were performed to investigate the uniqueness, differences and similarities in the potential SNPs. Currently several rice SNP databases are available to providing the information on SNPs that were mined from various rice varieties and sub-specie such as Rice SNP-Seek , Rice VarMap , IC4R , and Ensembl Plants Variation database . Comparative SNPs analysis on SNP databases can provide the SNPs information and usability in various rice varieties and sub-species.
SNP occurs in transcription factor often lead to altering or loss of function of key pathway enzymes that are required to regulating the production of anthocyanin . Hence, it could affect the expression levels of flavonoid. Previous research has investigated that mutations were accumulated during the domestication process, which suggests the presence of agronomically valuable genes in landraces as well as in wild relative . Hence, SNPs in transcription factor, such as Kala4 and Rc, can be prioritized for further integration with genes co-expression network.
3.3 Gene co-expression network analysis from transcriptome data
A gene co-expression network analysis was performed to correlate gene and phenotypic expressions, to infer the function of unknown genes and to identify the key regulatory networks in biosynthetic processes [57, 58]. The principle used in the gene co-expression network shows that genes that cluster in the same network represent similar biological process . To date, the increasing number of genes co-expression network databases such as ATTED, STRING, RED facilitate the exploitation of co-expressed genes from different conditions in various crops . In the co-expressed gene databases, the gene identifier or gene name can be used as a query to search for the co-expressed genes of interest. Parameter, such as maximum number of interactors = 0.1 and confidence score cut-off = 0.40 are used to select the significant co-expressed genes. Most of the co-expressed databases obtained the information from microarray gene expression and transcriptome datasets from public databases such as NCBI Gene Expression Omnibus (NCBI GEO)  and Sequence Read Archive (SRA) .
Gene co-expression network analysis was performed using the expression values of genes in fragments per kilobase of transcript per million mapped reads (FPKM) as an input data. Pearson correlation coefficient (PCC) value was used to measure the co-expression correlation between paired genes in the network. The PCC score represents the confidence level describing the association of the two genes whether they are functionally associated. In this case, the FPKM value with more than and equal to 0.1 (FPKM > = 0.1) was used to perform a gene co-expression network analysis. Pearson Correlation Coefficient (PCC) in corr R package  was used to measure the correlation between paired genes in the network. The cut-off value of PCC more than and equal to 0.7 (PCC > = 0.7) was used for selection of co-expressed genes. The flavonoid related genes were selected as guided-genes to identify their interactors. Figure 1 shows the analysis workflow to perform a gene co-expression network analysis.
Cytoscape v3.7 was used to construct and visualize the gene co-expression network. The Network Analyzer plugin was used to calculate the degree connectivity of each nodes and edges. ShinyGO version 0.6.1  was used in gene ontology and pathway enrichment analysis to elucidate the biological processes and molecular function in clusters of network.
4. Data integration
4.1 Integration of SNPs with co-expressed genes
There are several publications on the bioinformatics approaches on the functional effect of SNPs where by assessing their effect on the functional site of protein structure [64, 65] and integrate with biological pathway . In this study, SNP, especially with non-synonymous effect and deleterious impact is better appreciated to study its impact through the pathway-network analysis coupled with a different omics data type.
SNPs can be integrated with omics data type to rank the high-potential SNPs and to highlight the causal genes for development of genetic markers and functional genomics studies. The integration of SNPs and co-expressed genes through construction of biological network offers a comprehensive interpretation of genetic variation at the biological system level. Integrative approach links biologically meaningful sets of genes to reveal the molecular basis of trait variation.
In this study, integration of SNPs and co-expressed genes was performed to prioritize the functional SNPs that can be suggested as a candidate for the development of molecular markers and to highlight the causal genes that might contribute in the flavonoid biosynthesis process. The co-expressed genes can be suggested as a functional target for functional validation in the future study to unravel their potential in rice breeding improvement.
Integration of SNPs and co-expressed genes was performed using Cytoscape v3.7 . Flavonoid biosynthetic pathway templates in BioPAX, KGXML and SIF formats was retrieved from KEGG, Plant Reactome and RiceCyc databases. To merge the SNP and co-expressed genes, gene identifier (ID) was used as a matching identity. Every gene in the network consists of gene ID, chromosome, start and stop position based on physical genome coordinates (bp). Similarly, SNPs consists of the position in genome locations (bp) and gene ID.
Analysis on network integration between SNPs and co-expressed genes highlight that co-expressed genes can be integrated by multiple numbers of SNPs and will reveal, which SNPs appear to play an essential role in the flavonoid biosynthesis process. The co-expressed genes connected to SNPs can be prioritized as candidate genes. The high false-positive rate in SNPs also can be reduced by incorporating putative functional co-expressed genes information . Several biological questions can be asked from the integration of SNPs and co-expressed genes into pathway-network, such as i) how many functional SNPs and co-expressed genes important to the expression of black and red pigmentation; ii) any regulatory genes regulate the biosynthetic pathway?; iii) and which biological process are underlying this trait.
4.2 Pathway-network analysis
There are different ways in bioinformatics approaches that could be applied to the pathway-network analysis. In pathway-network analysis, the description of connected genes can be interpreted into biologically meaningful information and provide insights into biological processes, molecular function and cellular components. This analysis is known as gene ontology (GO) enrichment and pathway enrichment analysis.
To date, several bioinformatic tools are available to perform GO and pathway enrichment analysis in the network. For example, ClueGO  and BiNGO . Both plugins are available in the Cytoscape and are user-friendly. Statistical values such as Hypergeometric testing and Bonferroni method is used to calculate the p-value. Parameter such as p-value less and equal than 0.05 (p-value <= 005) and a minimum number of mapping entries > = 2 can be used to select the significant or enrich genes in the pathway-network.
|BLASTX||Sequence similarity search analysis|||
|eggNog||Annotation of orthologous and paralogous sequences|||
|TopHat2 v2.3||Transcriptome mapping|||
|Picard v0.7.12||Post mapping processes|||
|GATK v3.6 (HaplotypeCaller)||SNPs discovery|||
|SnpEff v4.1||SNPs annotation|||
|R package (corrr)||Gene co-expression analysis|||
|ShinyGO||Gene ontology enrichment analysis|||
|Cytoscape v3.7||Data integration|||
|Gene ontology enrichment analysis|||
(dplyr, tidyr, sqldf)
|Data cleaning and filtering|||
|Rice Annotation Project Database|
|Is a rice genome database that has been developed by the International Rice Genome Sequencing. Information provided are genome sequences, chromosom, gene annotation dan description.|||
|Kyoto Encyclopedia of Genes and Genomes (KEGG)||A pathway database that provides biological information related to genes, proteins, enzymes and pathways involve in biological systems.|||
|PlantReactome||A plant pathway database that provides user genes, proteins, enzymes and reactions involve in the specific biological systems.|||
|RiceCyc||A rice metabolic pathway database that has been developed to provide predicted biochemical pathways in rice. Several biological information, such as genes, proteins, enzymes and reactions have been displayed in diagram and all the data can be downloaded by user.|||
5. Repository of omics data
Continuously increased number of biological data is in need for a database to systematically store, organize and manage them. To date, several rice databases are available for the application in rice breeding programme, such as rice genome database , SNP databases [53, 54, 78] and pathway databases [75, 76]. Each of these databases have their uniqueness and specific target users.
In this study, the flavonoid related gene, co-expressed gene and SNP are stored in a one-stop database that is specifically developed as a genetics and genomics repository to keep all information related to nutritional traits in rice known as MyNutRiceBase (
To date, several bioinformatics tools are available for the researcher to use and connect in an analysis pipeline. Criteria and parameter are the essential part that always carefully revised and look into while performing the bioinformatics analysis. This review highlights bioinformatics workflow used in the identification of SNPs in genomic and transcriptomic data, gene co-expression network analysis, omics data integration. This result facilitates the interpretation of SNPs into comprehensive biological information with the identification of potential. By integrating these two types of data, it may shed some light on the roles of various genes and SNPs that may play an essential role in the accumulation of flavonoid content in rice.
This work was carried out at the Centre for Bioinformatics Research, Institute of Systems Biology (INBIOSIS), Universiti Kebangsaan Malaysia and Malaysian Agricultural Research & Development Institute (MARDI). The open access publishing fees are funded by GP-2020-K007217 and GP-2019-K021204 grants.
Conflict of interest
The authors declare no conflict of interest.