Summary of currently available database resources for retrieving genomic data for biosynthesis prediction.
We present an overview of computational approaches for the prediction of metabolic pathways by which plants biosynthesise compounds, with a focus on selected very promising anticancer secondary metabolites from floral sources. We also provide an overview of databases for the retrieval of useful genomic data, discussing the strengths and limitations of selected prediction software and the main computational tools (and methods), which could be employed for the investigation of the uncharted routes towards the biosynthesis of some of the identified anticancer metabolites from plant sources, eventually using specific examples to address some knowledge gaps when using these approaches.
- computational prediction
- natural products
- plant metabolism
An immense number of secondary metabolites (SMs) exist in nature, originating from plants, bacteria, fungi and marine life forms, and many serve as drugs for the treatment of life‐threatening diseases, including cancer [1–4]. Taxol, vinblastine, vincristine, podophyllotoxin and camptothecin, for example, are well‐known anticancer drugs of plant origin. The search for drugs against cancer has often resorted to plants and marine life for lead compounds. To illustrate this, Newman and Cragg published a recent study showing that ~49% of drugs used in cancer treatment were either natural products (NPs) or their derivatives . We will henceforth refer to SMs and NPs interchangeably, since NPs are the products of secondary (or specialised) metabolism, as opposed to primary metabolism, which yields molecules that play a key role in the physiological processes of the organism and are thus necessary for the plant’s survival. It should be mentioned that SMs are important for the plant’s defence against attacks by other organisms. Several efforts have also been made towards the collection of data on naturally occurring plant metabolites showing anticancer properties. As an example, Mangal and co‐workers published the naturally occurring plant‐based anti‐cancer compound activity‐target database (NPACT), containing about 1,500 NPs . In addition to the experimentally verified in vitro and in vivo data for these NPs, the authors also included biological activities (in the form of IC50s, ED50s, EC50s, GI50s, etc.), along with physical, elemental and topological properties of the NPs, the tested cancer types, cell lines, protein targets, commercial suppliers and drug likeness of the NPACT compounds. A similar effort was published the following year for NPs from African flora, resulting in a dataset of about 400 compounds, named AfroCancer . 
A further study showed that the NPACT and AfroCancer datasets have little intersection, so that together they provide a combined dataset of about 2,000 NPs . The anticancer properties of some of the most promising AfroCancer compounds have been described in detail in recent reviews [9–12]. Further curation of data from Northern African species has recently resulted in the Northern African Natural Products Database (NANPDB), a large, web‐accessible and completely downloadable database of NPs, with a significant proportion of anticancer metabolites . The NANPDB effort was founded on the observation that the Northern Africa region is particularly richly endowed with diverse vegetation types, serving as a huge reservoir of bioactive natural products [14–16].
For decades, NPs were identified exclusively by chemical identification based on bioactivity‐guided screening approaches. Recently, it has been postulated that genomics and bioinformatics would transform natural product discovery, even though genome mining has so far had little influence on its advancement . Several algorithms have been developed for mining the (meta)genomic data that continue to be generated. Computational methods and tools have been developed for the identification of biosynthetic gene clusters (BGCs, which are physically clustered groups of a few genes in a particular genome that together encode a biosynthetic pathway for the production of a specialised metabolite) in genome sequences and for the prediction of the chemical structures of their products . BGCs for SM biosynthetic pathways are important in bacteria and filamentous fungi, and examples have recently been discovered in plants [19, 20]; some metabolic processes in plants, for example, the thalianol pathway for triterpene synthesis in Arabidopsis thaliana, have been suggested to be controlled by operon‐like gene clusters (clusters of unrelated genes) . This, coupled with the rapid progress in sequencing technologies, has led to the development of new screening methods, which focus on whole genome sequences of the organisms producing the NPs. Genome mining approaches for NP discovery basically focus on:
- identifying the genes of the organism involved in the biosynthesis of the NPs,
- identifying the metabolic pathways by which the NPs are biosynthesised and
- predicting the products of the identified pathways (Figure 1A).
The four main strategies that are mostly employed to identify such pathways are based on processes involved in the production of plant secondary metabolites, namely physical clustering, co‐expression, evolutionary co‐occurrence and epigenomic co‐regulation of the genes [22–25]. Such approaches have been successfully applied for the investigation of fungal and microbial metabolites [26–28]. Since the discovery of the first gene cluster for secondary metabolism in maize (Zea mays) , BGCs for plant secondary metabolism have become an emerging theme in plant biology . It is even believed that synthetic biology technologies will eventually lead to the effective functional reconstitution of candidate pathways using a variety of genetic systems . Knowledge of BGCs and their manipulation is therefore important in understanding how to activate the ‘silent’ gene clusters revealed by whole‐genome sequencing of organisms. This would make available a wealth of new chemical entities (NCEs), which could be evaluated as drug leads and biologically active compounds .
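The physical clustering strategy mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical input in which the genes of one chromosome are listed in order and each is already flagged as a putative specialised-metabolism enzyme or not; the function name, the thresholds `min_enzymes` and `max_gap`, and the gene identifiers are all illustrative assumptions and do not reproduce the algorithm of any published tool.

```python
from typing import List, Tuple

# Hypothetical input: (gene_id, is_putative_SM_enzyme) in chromosomal order.
Gene = Tuple[str, bool]


def find_candidate_clusters(genes: List[Gene],
                            min_enzymes: int = 3,
                            max_gap: int = 1) -> List[List[str]]:
    """Return runs of neighbouring genes holding at least `min_enzymes`
    flagged genes, tolerating at most `max_gap` unflagged genes between
    two consecutive flagged genes."""
    clusters: List[List[str]] = []
    current: List[str] = []   # genes from the first to the last flagged gene
    pending: List[str] = []   # unflagged genes seen since the last flagged gene
    n_enzymes = 0

    for gene_id, is_enzyme in genes:
        if is_enzyme:
            current.extend(pending)   # bridge a tolerated gap
            current.append(gene_id)
            pending = []
            n_enzymes += 1
        elif current:
            pending.append(gene_id)
            if len(pending) > max_gap:
                # Gap too long: close the open run and start afresh.
                if n_enzymes >= min_enzymes:
                    clusters.append(current)
                current, pending, n_enzymes = [], [], 0

    if n_enzymes >= min_enzymes:
        clusters.append(current)
    return clusters


# Toy chromosome with made-up gene identifiers.
toy_genes = [("g1", False), ("g2", True), ("g3", True), ("g4", False),
             ("g5", True), ("g6", False), ("g7", False), ("g8", True)]
toy_clusters = find_candidate_clusters(toy_genes)
print(toy_clusters)
```

In practice, such a positional filter would be combined with the other three lines of evidence (co-expression, evolutionary co-occurrence and epigenomic co-regulation) before a region is treated as a candidate BGC.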
This chapter aims at discussing the metabolic pathways by which plants biosynthesise compounds with anticancer activities, with a focus on selected very promising anticancer SMs from the African flora. We also aim to provide an overview of computational tools, which have been used to predict metabolic pathways and eventually address knowledge gaps when using the former. Additionally, we will present some databases for the retrieval of useful genomic data, discuss the strengths and limitations of selected computational (prediction) tools, which could be employed for the investigation of the uncharted routes towards the biosynthesis of some of the identified anticancer metabolites from plant sources, with specific examples. It is believed that properly addressing knowledge gaps that exist would lay the foundation for proper future investigations.
2. Natural products and plant genomic data
Genome data mining indicates that the vast majority of plant‐based NPs have not yet been discovered [24, 25]. In addition, SMs are normally produced only at later growth stages of plant metabolism and are frequently found only at low concentrations within complex mixtures in plant extracts. The contributing factors include physiological variations, geographic variations, environmental conditions and genetic factors [25, 31, 32]. These factors are the main drawbacks in the isolation and purification of NPs in meaningful quantities for either research or commercial aims. Nowadays, BGCs can be investigated using computational methodologies and used to predict the NPs present in microbial, fungal and floral matter [18, 20, 33, 34]. More than 70 genome sequences for several plant species have been made available, along with a wealth of transcriptome data . However, the interpretation of such data, for example, the translation of predicted sequences into enzymes, pathways and SMs, remains challenging. Advances in bioinformatics and synthetic biology have permitted the cheap and efficient overproduction of secondary metabolites of medicinal interest in heterologous (non‐native) host organisms . This is carried out through the reengineering of BGCs as well as the activation of silent BGCs to yield unreported natural products of the target chemical space [17, 36]. For example, an engineered Escherichia coli strain was used as the heterologous host organism for the production of taxadiene, a vital precursor of the anticancer agent taxol (paclitaxel), which was isolated from the bark of Taxus brevifolia . In this way, quite a number of interesting SMs of plant origin (e.g. resveratrol, vanillin, conolidine, etc.) have been objects of pathway engineering in bacteria, yeast and other plants . 
Thus, chemical libraries of diverse and novel hybrid natural product analogues can now be generated through combinatorial biosynthesis by manipulation of biosynthetic enzymes ; for example, several analogues of the antibiotic erythromycin were obtained via combinatorial biosynthesis . Such bioengineered libraries of ‘unnatural’ natural products show promise in drug discovery campaigns against multidrug‐resistant cancer cells.
3. Some database resources for retrieving secondary metabolism prediction information
A summary of databases for retrieving information on BGCs is provided in Table 1. A majority of them focus on microbial BGCs, for example, ClusterMine360, ClustScan, DoBISCUIT, IMG‐ABC and the Recombinant ClustScan Database. Details on the utility of the aforementioned databases have been provided in excellent recent reviews [26–28, 53]. Further efforts towards the construction of plant‐based BGC and genomic databases include those of the Medicinal Plants Genomics and Metabolomics Resource consortium . This effort has focused on 14 medicinal plants and includes a BLAST search module, a genome browser, a putative function search tool and transcriptome search tools. While this entire database is available for download, similar efforts from the Plant Metabolic Network (PMN) have the advantage of including several plant metabolic pathway databases, mostly for food crops [49, 50]. The PMN, for example, currently houses one multi‐species reference database called PlantCyc and 22 species/taxon‐specific databases, providing access to manually curated and/or computationally predicted information about enzymes, pathways, and more for individual species.
|Database||Description||Strengths||Limitations||References|
|ClusterMine360||A database of microbial polyketide and non‐ribosomal peptide gene clusters.||Users can make contributions. Automation leads to high data consistency and quality data.||Focuses only on microbial PKS/NRPS biosynthesis||[41, 42]|
|ClustScan Database||A database for in silico detection of promising new compounds.||Allows easy extraction of DNA and protein sequences of polypeptides, modules, and domains.||Currently includes data for only 57 SMs (PKS), 51 SMs (NRPS) and 62 SMs (PKS‐NRPS hybrid) biosynthesis.||[43, 44]|
|DoBISCUIT||A database of secondary metabolite biosynthetic gene clusters.||Provides standardised gene/module/domain descriptions related to the gene clusters. Available for download||Contains mostly data relating to bacterial species, mostly of the genus Streptomyces.|||
|GenomeNet||A network of databases and computational services for genome research and related research areas in biomedical sciences.||Provides several web accessible tools, e.g. KEGG, E‐zyme, etc. See Table 2.|||||
|IMG‐ABC||A knowledge base for biosynthetic gene clusters for the discovery of novel SMs.||Integrates structural and functional genomics with annotated BGCs and associated SMs.||Not available for download. Limited to data on microbes|||
|Medicinal Plants Genomics Resource||A database for medicinal plants genome sequence data.||Available for download||Only genomic data for 14 species are currently available.|||
|Medicinal Plants Metabolomics Resource||A database for medicinal plants metabolomics data.||Available for download||Currently limited to metabolite data for 2 medicinal plant species.|||
|Minimum Information about a Biosynthetic Gene cluster (MIBiG)||A community standard for annotations and metadata on biosynthetic gene clusters and their molecular products.||Facilitates the standardised deposition and retrieval of biosynthetic gene cluster data. Useful for the development of comprehensive comparative analysis tools. Available for download|||||
|Plant Metabolic Network (PMN)||Several plant metabolic pathway databases.||Includes species/taxon‐specific data for more than 22 plant species.||||[49, 50]|
|Plant Reactome/“Cyc” Pathways||A pathway database for several crops and model plant species.||Currently includes gene homology‐based pathway projections to 62 plant species.|||||
|Recombinant ClustScan Database||A database of gene cluster recombinants and their corresponding chemical structures.||Provides a virtual compound library, which could be a useful resource for computer‐aided drug design of pharmaceutically relevant chemical entities.||Currently contains only 47 cluster combinations||[44, 52]|
|SMBP||Secondary metabolites bioinformatics portal.||Includes hand‐curated links to all major tools and databases commonly used in the field|||||
The PMN provides a broad network of plant metabolic pathway databases that contain curated information from the literature and computational analyses about the genes, enzymes, compounds, reactions and pathways involved in primary and secondary metabolism in the included plant species. The PlantCyc database also provides access to manually curated or reviewed information about shared and unique metabolic pathways present in over 350 plant species. Plant Reactome, on the other hand, is a pathway database for several crops and model plant species, built on the framework of a eukaryotic cell model. Currently, it uses rice as a reference species, and gene homology‐based pathway projections have been made to 62 plant species .
4. Some computational tools for the analysis of genomic data and specialised metabolism prediction
Some computational tools for biochemical pathway prediction have been summarised in excellent reviews . We have provided a more detailed summary of the main tools that could be useful in analysing plant and microbial genomic data for metabolism prediction in Table 2. Some of the tools are designed for the detection and analysis of specialised metabolism in microbes (e.g. antiSMASH, CompGen, GNP, PRISM and WebAUGUSTUS). Others are specially designed for plant metabolism prediction or may only include data for some specific organisms (e.g. AraNet, MADIBA, miP3v2, PlantClusterFinder, SAVI and WikiPathways for plants), while others are more general tools, useful for both microbial and plant metabolism prediction and BGC analysis (e.g. E‐zyme, KEGG, PathPred and PathComp), and still others are more useful for developers (e.g. Geneious, OptFlux, PathVisio and Pathway GeneSWAPPER) (Figure 1B). We could also classify the tools according to their respective tasks: prediction and analysis of BGCs (e.g. antiSMASH, MADIBA, Pathway GeneSWAPPER, WebAUGUSTUS), searching, visualisation and prediction of biosynthetic pathways and reaction paths (e.g. BioCyc, CycSim, FMM, GNP, KEGG, MetaCyc, PathComp, PathPred, PathSearch, PathVisio, Pathway GeneSWAPPER, PlantClusterFinder, SAVI, WikiPathways for plants), prediction of SMs (PRISM), metabolic engineering (OptFlux) and other functions (miP3v2). Among the tools for specialised metabolism in plants, AraNet is a probabilistic functional gene network of A. thaliana (currently covering a total of 27,029 protein‐encoding genes). It is based on a modified Bayesian integration of data from multiple organisms, each data type being weighted according to how well it links genes that are known to function together in A. thaliana. Each interaction is associated with a log‐likelihood score (LLS), which is a measure of the probability of an interaction representing a true functional linkage between two genes . 
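To make the log‐likelihood score more concrete, the sketch below computes it as the natural logarithm of the ratio between the odds of a true linkage within a dataset and the prior odds in a reference set. The text above does not state the formula, so this form, which is commonly used for probabilistic functional gene networks, is an assumption here, as are the function name and the toy counts.

```python
import math


def log_likelihood_score(linked_in_data: int, unlinked_in_data: int,
                         linked_in_reference: int,
                         unlinked_in_reference: int) -> float:
    """LLS = ln( [P(L|D) / P(not L|D)] / [P(L) / P(not L)] ).

    linked_in_data / unlinked_in_data: gene pairs supported by the dataset
        that do / do not share a known functional annotation.
    linked_in_reference / unlinked_in_reference: the same counts over all
        annotated gene pairs (the prior expectation).
    """
    data_odds = linked_in_data / unlinked_in_data
    prior_odds = linked_in_reference / unlinked_in_reference
    return math.log(data_odds / prior_odds)


# Toy counts (made up): the dataset links 80 true and 20 false pairs,
# whereas only 1 in 10 of all annotated pairs is a true linkage.
lls_informative = log_likelihood_score(80, 20, 1000, 9000)
# A dataset that does no better than chance scores zero.
lls_random = log_likelihood_score(10, 90, 1000, 9000)
print(round(lls_informative, 3), round(lls_random, 3))
```

Under this definition, a positive score means the dataset links functionally related genes more often than expected by chance, and a score of zero means it carries no information.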
On the other hand, MADIBA facilitates the interpretation of Plasmodium and plant (data currently available for Oryza sativa and A. thaliana) gene clusters . This tool eases the task by automating the post‐processing stage during the assignment of biological meaning to gene expression clusters. MADIBA is designed as a relational database and has stored data from gene to pathway for the aforementioned species. Tools within the GUI allow the rapid analyses of each cluster with the view of identifying the Gene Ontology terms, as well as visualising the metabolic pathways where the genes are implicated, their genomic localisations, putative common transcriptional regulatory elements in the upstream sequences, and an analysis specific to the organism being studied.
|Tool||Description||Strengths||Limitations||References|
|antiSMASH*||A web server and tool for the automatic genomic identification and analysis of biosynthetic gene clusters.||Detects putative gene clusters of unknown types. Identifies similarities of identified clusters to any of 1172 clusters with known end products, etc.||Designed for analysis of BGCs in microbes.|||
|AraNet||Gene function identification and genetic dissection of plant traits.||Had greater precision than literature‐based protein interactions (21%) for 55% of tested genes. Is highly predictive for diverse biological pathways.||Applicability is limited to one species ‐ A. thaliana.|||
|BioCyc/CycSim/MetaCyc||Online tools for genome‐scale metabolic modelling.||Support the design and simulation of knockout experiments, e.g. deletions mutants on specified media, etc.||||[57, 58]|
|CompGen||Carry out in silico homologous recombination between gene clusters.||||Focuses on gene clusters encoding PKSs in Streptomyces sp. and related bacterial genera.|||
|E‐zyme||Assignment of EC numbers.||Classifies enzymatic reactions and links the enzyme genes or proteins to reactions in metabolic pathways.|||||
|From Metabolite to metabolite (FMM)||A web server to find biosynthetic routes between two metabolites within the KEGG database.||Both local and global graphical views of the metabolic pathways are designed.|||||
|Geneious||Organisation and analysis of sequence data.||Includes a public application programming interface (API) available for developers. Freely available for download.|||||
|Genomes-to-Natural Products platform (GNP)||Prediction, combinatorial design and identification of PKs and NRPs from biosynthetic assembly lines.||Uses LC–MS/MS data of crude extracts to make predictions in a high-throughput manner.||Focuses on bacterial NPs.|||
|Gene Regulatory network inference ACcuracy Enhancement (GRACE)||An algorithm to enhance the accuracy of transcriptional gene regulatory networks.||Focuses on plant species. Available for download.||Only algorithm is available. Lacks a graphical user interface|||
|KEGG Mapper||A tool to search a biosynthetic pathway.||KEGG is applicable to all organisms and enables interpretation of high-level functions from genomic and molecular data.|||||
|MicroArray Data Interface for Biological Annotation (MADIBA)||A webserver toolkit for biological interpretation of Plasmodium and plant gene clusters.||It allows rapid gene cluster analyses and the identification of the relevant Gene Ontology terms, visualisation of metabolic pathways, genomic localisations, etc.||Only 2 plant species are currently considered [rice (Oryza sativa), and A. thaliana].|||
|miP3v2||Predicts microproteins in a sequenced genome.||Sheds light on the prevalence, biological roles, and evolution of microProteins.||Only the algorithm is available. Lacks a graphical user interface|||
|OptFlux||A software platform for in silico metabolic engineering.||Open source platform. Integrates visualisation tools. Allows users to load a genome-scale model of a given organism. Wild type and mutants can be simulated. Available for download.|||||
|PathComp||Possible reaction path computation.|||||||
|PathPred||Prediction of biodegradation and/or biosynthetic pathways.||Specifically designed for biosynthesis of SMs (in plants) and xenobiotics biodegradation of environmental compounds (by bacteria).|||||
|PathSearch||Search for similar reaction pathways.|||||||
|PathVisio||A biological pathway analysis software that allows users to draw, edit and analyse biological pathways.||Plugins are included, which provide advanced analysis methods, visualisation options or additional import/export functionality. Available for download.||||[68, 69]|
|Pathway GeneSWAPPER||Maps homologous genes from one species onto the PathVisio pathway diagram of another species.||Improves the functionalities of PathVisio and WikiPathways for plants.|||||
|PlantClusterFinder||Predicts metabolic gene clusters from plant genomes.||Focuses on plant species. Available for download.||Only the algorithm is available. Lacks a graphical user interface|||
|Prediction informatics for secondary metabolomes (PRISM)||Genomes to natural products prediction informatics for secondary metabolomes.||Open-source, user-friendly web available application.||Focuses on microbial SMs.|||
|RetroPath||A webserver for retrosynthetic pathway design.||Integrates pathway prediction and ranking, prediction of compatibility with host genes, toxicity prediction and metabolic modeling.||||[72, 73]|
|Semi-Automated Validation Infrastructure (SAVI)||Predicts metabolic pathways using pathway metadata (e.g. taxonomic distribution, key reactions, etc.).||Decides which pathways to keep, remove or validate manually. Available for download.||Only the algorithm is available. Lacks a graphical user interface.|||
|WebAUGUSTUS||Gene prediction tool.||One of the most accurate tools for eukaryotic gene prediction.||Focuses on eukaryotes.|||
|WikiPathways for plants||A community pathway curation portal.||Freely available.||Currently limited to rice and Arabidopsis sp.||[70, 75, 76]|
PlantClusterFinder, SAVI and WikiPathways for plants are all tools designed to assist in the prediction of metabolic gene clusters from plant genomes, although WikiPathways for plants currently includes mostly data for rice and Arabidopsis sp. SAVI has the added advantage of allowing the user to include pathway metadata (e.g. taxonomic distribution, key reactions, etc.) and to decide which pathway(s) to keep and which to remove or validate manually.
5. Some computational methods for efficient production and the de novo engineering of natural products
Two main areas for computational tools can be distinguished: on the one hand, the rational modification of genomes for the production of molecules by host organisms and, on the other hand, the modification or de novo design of gene clusters for the biosynthesis of novel NPs. For both genetic engineering approaches, the already known genomes of bacteria, fungi and, increasingly, plants provide the basic datasets. A very important computational approach for the rational modification of NP-producing host organisms is genome-scale metabolic modelling [77, 78].
Automatic assignments of functional annotations to all genes in a genome are ideally verified by manual curation and enriched with current knowledge about the metabolic network of the organism in question. The curated genome is then used for a fully automatic reconstruction of the metabolic pathways of the cell. These metabolic models are normally encoded in the Systems Biology Markup Language (SBML) and are compatible with various software tools, for example, Cytoscape , which can be applied for static network analyses. For instance, missing enzymes (gaps) within the network become apparent through substrates that are neither taken up nor produced by the cell, as well as products that are neither consumed by other reactions nor secreted from the cell. The RAST annotation pipeline provides a fully automatic server for predicting all gene functions and discovering new pathways in bacterial genomes . Such models can then be used to predict the turnover rate of each reaction in a Flux Balance Analysis (FBA) . Several tools have been built that apply FBA to identify enzymes that should be either introduced or knocked out in the host organism to increase the production rate. A widely used FBA package is the MATLAB-based COBRA Toolbox . CycSim , BioMet  and FAME  are powerful web-based FBA applications that do not require any software installation.
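The static gap analysis described above can be sketched in a few lines. The example assumes a hypothetical, simplified model in which every reaction is irreversible and is given as a mapping from metabolite name to stoichiometric coefficient (negative for substrates, positive for products); the function name, the toy reactions and the metabolite labels are illustrative and are not taken from any of the cited tools.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical model format: reaction id -> {metabolite: coefficient},
# with negative coefficients for substrates and positive for products.
Model = Dict[str, Dict[str, float]]


def find_dead_ends(reactions: Model, uptake: Set[str],
                   secreted: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (never_produced, never_consumed) metabolites.

    never_produced: consumed by some reaction but neither produced by any
        reaction nor taken up from the medium.
    never_consumed: produced by some reaction but neither consumed by any
        reaction nor secreted from the cell.
    """
    produced: Set[str] = set()
    consumed: Set[str] = set()
    for stoichiometry in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            if coefficient > 0:
                produced.add(metabolite)
            elif coefficient < 0:
                consumed.add(metabolite)
    never_produced = sorted(consumed - produced - uptake)
    never_consumed = sorted(produced - consumed - secreted)
    return never_produced, never_consumed


# Toy network: A -> B -> C, and a disconnected step D -> E.
toy_model: Model = {
    "R1": {"A": -1, "B": 1},
    "R2": {"B": -1, "C": 1},
    "R3": {"D": -1, "E": 1},
}
gaps = find_dead_ends(toy_model, uptake={"A"}, secreted={"E"})
print(gaps)
```

In this toy case, D points to a missing upstream enzyme or transporter and C to a missing downstream step. Real models also contain reversible reactions, which this sketch deliberately ignores.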
Within the last 10 years, FBA has been applied to support numerous genetic engineering approaches, for example, for the determination of minimal media in Helicobacter pylori , for growth rate predictions in Bacillus subtilis  or for the development of metabolic engineering strategies in Pseudomonas putida . Based on FBA, it was possible to increase vanillin production in baker’s yeast twofold and to enhance sesquiterpene production in the same species [88, 89].
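For orientation, the optimisation problem solved in an FBA can be written in its standard textbook form. The symbols are assumptions introduced here and do not appear in the text above: S is the stoichiometric matrix of the reconstructed network (metabolites by reactions), v is the vector of reaction fluxes, c selects the objective (for example, growth or product formation), and the bounds encode reversibility, uptake limits and knock-outs.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v_j^{\min} \le v_j \le v_j^{\max} \quad \text{for every reaction } j .
\end{aligned}
```

The constraint S v = 0 expresses the steady-state assumption that no internal metabolite accumulates. A knock-out of reaction j is simulated by setting both of its bounds to zero and solving the problem again.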
The rational modification of a given genome to design novel molecules requires a detailed understanding of the producing gene clusters. Well-studied gene clusters such as those encoding polyketide synthases consist of specific domain types that can be identified by trained hidden Markov models stored in related databases, for example, PFAM . Gene cluster analysis tools such as antiSMASH [55, 91] or PRISM  analyse a given gene cluster to predict the specific domains and to describe the architecture of the cluster. However, the prediction of the structure of the resulting natural products is a difficult task because the substrate recognition of active sites and the correct ordering of enzymatic reactions have to be predicted. If the enzymes in question accept multiple substrates, the availability of each substrate also has to be predicted. Most frequently, the automatic analysis of a cluster is based on the deduction of information from gene clusters similar to the queried one. If well-annotated similar gene clusters do not exist, the prediction of the structure of the biosynthesised NP is challenging. With more and more knowledge about the structures of natural products and the encoding sequences, the relation between the composition of the active sites and substrate binding will be better understood. Existing algorithms are often based on machine-learning approaches and predict the correct substrates for a selected set of enzyme families . For the prediction of NPs synthesised by non-ribosomal peptide synthetases, such a sequence-based prediction method is integrated in the related web-server NRPSpredictor2 . Rational substitution of residues to generate novel molecules still requires a detailed manual analysis of the encoding gene cluster, and new software tools that propose mutations leading to novel molecules might accelerate this approach considerably in the future.
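The idea of deducing information from the most similar known gene cluster can be illustrated with a deliberately simple stand-in: ranking reference clusters by the Jaccard similarity of their domain sets. This is not the comparison method of antiSMASH or PRISM; the reference cluster names and their domain compositions are hypothetical, and only the domain abbreviations (e.g. KS, AT, ACP for polyketide synthases) follow common usage.

```python
from typing import Dict, Iterable, Tuple


def jaccard(domains_a: Iterable[str], domains_b: Iterable[str]) -> float:
    """Jaccard similarity |A & B| / |A | B| of two domain sets."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar_cluster(query: Iterable[str],
                         references: Dict[str, Iterable[str]]
                         ) -> Tuple[str, float]:
    """Return (name, similarity) of the best-matching reference cluster."""
    query_set = set(query)
    scored = ((name, jaccard(query_set, domains))
              for name, domains in references.items())
    return max(scored, key=lambda item: item[1])


# Hypothetical reference clusters with made-up domain compositions.
reference_clusters = {
    "ref_cluster_A": {"KS", "AT", "ACP", "KR", "DH"},   # PKS-like
    "ref_cluster_B": {"C", "A", "PCP", "TE"},           # NRPS-like
}
query_domains = {"KS", "AT", "ACP", "KR"}
best_match = most_similar_cluster(query_domains, reference_clusters)
print(best_match)
```

A low best score is itself informative: it signals the situation described above, in which no well-annotated similar cluster exists and structure prediction becomes unreliable.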
6. Selected natural products with promising anticancer properties from African sources
Recent reviews on the anticancer potential of African flora have discussed the anticancer, cytotoxic, antiproliferative and antitumour activities of about 500 NPs [9–12]. In this section, we focus on the most promising (recent) results for anticancer SMs from African flora (Table 3, Figure 2), published after the last reviews. The isolation of two new lignans, 3α-O-(β-D-glucopyranosyl) desoxypodophyllotoxin (1) and 4-O-(β-D-glucopyranosyl) dehydropodophyllotoxin (2), alongside other known lignans (3 and 4), has been reported from the species Cleistanthus boivinianus (Phyllanthaceae), collected in Madagascar (coordinates 13°06′37″S 049°09′39″E) . These compounds showed potent to moderate antiproliferative activities against the A2780 ovarian cancer cell line, with compound 1 also showing potent antiproliferative activity against the HCT-116 human colon carcinoma cell line (IC50 = 0.03 µM). The known compounds with promising activities from this species included the lignans (±)-β-apopicropodophyllin (3, PubChem CID: 6452099) and (−)-desoxypodophyllotoxin (4, PubChem CID: 345501). The same authors also isolated a new butanolide, macrocarpolide A (5, PubChem CID: 122372160), and two new secobutanolides, macrocarpolides B (6, PubChem CID: 122372161) and C (7, PubChem CID: 122372162), together with other known compounds from the ethanol extract of the roots of the Madagascan species Ocotea macrocarpa (Lauraceae), which showed antiproliferative activities against the A2780 ovarian cancer cell line . The known isolates included the butanolides linderanolide B (8, PubChem CID: 53308122) and isolinderanolide (9, PubChem CID: 44576054). The IC50 values against the A2780 ovarian cancer cell line were 2.57 (5), 1.98 (6), 1.67 (7), 2.43 (8) and 1.65 µM (9). 
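When potencies such as those quoted above for compounds 5 to 9 are compared, it is often convenient to work on a logarithmic scale. The short sketch below ranks the five A2780 values by IC50 and converts them to pIC50, defined as the negative base-10 logarithm of the IC50 in mol/L. The conversion is a standard convention that is not used in the text above, so it is an assumption of this example, as are the variable and function names.

```python
import math


def pic50(ic50_micromolar: float) -> float:
    """Convert an IC50 given in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    return -math.log10(ic50_micromolar * 1e-6)


# IC50 values (µM) against A2780 cells for compounds 5 to 9, as quoted above.
ic50_um = {5: 2.57, 6: 1.98, 7: 1.67, 8: 2.43, 9: 1.65}

# Most potent compound (lowest IC50) first.
ranked = sorted(ic50_um, key=lambda compound: ic50_um[compound])
for compound in ranked:
    value = ic50_um[compound]
    print(f"compound {compound}: IC50 = {value:.2f} uM, "
          f"pIC50 = {pic50(value):.2f}")
```

On this scale the five compounds span less than 0.2 log units, which underlines that their activities against this cell line are very similar.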
Additionally, the leaves of Cleistochlamys kirkii (Annonaceae) from Tanzania have recently been shown to be a rich source of polyoxygenated cyclohexene derivatives with antiplasmodial activities, along with very potent activities against the MDA-MB-231 triple-negative human breast cancer cell line . Among the isolates, cleistodienediol (10), cleistodienol A (11), cleistodienol B (12), cleistenechlorohydrin A (13), cleistenechlorohydrin B (14), cleistenediol F (15), cleistophenolide (16), ent-subglain C (17) and melodorinol (18, PubChem CID: 6438687), some showed IC50 values as low as 0.09 µM against this cell line. To the best of our knowledge, mode of action studies have not yet been conducted for SMs 1 to 18, and in vivo activity data are currently unavailable.
|Cpd. No.*||Molecule class||Source species (Family)||Cancer cell line||IC50 (µM)||Biosynthetic pathway||References|
|1||lignan||Cleistanthus boivinianus (Phyllanthaceae)||HCT-116 human colon carcinoma cell line||0.03||shikimic acid pathway, via phenylalanine|||
|A2780 ovarian cancer cell line||0.02|
|5||butanolide||Ocotea macrocarpa (Lauraceae)||”||2.57|||
|10||polyoxygenated cyclohexene derivative||Cleistochlamys kirkii (Annonaceae)||MDA-MB-231 triple-negative human breast cancer cell line||0.03||Shikimic acid pathway|||
7. Case studies
In this section, we shall discuss specific examples of the investigation of biosynthesis of anticancer plant-based SMs by (computational) analysis of genomic data.
7.1. Biogenesis of several anticancer metabolites by Ocimum tenuiflorum (Lamiaceae)
Species from the genus Ocimum are well known for their high medicinal value and are therefore used to treat a variety of ailments in Ayurveda, an Indian system of medicine [97, 98]. About 30 SMs, with a variety of biological properties, have been reported from the genus Ocimum. Only 14 of these SMs belong to the five basic groups of compounds with complete biosynthetic pathway information in the PMN database [49, 50], leaving ~15 medicinally relevant metabolites from Ocimum sp. with unknown pathways. This has prompted further investigation of SMs with uncharted biosynthetic pathways. Several bioactive SMs, including the anticancer compounds apigenin (19, PubChem CID: 5280443), rosmarinic acid (20, PubChem CID: 5281792), taxol (21, PubChem CID: 36314), ursolic acid (22, PubChem CID: 64945), oleanolic acid (23, PubChem CID: 10494) and the plant steroid sitosterol (24, PubChem CID: 222284), have been identified from the herb Krishna Tulsi (O. tenuiflorum, Lamiaceae), with the mature leaves retaining the medicinally relevant metabolites. Upadhyay et al. carried out a draft genome analysis of the species, generating paired-end and mate-pair sequence libraries for the whole sequenced genome, together with a transcriptomic analysis (RNA-Seq) of two subtypes of O. tenuiflorum (Krishna and Rama Tulsi), and reported the relative expression of genes in both varieties. The authors further investigated the pathways leading to the biosynthesis of the identified SMs, with respect to similar pathways in A. thaliana and other model plants (e.g. Oryza sativa japonica). Six important genes (including Q8RWT0 and F1T282) were found to be expressed from the analysis of the genome data. These were validated by q-RT-PCR on the different studied tissues (e.g. roots, mature leaves, etc.) of five closely related species (e.g. O. gratissimum, O. sacharicum, O. kilmund, Solanum lycopersicum and Vitis vinifera), which showed a high level of ursolic acid-producing genes in young leaves. The other identified anticancer metabolites included eugenol and ursolic acid. As an example, the authors employed sequence search algorithms to look for the three enzymes of the three-step synthetic pathway of ursolic acid from squalene in the Tulsi genome. Each of these enzymes (squalene epoxidase, α-amyrin synthase and α-amyrin 2,8 monooxygenase) was queried from the PlantCyc database, starting from its protein sequence. The search for analogous enzymes in the model plants O. sativa japonica and A. thaliana showed sequence identity covering from 50 to 80% of the query length. The whole genome and sequence analysis of O. tenuiflorum suggested that small amino acid changes at the functional sites of genes involved in metabolite synthesis pathways could confer special medicinal (particularly anticancer) properties on this herb.
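To make the two quantities reported in such a sequence search concrete, percent identity and query coverage can be computed from a single gapped pairwise alignment. The following is a minimal sketch in Python and not the procedure used by the authors. The function name, the toy sequences and the query length are hypothetical, and identity is taken as matches divided by alignment length, which is one common convention (the one used in BLAST reports); other tools may define it differently.

```python
def identity_and_coverage(aligned_query, aligned_subject, query_length):
    """Return (percent identity, percent query coverage) for one gapped
    pairwise alignment, with '-' marking gap positions."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query or query_length <= 0:
        raise ValueError("alignment and query length must be non-empty")

    matches = 0          # columns where both residues are identical
    query_residues = 0   # query residues that fall inside the alignment
    for q, s in zip(aligned_query, aligned_subject):
        if q != "-":
            query_residues += 1
            if q == s:
                matches += 1

    # Assumption: identity is normalised by the alignment length (gaps included).
    identity = 100.0 * matches / len(aligned_query)
    coverage = 100.0 * query_residues / query_length
    return identity, coverage


# Toy alignment (hypothetical sequences, not taken from the Tulsi genome).
identity, coverage = identity_and_coverage("MKT-LLVA", "MKTALIVA", query_length=10)
print(f"identity = {identity:.1f}%, query coverage = {coverage:.1f}%")
```

In practice these values would be read from the output of the search tool rather than recomputed, but the sketch shows why a hit can have high identity and still cover only part of the query.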
7.2. Biosynthesis of the anticancer alkaloid noscapine by Papaver somniferum (Papaveraceae)
Noscapine (25, PubChem CID: 275196) is an antitumour phthalideisoquinoline alkaloid from opium poppy (Papaver somniferum, Papaveraceae). Compound 25 is known to bind stoichiometrically to tubulin, alter its conformation and affect microtubule assembly (promoting microtubule polymerisation), hence arresting cells in metaphase and inducing apoptosis in many cell types. It has been demonstrated that the compound has potent antitumour activity against solid murine lymphoid tumours (even when the drug was administered orally). This drug has also shown potency against human breast, ovarian and bladder tumours implanted in nude mice and in dividing human cells [102, 103]. Although the compound is water-soluble and absorbed after oral administration, its chemotherapeutic potential in human cancer could not be fully exploited in drug discovery projects because, as with most SMs, only small amounts are typically produced in the slow-growing plant species. Improving the production levels of the NP is therefore essential for drug discovery. However, this requires a proper understanding of the biological processes underlying the biosynthesis of this SM, which has been known since the 1960s, from isotope-labelling experiments, to be derived from scoulerine. Winzer et al. carried out a transcriptomic analysis with the aim of elucidating the biosynthetic pathway of this important metabolite, in order to improve its commercial production in both poppy and other systems. The analysis of a high noscapine-producing poppy variety, HN1, showed the exclusive expression of 10 genes encoding five distinct enzyme classes, whereas five functionally characterised genes (BBE, TNMT, SalR, SalAT and T6ODM) were present in all three of the studied poppy varieties, HM1, HT1 and HN1, which are rich in morphine, thebaine and noscapine, respectively. 
The authors analysed the expressed sequence tag (EST) abundance and discovered some previously uncharacterised genes expressed in HN1, which were completely absent from the other (HM1 and HT1) EST libraries. This led to the identification of the corresponding enzymes as three O-methyltransferases (PSMT1, PSMT2, PSMT3), four cytochrome P450s (CYP82X1, CYP82X2, CYP82Y1 and CYP719A21), an acetyltransferase (PSAT1), a carboxylesterase (PSCXE1) and a short-chain dehydrogenase/reductase (PSSDR1). Further analysis of an F2 mapping population, using HN1 and HM1 as parents, indicated that these genes are tightly linked in HN1. Moreover, bacterial artificial chromosome sequencing confirmed the existence of a complex BGC for plant alkaloids. Virus-induced gene silencing resulted in the accumulation of pathway intermediates, thus allowing gene function to be linked to noscapine synthesis. Based on the knowledge derived from the investigation, the authors could make suggestions for the improved production of noscapine and related bioactive molecules by the molecular breeding of commercial poppy varieties or the engineering of new production systems [104, 106].
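The EST comparison described above amounts to a presence/absence filter across libraries. The idea can be sketched in Python as follows, under stated assumptions: the function name and the data layout are hypothetical, and the tag counts are invented for illustration only and are not values from the study of Winzer et al.

```python
def exclusive_to(est_counts, target, others):
    """Return the genes with at least one EST in the target library
    and no ESTs in any of the other libraries."""
    return sorted(
        gene
        for gene, libraries in est_counts.items()
        if libraries.get(target, 0) > 0
        and all(libraries.get(name, 0) == 0 for name in others)
    )


# Invented EST counts per library (illustrative numbers, NOT measured data).
toy_counts = {
    "PSMT1":   {"HN1": 42, "HM1": 0,  "HT1": 0},
    "CYP82X1": {"HN1": 17, "HM1": 0,  "HT1": 0},
    "BBE":     {"HN1": 9,  "HM1": 12, "HT1": 8},
    "TNMT":    {"HN1": 5,  "HM1": 7,  "HT1": 6},
}

print(exclusive_to(toy_counts, "HN1", ["HM1", "HT1"]))
```

With real data, a strict zero-count criterion would probably need to be relaxed to allow for sequencing depth and background noise, for example by requiring a minimum count in the target library.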
7.3. Biosynthesis of vinblastine and vincristine by Catharanthus roseus (Apocynaceae)
Vinblastine (26, PubChem CID: 13342) and vincristine (27, PubChem CID: 5978) are chemotherapy drugs used to treat a number of cancer types. These are among the >120 known terpenoid indole alkaloids from the medicinal plant C. roseus, also known as the Madagascar periwinkle. Since these two very important anticancer compounds are only produced in very low amounts in C. roseus, as opposed to the fairly high levels of several monomeric alkaloids (e.g. ajmalicine and serpentine), attempts to improve the yields of compounds 26 and 27 have led to the genome-wide transcript profiling of elicited C. roseus cell cultures by cDNA-amplified fragment-length polymorphism, combined with metabolic profiling. This resulted in the identification of several gene-to-gene and gene-to-metabolite networks, obtained by correlating the expression profiles of 417 gene tags with the accumulation profiles of 178 metabolite peaks. The results showed that different branches of terpenoid indole alkaloid biosynthesis and various other metabolic pathways are affected by differences in hormonal regulation. Thus, the investigations of Rischer et al. laid the foundations for a proper understanding of secondary metabolism in C. roseus, thereby enhancing the applicability of metabolic engineering to the Madagascar periwinkle. This study made it possible to explore a select number of genes (e.g. STR, 10HGO, T16H and DAT) involved in the biosynthesis of terpenoid indole alkaloids.
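A gene-to-metabolite network of this kind can be approximated by correlating each expression profile with each metabolite accumulation profile across the same samples and keeping the strongly correlated pairs as edges. The sketch below uses the Pearson correlation coefficient with a fixed cut-off; this is an assumption made for illustration and not necessarily the statistic or threshold used by Rischer et al. The tag names, peak names and profile values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def gene_metabolite_edges(gene_profiles, metabolite_profiles, threshold=0.9):
    """Return (gene, metabolite, r) for every pair with |r| >= threshold."""
    edges = []
    for gene, expression in gene_profiles.items():
        for peak, accumulation in metabolite_profiles.items():
            r = pearson(expression, accumulation)
            if abs(r) >= threshold:
                edges.append((gene, peak, round(r, 3)))
    return edges


# Hypothetical profiles over four time points (illustrative values only).
genes = {"tag_A": [1, 2, 3, 4], "tag_B": [4, 3, 2, 1], "tag_C": [1, 3, 1, 3]}
peaks = {"peak_1": [2, 4, 6, 8]}

print(gene_metabolite_edges(genes, peaks))
```

Both positively and negatively correlated pairs are retained, since a gene tag whose expression falls as a metabolite accumulates may be as informative as one that rises with it.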
8. The way forward
The case studies show that the detailed computational analysis of the transcriptomic and metabolomic data of a plant species could reveal its metabolic capacity and hence help identify candidate genes involved in the biosynthesis of the important SMs it contains. Thus, modifying the plant genes could provide a basis for improving metabolite yield. It should be mentioned that other compounds from some of the aforementioned compound classes (Table 3), from both floral and microbial sources, have shown promising anticancer activities [109–113]. For example, isolinderanolide B (28, PubChem CID: 53308122) (Figure 3), a butanolide from the stems of Cinnamomum subavenium (Lauraceae), has shown antiproliferative activity in T24 human bladder cancer cells by blocking cell cycle progression and inducing apoptosis. In addition, subamolide B (29, PubChem CID: 16104907), another butanolide from the same species, is known to induce cytotoxicity in human cutaneous squamous cell carcinoma through mitochondrial and CHOP-dependent cell death pathways. Meanwhile, obtusilactone B (30, PubChem CID: 101286261), from Machilus thunbergii (Lauraceae), is known to target barrier-to-autointegration factor to treat cancer.
From the African flora, apart from the Lauraceae, Phyllanthaceae and Annonaceae, known to be rich in anticancer metabolites, the genus Tacca of the yam family (Dioscoreaceae) is known for the abundant presence of taccalonolides, which are microtubule stabilisers with clinical potential for cancer treatment. Additionally, the genus Tamarix (e.g. T. aphylla and T. nilotica from Northern Africa), together with the genus Reaumuria (Tamaricaceae), is known for the abundant presence of tannins (gallo-ellagitannins, gallotannins) with remarkable cytotoxic effects. The high salt content of the leaves of Tamarix species, which renders them useful locally as a fire barrier, and their adaptability to drought and high salinity are of equal interest. It therefore becomes urgent to investigate the genomics of some of the aforementioned plant species, particularly those from Cinnamomum sp., Ocotea sp. and Machilus sp. (Lauraceae), Tacca sp. (Dioscoreaceae), Cleistanthus sp. (Phyllanthaceae), Cleistochlamys sp. (Annonaceae), Tamarix sp. (Tamaricaceae) and so on, and hence to further investigate the genes or BGCs responsible for secondary metabolism, with a view to understanding and better exploring the biosynthetic pathways of the anticancer SMs.
It has been our intention in this chapter to provide a detailed overview of the important computational tools and resources for the analysis of plant genomic data and for the prediction of biosynthetic pathways in plants. We have taken a few case studies of anticancer SMs to illustrate this. Even though it is unclear how widespread gene clusters are in plants, the genes that encode the biosynthesis of several small plant SMs are well known, including the vital genes for the production of some highly potent anticancer drugs. With the use of the tools and databases described, along with the drop in the cost of whole genome sequencing of plant species, the future discovery of new plant-based anticancer metabolites would involve the identification of one or more genes or BGCs encoding enzymes in the biosynthetic pathway of the target compound(s), followed by co-expression analysis, which also exploits knowledge of the chemical structure of the target compound, for the identification of other enzymes that might be involved in this pathway. As an example, the exploration of the pathway for podophyllotoxin biosynthesis by the use of transcriptome mining in Podophyllum hexandrum led to the identification of biosynthetic gene candidates, 29 of which were combinatorially expressed in the tobacco plant (Nicotiana benthamiana), leading to the identification of six pathway enzymes, among which is an oxoglutarate-dependent dioxygenase responsible for closing the core cyclohexane ring of the aryltetralin scaffold. Alternatively, if the metabolic pathway and the nature of the SMs are unknown, the identified co-expressed genes encoding enzymes of secondary metabolism could be combined with untargeted metabolomics for the elucidation of unknown pathways and chemical structures. 
As an example, a single pathogen-induced P450 enzyme, CYP82C2, was used in combination with untargeted metabolomics and co-expression analysis to uncover the complete biosynthetic pathway leading to the metabolite 4-hydroxyindole-3-carbonyl nitrile (4-OH-ICN), previously unknown in Arabidopsis sp. This rare and hitherto unprecedented plant metabolite, with a cyanogenic functionality, revealed a hidden capacity of Arabidopsis sp. for cyanogenic glucoside biosynthesis. This was confirmed by expressing the 4-OH-ICN biosynthetic enzymes in Saccharomyces cerevisiae and Nicotiana benthamiana to reconstitute the complete pathway in vitro and in vivo, thus validating the functions of the enzymes involved in the pathway.
FNK acknowledges a Georg Forster fellowship from the Alexander von Humboldt Foundation, Germany. CVS is currently a doctoral candidate financed by the German Academic Exchange Services (DAAD), Germany.
|Abbreviation|Definition|
|---|---|
|AfroCancer|African Anticancer Natural Products Database|
|BGC|Biosynthetic gene cluster|
|EC50|Half maximal effective concentration, that is, the concentration of a drug, antibody or toxicant which induces a response halfway between the baseline and the maximum after a specified exposure time|
|ED50|Median effective dose, that is, the dose that produces the desired effect in 50% of a population|
|FBA|Flux Balance Analysis|
|GI50|50% growth inhibition, that is, the drug concentration resulting in a 50% reduction in the net protein increase|
|IC50|The drug concentration causing 50% inhibition of the desired activity|
|IMG-ABC|The Integrated Microbial Genomes Atlas of Biosynthetic gene Clusters|
|NANPDB|Northern African Natural Products Database|
|NPACT|Naturally Occurring Plant-based Anti-cancer Compound Activity-Target Database|
|NRPS|Nonribosomal peptide synthetase|
|PMN|Plant Metabolic Network|
|PRISM|PRediction Informatics for Secondary Metabolomes|
- The authors declare that they have no competing interests.