Open access peer-reviewed chapter - ONLINE FIRST

Chemoinformatic Approach: The Case of Natural Products of Panama

By Dionisio A. Olmedo and José L. Medina-Franco

Submitted: March 29th 2019. Reviewed: June 7th 2019. Published: July 31st 2019.

DOI: 10.5772/intechopen.87779

Abstract

Chemoinformatic analysis was used to characterize a compound database of natural products from Panama and to compare it with other reference collections. Data mining made it possible to compare drug-likeness properties using public and commercial software and to carry out a statistical analysis of the physicochemical properties. Visualization of the chemical space in 3D indicates a high structural similarity between the collections. Molecular flexibility and complexity were evaluated using 2D descriptors, and molecular scaffolds were obtained using the Murcko method; both analyses showed few differences between the explored data sets. In this chapter, we also present and discuss an example of the chemoinformatic approach that uses activity landscape modeling to study the structure-activity relationships (SARs) of compounds with activity against Plasmodium falciparum.

Keywords

  • chemoinformatic
  • complexity
  • data mining
  • physicochemical properties
  • scaffold

1. Introduction

Natural products (NPs) and their derivatives constitute a significant fraction of approved drugs [1, 2, 3], bioactive compounds [4, 5, 6, 7, 8], and lead compounds for drug discovery [9]. NP fragments have been used to guide the synthesis of bioactive compounds and to generate biology-oriented synthesis (BIOS) combinatorial libraries [10, 11, 12, 13, 14, 15]. NPs have structures with different substituent patterns, which give rise to different biological activities for compounds with very similar structures [16, 17, 18, 19]. These bioactive metabolites have greater affinity for biological targets and, overall, may have better bioavailability than synthetic compounds; pan-assay interference compounds (PAINS) are also less frequent among them [20]. Chemoinformatic analyses of several NP databases developed by academic institutions and private companies [21] have been carried out in different countries. The databases analyzed include BIOFACQUIM [22], CIFPMA [23], NuBBE [24, 25], NANPDB [26], TCM [27], HIT [28], and NPACT [29]. The application of chemoinformatic tools involves the generation, manipulation, and analysis of data sets of chemical substances. Through mathematical calculations, these tools allow us to order, develop, and evaluate structural information that can be visualized in 2D and 3D [30]. In these studies, the physicochemical properties of different NP databases were determined, and principal component analysis (PCA) was used as an approximation to display their chemical spaces [22, 23, 24, 31, 32, 33, 34, 35, 36, 37].

Computational exploration of NPs has increased in recent years, giving greater relevance to studies that include structural diversity metrics calculated with distance-based parameters such as the Euclidean, Manhattan, and cosine distances. Other criteria are based on circular fingerprints (ECFP-4, ECFP-6) [22, 23, 24, 38, 39, 40, 41, 42, 43, 44, 45] and substructure-based fingerprints (MACCS, PubChem) [22, 23, 24, 39, 40, 41, 42, 43, 44, 45]. Another metric used for NPs is similarity comparison with the Tanimoto index (Tanimoto coefficient) [22, 23, 24, 45, 46, 47, 48, 49].
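As a minimal illustration of the distance-based metrics named above, the following Python sketch computes the Euclidean, Manhattan, and cosine distances between two descriptor vectors. The function names and the example vectors are ours and purely illustrative; they are not taken from any database or software used in this chapter.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical descriptor vectors (illustrative values only).
u = [1.0, 2.0, 2.0]
v = [4.0, 6.0, 2.0]

print(euclidean(u, v))        # 5.0
print(manhattan(u, v))        # 7.0
print(cosine_distance(u, v))  # small value: the vectors point in similar directions
```

In practice the descriptors would usually be scaled before computing distances, since properties such as molecular weight and logP have very different numeric ranges.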

In this study, the molecular scaffolds of natural products have been obtained using the Murcko method [22, 23, 24, 50, 51, 52, 53, 54, 55, 56, 57]. Molecular complexity, in turn, is frequently evaluated with 2D descriptors such as the fraction of sp3 hybridized carbons (Fsp3) [23], the fraction of chiral centers (FCC) [23], and globularity [22, 23, 24, 58, 59, 60, 61, 62, 63].

An updated version of the Natural Products Database of the University of Panama (UPMA), containing 454 compounds (unpublished data), has been evaluated against different biological endpoints and targets: cytotoxicity bioassays in cell lines, in vitro antifungal assays, parasites that cause tropical diseases (Leishmania sp., Plasmodium falciparum, and Trypanosoma cruzi), and bioassays against the HIV-1 virus, in which the compounds showed inhibitory effects on protease, reverse transcriptase, nuclear factor NF-kappaB, and the Tat protein, thereby affecting viral replication. These are the most significant biological targets against which the natural products from Panama show bioactivity. The values of their biological activities are represented as percentages in Figure 1.

Figure 1.

Biological endpoints and targets in which natural products from Panama present bioactivity.

2. Application of chemoinformatics to antimalarial databases: the case of natural products from Panama

2.1 Preparation, curation, and processing of data sets

In this chapter, we present a chemoinformatic analysis of natural products with in vitro antimalarial activity, expressed as pIC50, against sensitive and resistant strains. Databases of natural products with antimalarial activity (NPAs) were constructed in-house by reviewing published articles and including those compounds that had been isolated and characterized by nuclear magnetic resonance (NMR) spectroscopy. The resulting 1312 compounds were compared with seven reference data sets: DrugBank (antimalarial drugs), an open database; the European Bioinformatics Institute CHEMBL drug indications set (antimalarial activities); Open Source Drug Discovery (OSDD) Malaria; the Malaria Box of the Medicines for Malaria Venture (MMV); St. Jude Children’s Research Hospital (St. Jude); Novartis (GNF Malaria Box); and the GlaxoSmithKline (GSK) Tres Cantos antimalarial set. All data sets were curated using the “Wash” function implemented in the Molecular Operating Environment (MOE 2018.0101) software [64]. The structures of the studied compounds were represented in simplified molecular input line entry system (SMILES) notation, giving 20,364 unique molecules, which are summarized in Table 1. The number of unique compounds is lower than the number of initial compounds because, during data preparation (curation), duplicate compounds are eliminated, positively or negatively charged species are neutralized by adjusting their protonation states, metals are disconnected, and the energy is minimized using the Merck molecular force field (MMFF94).
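Because activities throughout this chapter are expressed as pIC50, it may help to recall the standard definition, pIC50 = -log10(IC50 in mol/L). The short Python sketch below applies it; the function name, the unit table, and the example concentrations are ours and are not measured values from the NPAs database.

```python
import math

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

def pic50_from_ic50(ic50, unit="uM"):
    """Return pIC50 = -log10(IC50 expressed in mol/L)."""
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"unknown unit: {unit}")
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])

# Hypothetical concentrations, for illustration only.
print(round(pic50_from_ic50(1.0, "uM"), 2))    # 6.0
print(round(pic50_from_ic50(100.0, "nM"), 2))  # 7.0
print(round(pic50_from_ic50(25.0, "uM"), 2))   # 4.6
```

A higher pIC50 therefore means a more potent compound, and a difference of one pIC50 unit corresponds to a tenfold difference in IC50.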

| Database | Initial compounds | Unique compounds | Source |
| --- | --- | --- | --- |
| Natural Products Antimalarial (NPAs) | 1353 | 1312 | In-house NP databases |
| DrugBank Version 5.0 (antimalarial drugs) | 26 | 4 | https://www.drugbank.ca |
| European Bioinformatics Institute (CHEMBL drug indications) (antimalarial activities) | 27 | 24 | https://www.ebi.ac.uk/chembl |
| Open Source Drug Discovery (OSDD) Malaria | 93 | 88 | http://opensourcemalaria.org/ |
| Malaria Box, Medicines for Malaria Venture (MMV) | 124 | 124 | https://www.ebi.ac.uk/chembl/malaria/source |
| St. Jude Children’s Research Hospital | 1478 | 1478 | https://www.ebi.ac.uk/chemblntd |
| Novartis-GNF Malaria Box | 4878 | 4868 | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3941073/ and https://www.ebi.ac.uk/chemblntd |
| GlaxoSmithKline Tres Cantos Antimalarial | 12,470 | 12,466 | Open Source Malaria (GSK-TCMDC), https://www.ebi.ac.uk/chemblntd |

Table 1.

Databases analyzed with chemoinformatic tools.

2.2 Molecular descriptors

The descriptors of physicochemical properties, namely hydrogen bond acceptors (HBAs), hydrogen bond donors (HBDs), number of rotatable bonds (NRBs), the octanol/water partition coefficient (logP), topological polar surface area (TPSA), and molecular weight (MW), along with others such as molar refractivity, are important parameters for quantitative structure-activity relationship (QSAR) analysis. These molecular descriptors underlie Lipinski’s rule and Veber’s rule for predicting the drug-likeness of compounds with potential oral activity [65, 66, 67]. The statistical analysis of the physicochemical properties was performed with RStudio Software 1.0.136 AGPL [68].
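To make the two rules concrete, the sketch below checks a set of precomputed descriptors against the commonly quoted thresholds: Lipinski (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10) and Veber (NRB ≤ 10, TPSA ≤ 140 Å²). It is written in Python rather than the R environment used in this work, the class and function names are ours, and the descriptor values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    mw: float    # molecular weight (Da)
    logp: float  # octanol/water partition coefficient
    hbd: int     # hydrogen bond donors
    hba: int     # hydrogen bond acceptors
    nrb: int     # number of rotatable bonds
    tpsa: float  # topological polar surface area (A^2)

def lipinski_violations(d):
    """Count how many of the four Lipinski criteria are violated."""
    return sum([d.mw > 500, d.logp > 5, d.hbd > 5, d.hba > 10])

def passes_veber(d):
    """Veber criteria: at most 10 rotatable bonds and TPSA of at most 140 A^2."""
    return d.nrb <= 10 and d.tpsa <= 140

# Hypothetical compounds (values are illustrative, not from the NPAs data set).
small = Descriptors(mw=324.4, logp=3.4, hbd=1, hba=4, nrb=4, tpsa=45.6)
large = Descriptors(mw=812.0, logp=6.2, hbd=7, hba=14, nrb=15, tpsa=210.0)

for name, d in [("small", small), ("large", large)]:
    print(name, lipinski_violations(d), passes_veber(d))
```

Lipinski's rule is usually applied by tolerating at most one violation, so a count of 0 or 1 is read as compliant.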

2.3 3D visualization of chemical space of compounds with antimalarial activity

PCA was performed with MOE software [64]; the dominant features are expressed as covariance and visualized as the corresponding 2D or 3D score plots with the DataWarrior program v. 5.0 [69]. Figures 2–8 show the distribution of the different compounds with antimalarial activity in chemical space.
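The idea behind the score plots can be sketched in a few lines of Python. The example below centers a small descriptor table, builds its covariance matrix, and extracts only the first principal component by power iteration. This is a didactic sketch under our own assumptions, not the algorithm implemented in MOE; the function names and the toy data are ours.

```python
import math

def center(rows):
    """Subtract the column mean from every value."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already centered data."""
    n, dims = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def first_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue by power iteration.

    Assumes the starting vector is not orthogonal to the leading eigenvector.
    """
    dims = len(cov)
    vec = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        prod = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in prod))
        if norm == 0.0:
            break
        vec = [x / norm for x in prod]
    eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                     for i in range(dims) for j in range(dims))
    return vec, eigenvalue

def project(centered, component):
    """Scores: coordinates of each compound along the component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]

# Toy data: two perfectly correlated descriptors (illustrative only).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(data)
pc1, variance = first_component(covariance_matrix(centered))
scores = project(centered, pc1)
print(pc1, variance, scores)
```

A 3D score plot such as those in the figures would need the first three components, which a full eigendecomposition provides directly; descriptors are also commonly scaled to unit variance first so that no single property dominates.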

Figure 2.

3D visualization of the chemical space of natural product databases.

Figure 3.

3D visualization of the chemical space of synthetic compounds.

Figure 4.

3D visualization of the chemical spaces of natural products and GNF DBs.

Figure 5.

3D visualization of the chemical spaces of natural products and TCMDC DBs.

Figure 6.

3D visualization of the chemical spaces of natural products and DBK DBs.

Figure 7.

3D visualization of the chemical spaces of natural products, OSM and St. Jude.

Figure 8.

3D visualization of the chemical spaces of all databases.

Figure 9.

Curve for cumulative frequency distribution (CFD) based on ECFP-4.

Figures 2–8 show that NPs, drugs, and synthetic compounds occupy, in general, similar regions of chemical space and overlap in most of the evaluated databases.

2.4 Molecular diversity based on fingerprints

Three binary molecular fingerprints were calculated with the rcdk package in RStudio: extended connectivity fingerprints with diameter 4 (ECFP-4) for similarity searching, molecular access system (MACCS) keys of 166 bits for determining similarity and molecular diversity, and PubChem keys of 881 bits for encoding molecular fragment information [42, 43, 44]. The fingerprint similarity of each pair of compounds was calculated with the Tanimoto coefficient, and the resulting distributions were analyzed with the cumulative distribution function (CDF). This approach has been used to calculate, measure, and represent the molecular diversity of compound data sets [23].
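The pairwise analysis described above can be sketched as follows. For two binary fingerprints with on-bit sets A and B, the Tanimoto coefficient is |A ∩ B| / |A ∪ B|. The Python example below (not the rcdk workflow used in this study) computes all pairwise similarities for a toy set and the empirical CDF of those values; the fingerprints are invented on-bit sets and the function names are ours.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    return len(a & b) / union

def pairwise_similarities(fingerprints):
    """Similarity of every unordered pair of fingerprints."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]

def empirical_cdf(values):
    """List of (value, fraction of pairs with similarity <= value)."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

# Toy fingerprints (hypothetical on-bit positions).
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
sims = pairwise_similarities(fps)
cdf = empirical_cdf(sims)
print(sims)
print(cdf)
```

When CDF curves of several data sets are compared, a curve that rises earlier (shifted toward low similarity values) corresponds to a more diverse data set.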

Figures 9, 10, 11 show the CDFs of the pairwise similarity of the different data sets evaluated with the Tanimoto coefficient and ECFP-4, MACCS keys, and PubChem fingerprints, respectively.

Figure 10.

Curve for cumulative frequency distribution based on MACCS keys.

Figure 11.

Curve for cumulative frequency distribution based on PubChem.

Figure 12.

Cyclic system retrieval curves for all databases evaluated in this study.

Figures 9, 10, 11 provide information on the structural diversity of the six databases. A similar approach has been published previously [23]. ECFP-4 did not prove to be a suitable fingerprint representation for these data sets. The three fingerprint-based similarity plots show that the databases of natural products with antimalarial activity, OSM, and MMV have the lowest molecular diversity, while the GSK DB was the most diverse.

Tables 2–4 show the summary statistics of the pairwise Tanimoto similarity for the data sets analyzed. The CHEMBL and DrugBank databases are excluded from these tables because they contain too few compounds.

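Similarity ECFP-4/Tanimoto coefficient

| DBs | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
| --- | --- | --- | --- | --- | --- | --- |
| GSK | 0.01724 | 0.05789 | 0.08844 | 0.11490 | 0.12245 | 0.82353 |
| NPs | 0.00000 | 0.07826 | 0.09910 | 0.10565 | 0.12389 | 1.00000 |
| OSM | 0.00000 | 0.07826 | 0.09917 | 0.10607 | 0.12397 | 1.00000 |
| MMV | 0.00000 | 0.07826 | 0.09924 | 0.10615 | 0.12403 | 1.00000 |
| ST JUDE | 0.00000 | 0.08197 | 0.10345 | 0.10980 | 0.12857 | 1.00000 |
| GNF | 0.00000 | 0.08209 | 0.10345 | 0.10772 | 0.12739 | 1.00000 |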

Table 2.

The statistical values of the similarity of the Tanimoto coefficient with ECFP-4.

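Similarity MACCS keys/Tanimoto coefficient

| DBs | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
| --- | --- | --- | --- | --- | --- | --- |
| GSK | 0.07813 | 0.25682 | 0.33333 | 0.37009 | 0.45581 | 0.92683 |
| NPs | 0.00000 | 0.34426 | 0.43636 | 0.44673 | 0.54545 | 1.00000 |
| OSM | 0.00000 | 0.34483 | 0.43636 | 0.44693 | 0.54545 | 1.00000 |
| MMV | 0.00000 | 0.34483 | 0.43636 | 0.44677 | 0.54412 | 1.00000 |
| ST JUDE | 0.00000 | 0.33333 | 0.41250 | 0.42313 | 0.50000 | 1.00000 |
| GNF | 0.00000 | 0.31746 | 0.39437 | 0.39999 | 0.47619 | 1.00000 |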

Table 3.

The statistical values of the similarity of the Tanimoto coefficient with MACCS keys.

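Similarity PubChem/Tanimoto coefficient

| DBs | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
| --- | --- | --- | --- | --- | --- | --- |
| GSK | 0.08125 | 0.24500 | 0.37555 | 0.40263 | 0.54002 | 1.00000 |
| NPs | 0.03684 | 0.32298 | 0.43802 | 0.46184 | 0.58621 | 1.00000 |
| OSM | 0.03684 | 0.32340 | 0.43902 | 0.46253 | 0.58730 | 1.00000 |
| MMV | 0.03684 | 0.32444 | 0.44033 | 0.46321 | 0.58791 | 1.00000 |
| ST JUDE | 0.03684 | 0.38224 | 0.47143 | 0.47624 | 0.56195 | 1.00000 |
| GNF | 0.00000 | 0.40598 | 0.48117 | 0.47800 | 0.55446 | 1.00000 |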

Table 4.

The statistical values of the similarity of the Tanimoto coefficient with PubChem.

2.5 Molecular scaffolds: content and diversity

2.5.1 Scaffold content

Murcko scaffolds were calculated with the program Molecular Equivalence Indices (MEQI) [50, 51] and the DataWarrior program [69]. MEQI has been used to obtain the codes corresponding to the most frequent chemotypes in the databases [23, 45, 52, 53, 54, 55]. The distribution and diversity of the molecular scaffolds present in the data sets were calculated and analyzed using cyclic system retrieval (CSR) curves [42]. These curves were obtained by plotting the fraction of scaffolds against the fraction of compounds that contain those cyclic systems [43, 44].
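A CSR curve and its two summary metrics can be sketched as follows. Scaffolds are ranked from most to least populated; the x-axis is the fraction of scaffolds considered and the y-axis is the cumulative fraction of compounds they recover. The Python example below uses invented scaffold labels, and the F50 convention (the first curve point that reaches 50% recovery, without interpolation) is our assumption; other conventions give slightly different values.

```python
from collections import Counter

def csr_curve(scaffold_ids):
    """Points (fraction of scaffolds, fraction of compounds recovered)."""
    counts = sorted(Counter(scaffold_ids).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), sum(counts)
    points = [(0.0, 0.0)]
    recovered = 0
    for rank, count in enumerate(counts, start=1):
        recovered += count
        points.append((rank / n_scaffolds, recovered / n_compounds))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def f50(points):
    """Smallest fraction of scaffolds recovering at least 50% of compounds."""
    for x, y in points:
        if y >= 0.5:
            return x
    raise ValueError("curve never reaches 50% recovery")

# Toy data set: one scaffold label per compound (hypothetical labels).
scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
curve = csr_curve(scaffolds)
print(curve)       # [(0.0, 0.0), (0.25, 0.5), (0.5, 0.8), (0.75, 0.9), (1.0, 1.0)]
print(auc(curve))  # about 0.675
print(f50(curve))  # 0.25
```

With this construction, an AUC close to 0.5 (a curve near the diagonal) and a high F50 indicate that compounds are spread evenly over many scaffolds, whereas an AUC approaching 1.0 and a low F50 indicate that a few scaffolds account for most compounds.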

Table 5 indicates that, taking the F50 values as reference, the MMV DB (0.491) was the most diverse in scaffold content compared with the data sets from GSK (0.183), NPs (0.168), and GNF (0.161). The CSR curves in Figure 12 further confirm the relative scaffold diversity of the eight databases. The area under the curve (AUC) associated with each CSR curve is reported in Table 5. The CSR curves showed that MMV has the most diverse scaffold content, with an AUC value of 0.507. In contrast, OSM, NPs, GNF, GSK, St. Jude, and CHEMBL were the least diverse (AUC scores of 0.745, 0.712, 0.705, 0.698, 0.655, and 0.607, respectively). The CSR curves provide information on the diversity of the most frequent scaffolds in all databases.

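| DBs | Number of compounds (M) | Unique chemotypes (N) | FN/M | NSING | FNSING/M | FNSING/N | AUC | F50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NPs | 1298 | 629 | 0.4846 | 400 | 0.3082 | 0.6359 | 0.7125 | 0.1685 |
| DBK | 5 | 5 | 1.0000 | 5 | 1.0000 | 1.0000 | 0.4800 | 0.4000 |
| CHEMBL | 24 | 18 | 0.7500 | 16 | 0.6667 | 0.8889 | 0.6072 | 0.3333 |
| OSM | 89 | 39 | 0.4382 | 27 | 0.3034 | 0.6923 | 0.7453 | 0.1025 |
| MMV | 124 | 122 | 0.9839 | 120 | 0.9677 | 0.9836 | 0.5079 | 0.4918 |
| St. JUDE | 915 | 479 | 0.5235 | 325 | 0.3552 | 0.6785 | 0.6551 | 0.2474 |
| GNF | 4860 | 3229 | 0.6644 | 2690 | 0.5535 | 0.8331 | 0.7054 | 0.1615 |
| GSK | 12,463 | 6703 | 0.5378 | 5009 | 0.4019 | 0.7473 | 0.6982 | 0.1837 |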

Table 5.

Summary of the scaffold diversity of the eight databases analyzed in this work.

M = number of molecules in the DB, N = number of chemotypes or substructures, FN/M = fraction of chemotypes relative to molecules (N/M), NSING = number of singletons, FNSING/M = fraction of singletons among all molecules, FNSING/N = fraction of singletons among all chemotypes, AUC = area under the curve, F50 = fraction of chemotypes required to recover 50% of the molecules.
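The three fractions defined above follow directly from the counts in Table 5. As a quick arithmetic check, the Python sketch below recomputes them for the NPs row (M = 1298, N = 629, NSING = 400); the function name is ours and is not part of MEQI or DataWarrior.

```python
def scaffold_fractions(m, n, n_sing):
    """Chemotype and singleton fractions from raw counts.

    m: number of molecules, n: number of chemotypes,
    n_sing: number of singletons (chemotypes with a single compound).
    """
    return {
        "FN/M": n / m,
        "FNSING/M": n_sing / m,
        "FNSING/N": n_sing / n,
    }

# Counts taken from the NPs row of Table 5.
nps = scaffold_fractions(m=1298, n=629, n_sing=400)
rounded = {key: round(value, 4) for key, value in nps.items()}
print(rounded)  # {'FN/M': 0.4846, 'FNSING/M': 0.3082, 'FNSING/N': 0.6359}
```

The recomputed values agree with the tabulated ones, which confirms how the columns of the table are read.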

Figure 13.

Scaled Shannon entropy of the most frequent scaffolds with values ranging from 10 to 40 in natural products.

2.5.2 Shannon entropy (SE) and scaled Shannon entropy (SSE)

The Shannon entropy (SE) has been adapted to measure scaffold diversity based on the N most populated scaffolds [70]. The scaled Shannon entropy (SSE) is a normalized value that measures how evenly the compounds are distributed among the most common chemotypes in a database. Thus, an SSE closer to 1 indicates higher scaffold diversity, while an SSE closer to 0 indicates lower diversity. In this study, we calculated the SSE for values ranging from N = 10 to N = 40.
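In the usual formulation, SE = -Σ p_i log2(p_i), where p_i is the fraction of compounds (among those containing the N scaffolds considered) that carry scaffold i, and SSE = SE / log2(N). The Python sketch below implements these two formulas; the scaffold populations are invented for illustration and the function names are ours.

```python
import math

def shannon_entropy(counts):
    """SE = -sum(p_i * log2(p_i)) over the scaffold populations."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def scaled_shannon_entropy(counts):
    """SSE = SE / log2(N), where N is the number of scaffolds considered."""
    n = len(counts)
    if n < 2:
        raise ValueError("at least two scaffolds are needed")
    return shannon_entropy(counts) / math.log2(n)

# Hypothetical populations of the N = 4 most frequent scaffolds.
even = [5, 5, 5, 5]      # compounds spread evenly over the scaffolds
skewed = [97, 1, 1, 1]   # one scaffold dominates

print(scaled_shannon_entropy(even))    # close to 1.0 (maximum diversity)
print(scaled_shannon_entropy(skewed))  # close to 0 (low diversity)
```

Dividing by log2(N) makes values comparable across different choices of N, which is why the SSE rather than the raw SE is reported for N = 10 to N = 40.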

Figure 13 shows a histogram with the distribution of the 40 most populated scaffolds in NPAs. The histogram includes the corresponding chemotype code. The comparison of the scaffolds of the NPAs allowed the identification of the 68MBD chemotype as one of the most active compounds in this database.

Figure 14.

Distribution of the fraction of sp3 hybridized carbons in different databases.

2.5.3 Molecular complexity and flexibility

The structural descriptors fraction of sp3 hybridized carbons (Fsp3) [23, 58, 63, 70], fraction of chiral centers (FCC) [23, 59, 63, 70], fraction of aromatic atoms (Faro-atm), globularity [60], principal moments of inertia (PMI), normalized principal moments of inertia ratio (NRP) [61, 62], molecular complexity, Kier shape index, and molecular flexibility were calculated with the DataWarrior program [69] and MOE 2018.0101 [64]. Figures 14–19 show the descriptors used to evaluate molecular complexity and flexibility.

Figure 15.

Distribution of the fraction of chiral centers in different databases.

Figure 16.

Distribution of the fraction of aromatic atoms (Faro-atm) in different databases.

Figure 17.

Shape index distribution of different databases.

Figure 18.

Distribution of the molecular flexibility in different databases.

Figure 19.

Distribution of the molecular complexity in different databases.

Figure 20.

Structural similarity compared with activity cliffs in NPAs.

Tables 6–8 summarize the statistics of the distributions of Fsp3, FCC, and Faro-atm for NPs and the reference data sets. These results indicate that the NP data set has the greatest molecular complexity in terms of Fsp3 (0.63) and FCC (0.16) and a low distribution of Faro-atm (0.67–0.78). In contrast, the GNF, MMV, St. Jude, and GSK DBs are very similar in these three metrics, with values between 0.25 and 0.37, 0.27 and 0.37, and 0.014 and 0.025, respectively. Structural flexibility was also evaluated with the shape index: all databases fall in the range of 0.41–0.58, indicating that many of the compounds have intermediate sphericity and molecular flexibility (data not presented).

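Fraction of sp3 hybridized atoms (Fsp3)

| DBs | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Std. dev. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NPs | 0.000 | 0.481 | 0.636 | 0.656 | 0.833 | 2.000 | 0.254 |
| CHEMBL | 0.167 | 0.342 | 0.536 | 0.621 | 0.627 | 1.333 | 0.374 |
| MMV | 0.000 | 0.167 | 0.300 | 0.316 | 0.402 | 0.800 | 0.190 |
| OSM | 0.000 | 0.174 | 0.255 | 0.277 | 0.338 | 0.893 | 0.145 |
| DBK | 0.250 | 0.438 | 0.519 | 0.463 | 0.545 | 0.565 | 0.175 |
| GNF | 0.000 | 0.227 | 0.364 | 0.377 | 0.500 | 2.667 | 0.207 |
| STJUDE | 0.000 | 0.222 | 0.333 | 0.353 | 0.471 | 1.136 | 0.178 |
| GSK | 0.000 | 0.250 | 0.375 | 0.372 | 0.500 | 1.500 | 0.180 |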

Table 6.

Distribution of Fsp3 in different databases.

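Fraction of chiral centers (FCC)

| DBs | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Std. dev. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NPs | 0.000 | 0.033 | 0.139 | 0.161 | 0.267 | 0.656 | 0.145 |
| CHEMBL | 0.000 | 0.000 | 0.036 | 0.128 | 0.141 | 0.533 | 0.192 |
| MMV | 0.000 | 0.000 | 0.000 | 0.014 | 0.000 | 0.111 | 0.028 |
| OSM | 0.000 | 0.000 | 0.000 | 0.008 | 0.000 | 0.286 | 0.035 |
| DBK | 0.000 | 0.000 | 0.019 | 0.020 | 0.040 | 0.043 | 0.024 |
| GNF | 0.000 | 0.000 | 0.000 | 0.025 | 0.040 | 0.556 | 0.053 |
| STJUDE | 0.000 | 0.000 | 0.000 | 0.024 | 0.045 | 0.217 | 0.037 |
| GSK | 0.000 | 0.000 | 0.000 | 0.017 | 0.034 | 0.500 | 0.033 |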

Table 7.

Distribution of FCC in different databases.

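Fraction of aromatic atoms (Faro-atm)

| DBs | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Std. dev. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NPs | 0.000 | 0.000 | 0.324 | 0.341 | 0.600 | 1.133 | 0.294 |
| CHEMBL | 0.000 | 0.299 | 0.556 | 0.509 | 0.690 | 1.091 | 0.321 |
| MMV | 0.261 | 0.682 | 0.826 | 0.817 | 0.956 | 1.429 | 0.230 |
| OSM | 0.000 | 0.677 | 0.733 | 0.786 | 0.860 | 1.500 | 0.232 |
| DBK | 0.538 | 0.591 | 0.733 | 0.720 | 0.862 | 0.875 | 0.171 |
| GNF | 0.000 | 0.522 | 0.667 | 0.670 | 0.818 | 1.714 | 0.235 |
| STJUDE | 0.000 | 0.553 | 0.712 | 0.708 | 0.857 | 1.556 | 0.216 |
| GSK | 0.000 | 0.571 | 0.706 | 0.713 | 0.857 | 1.400 | 0.208 |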

Table 8.

Distribution of fraction of aromatic atoms.

The descriptors globularity, PMI, and NRP did not prove to be suitable metrics to measure and differentiate molecular complexity in the data sets evaluated, because the corresponding values computed for all data sets were very low (close to zero) and did not differentiate the data sets (data not shown). The high molecular complexity measured for NPs agrees with previous studies using similar metrics [23, 63, 71].

3. Activity landscape modeling

Property landscape modeling (PLM) lies at the interface between experimental sciences and computational chemistry and is a frequent strategy to systematically describe the structure-property relationships (SPR) of a compound data set [72]. PLM has been used in medicinal chemistry during drug discovery to provide a quantitative, descriptive, and statistical approach to activity cliffs [72, 73, 74]. The study of structure-activity relationships (SARs) through activity landscape modeling (ALM) is an increasingly common practice in the drug discovery process to identify activity cliffs, guide the optimization of compound hits, and avoid the deleterious effects of activity cliffs on classic QSAR models and on structural similarity searching. In this work we use the web tool Activity Landscape Plotter (ALP) [72] to analyze a data set of NPs from Panama with antimalarial activity against four strains of Plasmodium falciparum in the erythrocyte gametocyte stage (Figures 20 and 24).

Figure 21.

Structural similarity compared with activity cliffs in GSK and Novartis (GNF).

Figure 22.

SAS maps of compounds with antimalarial activity ((a), (b), and (c)) through the web tool activity landscape plotter.

Structure-activity pairs were generated and compared using structure-activity similarity (SAS) maps. The SAS map links structure and biological activity through a systematic pairwise comparison of all the compounds in a data set. We compare the structural similarity, the activity difference, and the structure-activity landscape index (SALI) to find the pairs of compounds with high molecular similarity and a large activity difference, which are located in the upper right quadrant of the SAS map (activity cliffs) [72, 73, 74, 75, 76]. Figures 20, 21, and 22 show SAS maps for NPs from Panama, published NPs, GSK, and GNF. In the SAS maps, data points are colored by density (Figure 22).
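The quantities plotted in a SAS map can be sketched in Python as follows. For each pair of compounds i and j, the structural similarity is the Tanimoto coefficient of their fingerprints, the activity difference is |pIC50_i - pIC50_j|, and SALI is the activity difference divided by (1 - similarity). This is an independent illustration, not the code of the Activity Landscape Plotter; the compound names, fingerprints, activities, and the two thresholds that define the cliff quadrant are all assumptions made for the example.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def sas_pairs(compounds):
    """Rows (name_i, name_j, similarity, activity difference, SALI) for all pairs.

    compounds: dict mapping a name to (fingerprint set, pIC50).
    """
    rows = []
    for (n1, (fp1, act1)), (n2, (fp2, act2)) in combinations(compounds.items(), 2):
        sim = tanimoto(fp1, fp2)
        diff = abs(act1 - act2)
        if sim < 1.0:
            sali = diff / (1.0 - sim)
        else:
            # SALI is undefined for identical fingerprints; flag such pairs.
            sali = float("inf") if diff > 0 else 0.0
        rows.append((n1, n2, sim, diff, sali))
    return rows

def activity_cliffs(rows, sim_threshold=0.7, diff_threshold=1.0):
    """Pairs in the high-similarity, high-activity-difference quadrant.

    Both thresholds are illustrative assumptions.
    """
    return [r for r in rows if r[2] >= sim_threshold and r[3] >= diff_threshold]

# Hypothetical compounds: (on-bit set, pIC50).
compounds = {
    "cpd_a": ({1, 2, 3, 4, 5, 6, 7, 8}, 7.5),
    "cpd_b": ({1, 2, 3, 4, 5, 6, 7, 8, 9}, 5.0),
    "cpd_c": ({20, 21, 22}, 5.2),
}
rows = sas_pairs(compounds)
for row in activity_cliffs(rows):
    print(row)  # only the cpd_a / cpd_b pair qualifies
```

Here cpd_a and cpd_b differ by a single bit yet by 2.5 pIC50 units, which is the pattern that an activity cliff describes; their SALI value is correspondingly large.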

Figure 23.

DAS map with MACCS key fingerprint.

The SAS maps built with the ECFP-4, MACCS keys, and PubChem molecular fingerprints led to the identification of a total of 26 pairs of compounds with structure-activity similarity ratios >0.50 and structure-activity landscape index values between 0.3 and 5.0. The web application Activity Landscape Plotter [72] is a tool that allows us to perform this kind of QSAR analysis. The SAS maps generated represent 55 natural products isolated in Panama with antimalarial activity, whose biological activities against sensitive, resistant, and multiresistant strains of Plasmodium falciparum were analyzed and compared. In the analysis with the SAS map parameters (Tanimoto index and ECFP-4), a total of twenty-six pairs of compounds showed similarity values greater than 70%, sixteen pairs greater than 80%, and only two pairs greater than 85%. In terms of activity cliffs, only three pairs of compounds showed structural similarity correlated with the pIC50 activity values [72, 77].

The SAS maps are color-coded by density, and most pairs of compounds with antimalarial activity appear in intense red. Most of the pairs analyzed are located in the region of low structural similarity, indicating that the natural products have high structural diversity and small differences in activity, which is attributed to their having similar functional groups.

DAS maps represent the pairwise activity differences for each possible pair of compounds in an evaluated data set against two biological targets. These maps make it possible to determine whether a structural modification increases or decreases the activity against one target or the other (Figure 23).
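A minimal Python sketch of the data behind such a map is given below. For every pair of compounds it computes the signed activity difference against each of two targets and assigns the pair to a region of the map. The compound names, pIC50 values, region labels, and the threshold of one log unit are assumptions made for the example, not settings taken from the Activity Landscape Plotter.

```python
from itertools import combinations

def dual_activity_differences(activities):
    """Rows (name_i, name_j, delta target 1, delta target 2) for all pairs.

    activities: dict mapping a name to (pIC50 on target 1, pIC50 on target 2).
    """
    rows = []
    for (n1, (a1, b1)), (n2, (a2, b2)) in combinations(activities.items(), 2):
        rows.append((n1, n2, a1 - a2, b1 - b2))
    return rows

def classify(delta1, delta2, threshold=1.0):
    """Assign a pair to a region of the map (threshold is an assumption)."""
    big1 = abs(delta1) >= threshold
    big2 = abs(delta2) >= threshold
    if big1 and big2:
        return "same direction" if delta1 * delta2 > 0 else "opposite direction"
    if big1:
        return "target 1 only"
    if big2:
        return "target 2 only"
    return "little change"

# Hypothetical pIC50 values against two targets (e.g., two parasite strains).
activities = {
    "cpd_x": (7.0, 6.8),
    "cpd_y": (5.5, 5.4),
    "cpd_z": (7.1, 5.0),
}
for n1, n2, d1, d2 in dual_activity_differences(activities):
    print(n1, n2, round(d1, 2), round(d2, 2), classify(d1, d2))
```

Pairs labeled "same direction" correspond to structural changes that affect both targets alike, while "target 1 only" and "target 2 only" point to changes that alter selectivity.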

Figure 24.

Antimalarial compounds in NPs from Panama.

With this web application, we carried out a QSAR study in a fast, simple, and easily interpretable way and obtained three natural products as computational lead compounds for optimization as Plasmodium falciparum blockers with gametocidal activity [78] (Figure 24).

4. Conclusion

The chemoinformatic analysis of the 20,364 compounds (1312 NPs and 19,052 synthetic compounds from MMV, OSM, GNF, St. Jude, GSK, CHEMBL, and DrugBank) indicates that natural products and synthetic products share the same chemical space, with molecules that have similar structural properties. NPs show greater fingerprint-based diversity than the synthetic compounds. NPs also have a higher proportion of chiral carbons and sp3-hybridized atoms and greater complexity, while synthetic products contain a greater proportion of aromatic atoms. Finally, the properties related to cyclicity, relative shape, and flexibility have very similar values across all data sets, which could explain the antimalarial activity of the computationally determined compound hits in this work against Plasmodium falciparum-sensitive (3D7, D6, poW, D10) and chloroquine-resistant strains (W2, Dd).

Acknowledgments

DAO acknowledges the SNI 2018 awards from SENACYT of Panama.

Conflict of interest

The authors declare that there are no financial or commercial conflicts of interest.

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Dionisio A. Olmedo and José L. Medina-Franco (July 31st 2019). Chemoinformatic Approach: The Case of Natural Products of Panama [Online First], IntechOpen, DOI: 10.5772/intechopen.87779. Available from:
