Databases analyzed with chemoinformatic tools.
Chemoinformatic analysis was used to characterize a compound database of natural products from Panama and other reference collections. Data mining allowed us to compare drug-likeness properties using public and commercial software and to perform a statistical analysis of the physicochemical properties. Visualization of the chemical space in 3D indicates a high structural similarity. Molecular flexibility and complexity were evaluated using 2D descriptors, and molecular scaffolds were obtained using the Murcko method; both analyses showed few differences among the explored data sets. In this chapter, we also present and discuss an example of the application of the chemoinformatic approach, using activity landscape modeling to study the structure-activity relationships (SARs) of compounds with activity against Plasmodium falciparum.
- data mining
- physicochemical properties
Natural products (NPs) and their derivatives constitute a significant fraction of approved drugs [1, 2, 3], bioactive compounds [4, 5, 6, 7, 8], and lead compounds for drug discovery. NP fragments have been used to guide the synthesis of bioactive compounds and to generate BIOS combinatorial libraries [10, 11, 12, 13, 14, 15]. NPs have structures with different substituent patterns, giving rise to different biological activities for compounds with very similar structures [16, 17, 18, 19]. These bioactive metabolites have greater affinity for biological targets and, overall, may have better bioavailability than synthetic compounds; pan-assay interference compounds (PAINS) are also less frequent in this type of product. Chemoinformatic analyses of several NP databases developed by academic institutions and private companies have been carried out in different countries, covering the following databases: BIOFACQUIM, CIFPMA, NuBBE [24, 25], NANPDB, TCM, HIT, and NPACT. The application of chemoinformatic tools involves the generation, manipulation, and analysis of data sets of chemical substances. Through mathematical calculations, this allows structural information to be ordered, developed, evaluated, and visualized in 2D and 3D. Physicochemical properties have been determined for different NP databases, and principal component analysis (PCA) has been used as an approximation to display their chemical spaces [22, 23, 24, 31, 32, 33, 34, 35, 36, 37].
Computational exploration of NPs has increased in recent years, giving greater relevance to studies that include structural diversity metrics calculated with distance-based parameters such as the Euclidean, Manhattan, and cosine distances. Other criteria are based on circular fingerprints (ECFP-4, ECFP-6) [22, 23, 24, 38, 39, 40, 41, 42, 43, 44, 45] and substructure-based fingerprints (MACCS, PubChem) [22, 23, 24, 39, 40, 41, 42, 43, 44, 45]. Another metric used for NPs is similarity comparison using the Tanimoto coefficient (Tanimoto index) [22, 23, 24, 45, 46, 47, 48, 49].
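The distance-based metrics named above can be illustrated on descriptor vectors. The following is a minimal sketch in plain Python, not the implementation used in the cited studies; the function names and the idea of applying them directly to small numeric vectors are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

In practice, descriptors with very different scales (for example MW and logP) are usually standardized before Euclidean or Manhattan distances are computed, so that no single property dominates the result.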
In this study, the molecular scaffolds of natural products were obtained using the Murcko method [22, 23, 24, 50, 51, 52, 53, 54, 55, 56, 57]. Molecular complexity, in turn, is frequently evaluated with 2D descriptors such as the fraction of sp3 hybridized carbons (Fsp3), the fraction of chiral centers (FCC), and globularity [22, 23, 24, 58, 59, 60, 61, 62, 63].
An updated version of the Natural Products Database of the University of Panama (UPMA), containing 454 compounds (unpublished data), has been evaluated in several bioassays: cytotoxicity in cell lines, in vitro antifungal activity, activity against parasites that cause tropical diseases (Leishmania sp., Plasmodium falciparum, and Trypanosoma cruzi), and activity against the HIV-1 virus, where the compounds showed an inhibitory effect on protease, reverse transcriptase, nuclear factor NF-κB, and Tat protein, thereby affecting viral replication. These are the most significant biological targets against which the natural products from Panama show bioactivity. Their biological activities are represented as percentages in Figure 1.
2. Application of chemoinformatics to antimalarial databases: the case of natural products from Panama
2.1 Preparation, curation, and processing of data sets
In this chapter, we present a chemoinformatic analysis of natural products with in vitro antimalarial activity, expressed as pIC50 against sensitive and resistant strains. Databases of natural products with antimalarial activity (NPAs) were constructed in-house by reviewing published articles and including those compounds that were isolated and characterized by nuclear magnetic resonance spectroscopy. A total of 1312 NPAs were compared with the reference data sets listed in Table 1: the open database DrugBank (antimalarial drugs), European Bioinformatics Institute (CHEMBL drug indications, antimalarial activities), Open Source Drug Discovery (OSDD) Malaria, Malaria Box (Medicines for Malaria Venture (MMV)), St. Jude Children’s Research Hospital (St. Jude), Novartis (GNF Malaria Box), and GlaxoSmithKline (GSK) Tres Cantos antimalarial set. All data sets were curated using the “Wash” function implemented in the Molecular Operating Environment (MOE 2018.0101) software. The structures of the studied compounds were represented in simplified molecular input line entry system (SMILES) notation, yielding 20,364 unique molecules, which are summarized in Table 1. The difference between initial and unique compounds arises because, during data preparation (curation), duplicate compounds are eliminated, charged species are neutralized by adjusting their protonation states, metals are disconnected, and the energy is minimized using the Merck molecular force field (MMFF94). Data curation therefore reduces the initial number of molecules in the databases evaluated in this work.
|Databases||Initial compounds||Unique compounds||Source|
|Natural Products Antimalarial (NPAs)||1353||1312||In-house NP databases|
|DrugBank version 5.0 (antimalarial drugs)||26||4|| |
|European Bioinformatics Institute (CHEMBL drug indications, antimalarial activities)||27||24|| |
|Open Source Drug Discovery (OSDD) Malaria||93||88|| |
|Malaria Box, Medicines for Malaria Venture (MMV)||124||124|| |
|St. Jude Children’s Research Hospital||1478||1478|| |
|Novartis-GNF Malaria Box||4878||4868||Available in:|
|GlaxoSmithKline Tres Cantos Antimalarial||12,470||12,466||Open Source Malaria (GSK-TCMDC). Available in:|
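The pIC50 scale used for the antimalarial activities in this section is conventionally the negative base-10 logarithm of the IC50 expressed in mol/L. The chapter does not spell out the conversion, so the sketch below assumes that convention; the function name and the unit table are illustrative only.

```python
import math

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

def pic50_from_ic50(ic50, unit="uM"):
    """Return pIC50 = -log10(IC50 in mol/L)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])
```

Under this convention, an IC50 of 1 µM corresponds to a pIC50 of about 6 and an IC50 of 10 nM to about 8, so larger pIC50 values indicate more potent compounds.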
2.2 Molecular descriptors
The physicochemical descriptors, namely hydrogen bond acceptors (HBAs), hydrogen bond donors (HBDs), number of rotatable bonds (NRBs), the octanol/water partition coefficient (logP), topological polar surface area (TPSA), and molecular weight (MW), together with others such as molar refractivity, are important parameters for quantitative structure-activity relationship (QSAR) analysis. These molecular descriptors underlie Lipinski’s and Veber’s rules for predicting the drug-likeness of orally active compounds [65, 66, 67]. The statistical analysis of the physicochemical properties was performed with RStudio 1.0.136 (AGPL).
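As an illustration of how these six descriptors feed the two rules, the sketch below counts Lipinski violations and applies the Veber criteria to precomputed descriptor values. It assumes the commonly quoted thresholds (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10 for Lipinski; NRB ≤ 10 and TPSA ≤ 140 Å² for Veber); the class and function names are hypothetical, and the descriptors themselves would come from a tool such as MOE or DataWarrior.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    mw: float    # molecular weight (g/mol)
    logp: float  # octanol/water partition coefficient
    hbd: int     # hydrogen bond donors
    hba: int     # hydrogen bond acceptors
    nrb: int     # number of rotatable bonds
    tpsa: float  # topological polar surface area (square angstroms)

def lipinski_violations(d):
    """Count how many of the four Lipinski criteria are exceeded."""
    return sum([d.mw > 500, d.logp > 5, d.hbd > 5, d.hba > 10])

def passes_lipinski(d, max_violations=1):
    """Assumed convention: at most one violation is tolerated."""
    return lipinski_violations(d) <= max_violations

def passes_veber(d):
    """Veber criteria on flexibility and polarity."""
    return d.nrb <= 10 and d.tpsa <= 140
```

Returning the violation count rather than a single boolean keeps the result usable for the kind of per-database statistics described in this section.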
2.3 3D visualization of chemical space of compounds with antimalarial activity
PCA was performed with the MOE software, and the dominant features, expressed as covariance, were visualized in the corresponding 2D or 3D score plots with the DataWarrior program v. 5.0. Figures 2, 3, 4, 5, 6, 7, 8 show the distribution of the different compounds with antimalarial activity in chemical space.
2.4 Molecular diversity based on fingerprints
Three binary molecular fingerprints were calculated with the R package rcdk: extended connectivity fingerprints with diameter 4 (ECFP-4) for similarity searching, molecular access system (MACCS) keys of 166 bits for determining similarity and molecular diversity, and PubChem keys of 881 bits for encoding molecular fragment information [42, 43, 44]. The fingerprint similarity between pairs of compounds was calculated with the Tanimoto coefficient and analyzed with the cumulative distribution function (CDF). This approach has been used to calculate, measure, and represent the molecular diversity of compound data sets.
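The pairwise Tanimoto comparison and its cumulative distribution can be sketched as follows. This is a simplified illustration in plain Python rather than the rcdk workflow used in the chapter: fingerprints are assumed to be given as sets of "on" bit positions, and the function names are hypothetical.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union

def pairwise_similarities(fingerprints):
    """Tanimoto similarity for every unordered pair of compounds."""
    return [tanimoto(x, y) for x, y in combinations(fingerprints, 2)]

def empirical_cdf(values, threshold):
    """Fraction of pairwise similarities that are <= threshold."""
    if not values:
        raise ValueError("no similarity values supplied")
    return sum(v <= threshold for v in values) / len(values)
```

Evaluating `empirical_cdf` over a grid of thresholds from 0 to 1 gives one CDF curve per data set. A curve that rises quickly at low similarity values indicates that most compound pairs are dissimilar, which corresponds to a more diverse data set.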
Figures 9, 10, 11 provide information on the structural diversity of the six databases. A similar approach has been previously published; ECFP-4 did not prove to be a suitable fingerprint representation for these data sets. The three fingerprint-based similarity graphs show that the database of natural products with antimalarial activity, OSM, and MMV have the lowest molecular diversity, while the GSK DB was the most diverse.
Tables 2–4 show the summary statistics of the pairwise Tanimoto similarity for the data sets analyzed. The CHEMBL and DrugBank databases are excluded from these tables because they contain too few compounds.
|Similarity ECFP-4/Tanimoto coefficient|
|DBs||Min.||1st Qu.||Median||Mean||3rd Qu.||Max.|
|Similarity MACCS keys/Tanimoto coefficient|
|DBs||Min.||1st Qu.||Median||Mean||3rd Qu.||Max.|
|Similarity PubChem/Tanimoto coefficient|
|DBs||Min.||1st Qu.||Median||Mean||3rd Qu.||Max.|
2.5 Molecular scaffolds: content and diversity
2.5.1 Scaffold content
Murcko scaffolds were calculated with the program Molecular Equivalence Indices (MEQI) [50, 51] and the DataWarrior program. MEQI has been used to obtain the codes corresponding to the chemotypes most frequently found in the databases [23, 45, 52, 53, 54, 55]. The distribution and diversity of the molecular scaffolds present in the data sets were analyzed using cyclic system retrieval (CSR) curves. These curves were obtained by plotting the fraction of scaffolds against the fraction of compounds that contain those cyclic systems [43, 44].
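A CSR curve and the two summary metrics derived from it can be sketched as below. The sketch assumes that each compound has already been assigned a scaffold identifier (for example a chemotype code), that scaffolds are ranked from most to least populated, that the AUC is taken with the trapezoid rule, and that F50 is the smallest fraction of scaffolds covering at least half of the compounds. The function names are hypothetical.

```python
from collections import Counter

def csr_curve(scaffold_ids):
    """Points (fraction of scaffolds, fraction of compounds), starting at (0, 0)."""
    counts = sorted(Counter(scaffold_ids).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), len(scaffold_ids)
    points, cumulative = [(0.0, 0.0)], 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        points.append((rank / n_scaffolds, cumulative / n_compounds))
    return points

def csr_auc(points):
    """Area under the CSR curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def csr_f50(points):
    """Smallest fraction of scaffolds that retrieves at least 50% of compounds."""
    for x, y in points:
        if y >= 0.5:
            return x
    return 1.0
```

With these definitions, a data set in which every compound has its own scaffold gives the diagonal (AUC of 0.5), whereas a data set dominated by a few scaffolds gives an AUC approaching 1 and a small F50. Lower AUC and higher F50 therefore indicate greater scaffold diversity.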
Taking the F50 values as reference, Table 5 indicates that the MMV DB (0.491) was the most diverse in scaffold content compared with the data sets from GSK (0.183), NPs (0.168), and GNF (0.161). The CSR curves in Figure 12 further confirm the relative scaffold diversity of the eight databases. The area under the curve (AUC) values associated with the CSR curves are reported in Table 5. The CSR curves showed that MMV has the greatest variety in scaffold content, with an AUC value of 0.507. In contrast, OSM, NPs, GNF, GSK, St. Jude, and CHEMBL were the least diverse (AUC scores of 0.745, 0.712, 0.705, 0.698, 0.655, and 0.607, respectively). The CSR curves provide information on the diversity of the most frequent scaffolds in all databases.
|DBs||Number of Compounds (M)||Unique chemotypes (N)||FN/M||NSING||FNSING/M||FNSING/NS||AUC||F50|
2.5.2 Shannon entropy (SE) and scaled Shannon entropy (SSE)
The Shannon entropy has been adapted to measure scaffold diversity based on the number (N) of most recurrent scaffolds. The scaled Shannon entropy is a normalized value that measures how evenly the compounds are distributed among the most common chemotypes present in a database. Thus, an SSE closer to 1 indicates higher scaffold diversity, while an SSE closer to 0 indicates lower diversity. In this study, we calculated the SSE for values ranging from N = 10 to N = 40.
Figure 13 shows a histogram with the distribution of the 40 most populated scaffolds in NPAs. The histogram includes the corresponding chemotype codes. The comparison of the scaffolds of the NPAs identified the 68MBD chemotype as one of the most active in this database.
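A minimal sketch of the scaled Shannon entropy is given below. It assumes the usual formulation in which the entropy is computed over the N most populated scaffolds, with each probability taken as the scaffold's share of the compounds in those N scaffolds, and is then divided by log2(N). The function name is hypothetical, and scaffold identifiers are assumed to be available per compound.

```python
import math
from collections import Counter

def scaled_shannon_entropy(scaffold_ids, n):
    """SSE over the n most populated scaffolds; 1 = even spread, 0 = one scaffold dominates."""
    top_counts = [count for _, count in Counter(scaffold_ids).most_common(n)]
    if len(top_counts) < 2:
        raise ValueError("at least two scaffolds are needed to scale the entropy")
    total = sum(top_counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in top_counts)
    # Normalize by the maximum entropy attainable with this many scaffolds.
    return entropy / math.log2(len(top_counts))
```

Calling the function for N = 10, 20, 30, and 40 on the same data set reproduces the kind of SSE profile described above. If a data set has fewer than N distinct scaffolds, this sketch scales by the number actually present, which is a design choice rather than something stated in the chapter.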
2.5.3 Molecular complexity and flexibility
The structural descriptors, namely the fraction of sp3 hybridized carbons (Fsp3) [23, 58, 63, 70], fraction of chiral centers (FCC) [23, 59, 63, 70], fraction of aromatic atoms (Faro-atm), globularity, principal moments of inertia (PMI), normalized principal moments of inertia ratio (NPR) [61, 62], molecular complexity, Kier shape index, and molecular flexibility, were calculated with the DataWarrior program and MOE 2018.0101. Figures 14, 15, 16, 17, 18, 19 show the descriptors used to evaluate molecular complexity and flexibility.
Tables 6–8 summarize the statistics of the distribution of Fsp3, FCC, and Faro-atm for NPs and the reference data sets. These results indicate that the NP data set has the highest molecular complexity in terms of Fsp3 (0.63) and FCC (0.16) and a low distribution of Faro-atm (0.67–0.78). In contrast, the GNF, MMV, St. Jude, and GSK DBs are very similar in these three metrics, with values between 0.25 and 0.37, 0.27 and 0.37, and 0.014 and 0.025, respectively. Structural flexibility was evaluated with the shape index, for which all databases fall in the range of 0.41–0.58, indicating that many of the compounds present sphericity and intermediate molecular flexibility (data not shown).
|Fraction of sp3 hybridized carbons (Fsp3)|
|Fraction of chiral centers (FCC)|
|Fraction of aromatic atoms (Faro-atm)|
The descriptors globularity, PMI, and NPR did not prove to be suitable metrics to measure and differentiate molecular complexity in the data sets evaluated, because the values computed for all data sets were very low (close to zero) and did not differentiate them (data not shown). The high molecular complexity measured for NPs agrees with previous studies using similar metrics [23, 63, 71].
3. Activity landscape modeling
Property landscape modeling (PLM), the modeling of landscapes based on compound properties, lies at the interface between experimental sciences and computational chemistry and is a frequent strategy to systematically describe the structure-property relationships (SPRs) of compound data sets. PLM has been used in the drug discovery stages of medicinal chemistry as a quantitative, descriptive, and statistical approach to activity cliffs [72, 73, 74]. The study of structure-activity relationships (SARs) through activity landscape modeling (ALM) is an increasingly common practice in the drug discovery process to identify activity cliffs, guide the optimization of compound hits, and avoid the deleterious effects of activity cliffs on classic QSAR models and on structural similarity searching. In this research, we analyze, with the web tool Activity Landscape Plotter (ALP), a data set of NPs from Panama with antimalarial activity against four strains of Plasmodium falciparum in the erythrocyte gametocyte stage (Figures 20 and 24).
Structure-activity pairs were generated and compared using structure-activity similarity (SAS) maps. The SAS map links structure and biological activity through a systematic pairwise comparison of all the compounds in a data set. We compare the values of structure-activity similarity, the activity difference, and the structure-activity landscape index (SALI) to find the pairs of compounds with high molecular similarity and high activity difference, which are located in the upper right quadrant of the SAS map (activity cliffs) [72, 73, 74, 75, 76]. Figures 17, 18, 19 show SAS maps for the NPs of Panama, published NPs, GSK, and GNF. In SAS maps, data points are colored by density (Figure 22).
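The SALI value for a pair of compounds is commonly defined as the absolute activity difference divided by one minus the structural similarity, so that highly similar pairs with large potency differences receive high scores. The sketch below assumes that definition, with activities as pIC50 values and similarity as a Tanimoto coefficient. The quadrant cutoffs (0.7 for similarity, 1.0 log unit for activity difference) and the function names are hypothetical choices for illustration, not the settings used in the chapter.

```python
import math

def sali(activity_i, activity_j, similarity):
    """Structure-activity landscape index for one compound pair."""
    if similarity >= 1.0:
        return math.inf  # identical fingerprints: the index is unbounded
    return abs(activity_i - activity_j) / (1.0 - similarity)

def sas_region(similarity, activity_diff, sim_cutoff=0.7, diff_cutoff=1.0):
    """Assign a pair to one of the four quadrants of a SAS map."""
    high_sim = similarity >= sim_cutoff
    high_diff = activity_diff >= diff_cutoff
    if high_sim and high_diff:
        return "activity cliff"   # upper right quadrant
    if high_sim:
        return "smooth SAR"       # lower right quadrant
    if high_diff:
        return "nondescript"      # upper left quadrant
    return "scaffold hop"         # lower left quadrant
```

Ranking all pairs by `sali` and inspecting those that `sas_region` labels as activity cliffs gives a short list of structural changes with a disproportionate effect on potency.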
The SAS maps based on the molecular fingerprints ECFP-4, MACCS keys, and PubChem led to the identification of a total of 26 pairs of compounds with structure-activity similarity ratios >0.50 and structure-activity landscape index values between 0.3 and 5.0. The web application Activity Landscape Plotter is a tool that allows us to perform QSAR. The SAS maps generated represent 55 natural products isolated in Panama with antimalarial activity, whose biological activities against sensitive, resistant, and multiresistant strains of Plasmodium falciparum were analyzed and compared. In the analysis with the SAS/Tanimoto index/ECFP-4 parameters, 26 pairs of compounds showed similarity values greater than 70%, 16 pairs greater than 80%, and only two pairs greater than 85%. Regarding activity cliffs, only three pairs of compounds show structural similarity correlated with the pIC50 activity values [72, 77].
SAS maps are color-coded according to intensity, and we observe that most pairs of compounds with antimalarial activity show an intense red color. The pairs analyzed are located in the region of low structural similarity, indicating that the natural products have high structural diversity and low differences in activity, which is attributed to similar functional groups in their molecules.
DAS maps represent the pairwise activity differences for each possible pair of compounds in a data set against two biological targets. These maps make it possible to determine whether a structural modification increases or decreases the activity against one target or the other (Figure 23).
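The data behind such a map can be sketched as a list of signed activity differences, one pair of values per compound pair. The sketch assumes that each compound has a pIC50 against two targets (for example two parasite strains) and that the difference is taken as first compound minus second; both the sign convention and the names used are assumptions for illustration.

```python
from itertools import combinations

def dual_activity_differences(activities):
    """activities maps compound name -> (pIC50 vs target 1, pIC50 vs target 2).

    Returns one tuple (name_i, name_j, diff_target1, diff_target2) per pair.
    """
    points = []
    for (name_i, (a1, b1)), (name_j, (a2, b2)) in combinations(activities.items(), 2):
        points.append((name_i, name_j, a1 - a2, b1 - b2))
    return points

def same_direction(diff_target1, diff_target2):
    """True when the structural change shifts both activities the same way."""
    return diff_target1 * diff_target2 > 0
```

Pairs for which `same_direction` is false are those where a modification improves activity against one target while reducing it against the other, which is the kind of selectivity information these maps are used to reveal.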
With this web application, we carried out a QSAR study in a fast, simple, and easily interpretable way, obtaining three natural products as computational lead compounds for optimization as Plasmodium falciparum blockers that exhibit gametocidal activity (Figure 24).
The chemoinformatic analysis of the 20,364 compounds (1312 NPs and 19,052 synthetic compounds from MMV, OSM, GNF, St. Jude, GSK, CHEMBL, and DrugBank) indicates that both natural products and synthetic products (S) share the same chemical space, with molecules that have similar structural properties. NPs present greater fingerprint-based diversity than the synthetic compounds. NPs also have a higher proportion of chiral carbons and sp3-hybridized atoms and greater complexity, while synthetic products contain a greater proportion of aromatic atoms. Finally, the properties related to cyclicity, relative shape, and flexibility all have very similar values, which could explain the antimalarial activity of the computationally determined compound hits in this work against Plasmodium falciparum-sensitive (3D7, D6, poW, D10) and chloroquine-resistant strains (W2, Dd).
The DAO acknowledges the SNI 2018 awards from SENACYT of Panama.
Conflict of interest
The authors declare that there are no financial or commercial conflicts of interest.