Molecular Electrostatic Potential and Chemometric Techniques as Tools to Design Bioactive Compounds

In this chapter, firstly, we briefly review aspects of the approximation of quantum chemistry, molecular electrostatic potential (MEP), and chemometrics techniques, which are accredited as important tools in the development of chemical science and are frequently used in the study and design of bioactive compounds. Ultimately, we use MEP and pattern recognition (PR) techniques as tools to design nitrofuran compounds with biological activity against Trypanosoma cruzi ( T. cruzi ). PR models (PCA, HCA, KNN, SDA, and SIMCA) were constructed and demonstrated that 23 nitrofurans can be classified into two classes or groups: more active and less active according to their degrees of activity against T. cruzi . Properties such as charge on the N atom of the nitro group (QN1); the difference between the highest occupied molecular orbital (HOMO) energy and the lowest unoccupied molecular orbital (LUMO) energy (GAP energy); molecular representation of structure based on electron diffraction code of signal 5, unweighted (Mor05u); and Moriguchi water–octanol partition coefficient (MlogP) are responsible for the classification into more active and less active studied nitrofurans. It is interesting to notice that these properties represent three distinct classes of interactions between the nitrofurans and the biological receptor: electronic (QN1 and GAP energy), steric (Mor05u), and hydrophobic (MlogP). The results of the application of PR models on the validation set evidenced two nitrofuran compounds (compounds 25 and 30 ) as more promising for synthesis and biological assays, which in the future can be used to validate our PR models.

Chemometrics is a discipline that collects mathematical, statistical, information theory, and computer science tools to deal with complex chemical data [19][20][21][22]. PR techniques were introduced in the chemistry, at the beginning of the 1970s, to analyze various types of spectroscopic data. Since then, PR became part of chemometrics and has been an excellent tool to aid in the interpretation of chemical data to obtain relevant information in different application sectors of chemical science [19,20]. PR techniques are especially useful for the classification of objects into discrete classes on the basis of measured features. A set of characteristic features of an object is considered as an abstract pattern that contains information about a not directly measured property of the object [19].
The MEP and PR techniques have been used as independent strategies in the study of active compounds and lead to the proposal of new molecules for synthesis and biological testing. The joint applications of these powerful tools were described carefully, to unravel the structure-activity relationship of bioactive compounds, consequently proposing new molecules. Therefore, a more intense exploration of its potentials is needed in order to design biologically active compounds.
The design of molecules with a desired property is one of the objectives of chemoinformatics. In this chapter, we present a study of the application of MEP and PR techniques to design nitrofuran compounds with potential activity against T. cruzi. In the first step of our study, MEP maps will be used in an attempt to identify the key structural features of nitrofuran compounds that are necessary for their activities and investigate their probable interactions with a molecular receptor through recognition in a biological process. Subsequently, PR techniques are used to construct models that will be applied later to a forecast set constructed with the accumulated perceptions in the MEP studies.

MEP and chemometrics techniques as tools for the design of bioactive compounds: a brief review
According to the literature, MEP [1,3] has been a tool of quantum chemistry used by researchers for several decades to study and understand the relationships between structure and activity of molecules. Among the papers that point out the importance of this tool in the matter, and consequently in the planning of bioactive compounds, we can mention those reported by Bernardinelli et al. [23] and by Jefford et al. [24].
Another tool, in the form of a set of techniques has been used emphatically over the years in the understanding of the structure-activity relationship of molecules is Chemometrics [25][26][27]. This set of techniques has also enables the planning of new biologically active compounds, and most of the developed research is focused on the construction of QSAR (quantitative structure-activity relationship) models.
The combination of MEP and chemometrics as tools for designing new bioactive compounds has almost always been focused on the elaboration of quantitative models, for example, the CoMFA methodology [28]. This methodology was developed in the late 1980s by Cramer et al. [29]. Its application is richly extensive and recently it has been used in several studies of structure-activity relationships of bioactive  [31]. Cramer applied the CoMFA methodology for a large majority of 116 biological targets and obtained acceptable 3D-QSAR models [32]. Cramer et al. introduced in the literature a novel alignment methodology for training or test set structures in 3D-QSAR [33]. Dong et al. performed QSAR analyses of aromatic heterocycle thiosemicarbazone analogues for finding novel tyrosinase inhibitors [34]. Dong et al. built 3D-QSAR models of dabigatran analogues as thrombin inhibitors [35]. Ding et al. performed 3D-QSAR models of 6-aryl-5-cyanopyrimidine derivatives to explore the structure requirements of LSD1 inhibitors [36].
Applications of MEP to investigate the key features of compounds that are necessary for their biological activities and thus proposing new derivatives as well as the construction of chemometric models as indicative of the most promising among the new derivatives for syntheses and biological assays were reported by us in literature [37][38][39][40][41][42][43]. Pinheiro

Biological recognition process ligand/receptor through the molecular electrostatic potential
The MEP is also suitable for analyzing processes based on the "recognition" of one molecule by another as in drug-receptor and enzyme-substrate interactions, because it is through their potentials that the two species first "see" each other [2,3,[44][45][46].
MEP for the electronic density is a very useful property for understanding the site of electrophilic attack and nucleophilic reactions as well as the hydrogen bonding interactions [46]. The MEP at a given point (x, y, z) in the vicinity of a molecule is defined in terms of the interaction energy between the electrical charge generated from the molecule's electrons and nuclei and a positive charge test (a proton) located at → r . Being a real physical property, MEP can be determined experimentally by diffraction or by computational tools [3]. For the studied nitrofuran molecules, the MEP values were computed through Eq. (1) [45] Cheminformatics and Its Applications 4 where K is the number of nuclei with charges Z j , located at position R j and ρ ( → r ) is the electronic charge density. The first term on the right side of Eq. (1) represents the contribution of the nuclei, which is positive; the second term brings in the effect of the electrons, which is negative. In the investigation of the reactive sites of nitrofuran compounds, the MEP was evaluated through of the HF/6-31G method.

Principal component analysis (PCA) technique
When computing large multivariate data, it is mandatory to find and reduce unknown data trends using exploratory tools. The main idea of the PCA technique is to reduce the dimensionality of a data set consisting of large numbers of interrelated variables while retaining the variation present in the data set as much as possible. This can be achieved by transforming them into a new set of variables, the PCs, which are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables. As the final result, the PCA technique performs the selection of a small number of variables (molecular properties) considered better related to the dependent property or feature [67], in this study, the biological activity against T. cruzi.

Hierarchical cluster analysis (HCA) technique
This technique has become, together with PCA, another important tool in pattern recognition [67]. The purpose of using it is to display the data in such a way as to emphasize its natural clusters and patterns in a two-dimensional space. The results are presented as dendrograms. In HCA technique, the distances between objects or variables are calculated and computed through the similarity index which ranges from zero, that is, no similarity and large distance among objects, to one, for identical objects.

K-nearest neighbor (KNN) technique
The KNN technique [67] classifies the objects based on distance comparison among them. The multivariate Euclidean distances between every pair of objects with known class membership are calculated. The closest K objects are used to build the model. The optimal K is determined by cross-validation applied to the training set objects. The classification of a test object is determined based on the multivariate distance of this object with respect to the K objects in the training set. In this technique no assumption is made about the size and shape of the training set classes.

Stepwise discriminant analysis (SDA) technique
This technique separates objects from distinct populations and allocates new objects into populations previously defined. It uses a stepwise procedure in which, at each step, the most powerful variable is entered into the discriminant function. The SDA technique is anchored in the F-test for the significance of variables and at each step selects a variable based on its significance, and, after several steps, the most significant variables are extracted from the set in question [20,68]. where Q is typically greater than one. It provides three possible outcome predictions: the sample fits only one pre-defined category, the sample does not fit any of the pre-defined categories, and the sample fits into more than one pre-defined category [67].

Computers, software, compounds, and molecular descriptors
For the present chapter, we performed molecular calculations on an AMD PHENOM 955 X4 2.2 GHz processor with 4 Gb of RAM with the Gaussian 98 program package [69]. The MEP was computed from the electronic density, and the maps were displayed using the MOLEKEL software [70], while the PR models were carried out on a PC Pentium machine with the Pirouette program [71]. Figure 1 shows the 2D structure of the 5-nitrofuran-2-aldoxim molecule [72] used in the selection of method/basis set (see Section 3.1.3.1). In Figures 2 and 3 the 2D structures of the nitrofuran compounds from the training [73][74][75] and prediction sets are displayed, respectively. In this work, the nitrofuran molecules were defined as more active against T. cruzi, when in vitro growth rate inhibition (GR) T. cruzi ≥ 75, and as less active when in vitro growth rate inhibition T. cruzi < 75.
In general, the structure-activity relationship shows that for the compounds 1-6, the increase in the carbon chain improves the activity against T. cruzi. The comparison between compounds 3 and 2 evidences increased activity by the substitution of the N atom by O. We can also notice that increasing the number of unsaturations and returning the nitrogen to the chain will lead to a decrease in biological activity (7,8). Still in relation to compound 1, increasing the unsaturations, returning the atom of O, and increasing the carbon chain length (9-12) substantially increase the activity against T. cruzi. On the other hand, in compounds 13 and 14, returning to an unsaturation in the main chain and introducing electron-withdrawing groups and more electronegative atoms, there is a decrease in chagasic activity. This evidence can also be verified for compounds 16, 17, 19-22.
The molecular descriptors were obtained for the most stable conformation of each compound. These descriptors were computed to give information about the influence of electronic, steric, hydrophilic, and hydrophobic features on the antitrypanosomal activity of the studied nitrofurans. The atomic charges in this work were derived from the electrostatic potential obtained with HF/6-31G method/basis set as implemented in the Gaussian program package. The electrostatic potential is obtained through the calculation of a set of punctual atomic charges so that it represents the possible best quantum molecular electrostatic potential for a set of points defined around the molecule [76,77]. The charges derived from electrostatic potential present the advantage of being, in general, physically more satisfactory than the charges of Mülliken [78], especially with regard to biological activity.
Molecular holistic (MH) descriptors were included with the purpose of representing different sources of chemical information in terms of molecular size, symmetry, and distribution of atoms in molecules. Also, we include topologic indices, connectivity indices, geometric descriptors, 3D-MoRSE descriptors, and Moriguchi octanol-water partition coefficient (MlogP). These descriptors were calculated with the Dragon software [80].

Quantum-chemical approach and basis set selection for the description of the geometries of nitrofurans
The advantage in using the PCA and HCA techniques in this step was that all structural information are considered simultaneously and it takes into account the correlations among them. Table 1 shows the theoretical and experimental structural information (bond lengths and bond angles) of the geometry of the 5-nitrofuran-2-aldoxim molecule. It was used with the aim to select using PCA and HCA techniques, which quantum-chemical approach and basis set give results closest to the experimental data [72].
The first two principal components explain 86.02% of the original information as follows: PC1 = 58.01% and PC2 = 28.02%. The PC1 versus PC2 scores plot is shown in Figure 4, from which it can be seen that the methods are discriminated into two classes according to PC2. The semiempirical approaches (AM1 and PM3) are at the top of the graph, while the other theoretical (HF, BLYP, and B3LYP) approaches and experimental data are at the bottom. Moreover, it can be seen that the HF/6-31G approach/basis set is the closest to the experimental data, indicating that they should be used in the development of this work.
Also, to investigate the most appropriate approach and basis set for further calculations, we used HCA. Figure 5 shows the dendrogram obtained with complete linkage method; from this figure, we conclude that the theoretical approaches are distributed in a similar way as in PCA, i.e., HCA confirmed the PCA results. Moreover, we can observe that the HF/6-31G approach/basis set is closer to the experimental data therefore being the most suitable to carry out this work. Figure 6 shows the MEP maps for the nitrofurans in the training set. The analysis of these maps reveals that the most active compounds, in general, have the following characteristics:

MEP maps for compounds of the training set
(i) Compounds with an unsaturation and presenting O atom neighboring the carbonyl in the carbonic chain present greater electron density in the proximities of the furan ring with the decrease of the chain size. In these compounds (4, 5, and 6) (ii) Compounds with double unsaturation, containing O atom neighboring the carbonyl, raising the carbon chain, increase the electron density in the atoms of the     (23), the MEP map shows a negative region (red and yellow) between −73.10 and − 1.59 kcal/mol on the mentioned atoms and positive region between +5.56 and 69.91 kcal/mol (green and blue). The electron density around the nitro group, the O atom of the furan ring, and other atoms may induce the nitrofurans to show antitrypanosomal activity, suggesting the complexation in those regions with the active site of the receptor in a biological recognition process.
From the above discussion, as a rule, to plan more active nitrofurans, we can assume we resort to one of the basic structures of the most active compounds and introduce groups of atoms or substituents electron donors enhancing the key structural features that are necessary for their activities.

Chemometric modeling
To perform the chemometric modeling, all variables were auto-scaled as preprocessing so that they could be standardized and so they could have the same importance regarding the scale. Furthermore, given a large quantity of multivariate data available, it was necessary to reduce the number of variables. Thus if any two descriptors had a high Pearson correlation coefficient (r ˃ 0.8), one of the two was excluded from the matrix at random, since theoretically they describe the same property [88]; they also have a high correlation with antitrypanosomal activity, and only one of them is enough to be used as independent variable in a predictive model.

PCA model
Four molecular descriptors were selected for PCA model. The molecular descriptors (QN1, gap energy, Mor05u, and MlogP), in vitro T. cruzi growth inhibition (experimental data), and activity and correlation matrix including all data for 23 nitrofurans can be seen in Table 2. The correlation between descriptors is less than 0.786. The first three principal components (PCs) describing 96.48 of the original information for the 23 are as follows: 45.70, 30.91, and 19.87%. PC1-PC2 scores for the samples are shown in Figure 7. From this figure, we can see that the nitrofurans are distributed into two distinct regions in PC1. The more active compounds are on the left side (4-7, 10-12, 18, and 23) and the less active on the right side (1-3, 8, 9, 13-17, and 19-22). According to Figure 8, the MlogP descriptor is responsible for displaying more active compounds on the left side, while the gap energy, QN1, and Mor05u descriptors displayed fewer active compounds for the right side from this figure.

HCA model
The results of the HCA model are displayed in the dendrogram in Figure 9 and are similar to those of PCA model. The nitrofurans are fairly well grouped according to their activity. From this figure, the two clusters (+ and −) mirror the same two classes displayed by PCA model (Figure 7). Table 4 shows the results for the KNN models obtained with the KNN technique and constructed with one (1NN) to four (4NN) nearest neighbors. To all models the percentage of correct information was 100%. We used the model 4NN because the greater the number of the nearest neighbors, the better the reliability of the KNN technique, and the same was used for validation of the training set from Figure 2.

SDA model
In the construction of the SDA model, the discrimination functions for groups more active and less active, respectively, are given below: Also, through the discrimination functions, Eqs. (3) and (4), and of the value of each descriptor for the nitrofurans, we obtain the classification matrix by using all compounds from the training set ( Table 5). The classification error was 0.00% resulting in a satisfactory separation of more active and less active compounds. From SDA model, the allocation rule was derived when the activity against T. cruzi of new nitrofurans is investigated: (a) initially calculate, for the new compound, the value of the most important descriptors obtained in the construction of the SDA model, (b) put these auto-scaled values in the two discrimination functions  In order to check the reliability of the model, the "leave-one-out technique" was employed. One nitrofuran compound is excluded from the data set, and the remaining compounds are used in building the classification functions.
Subsequently, the removed analogue is classified according the generated classification functions. In the further step, the omitted compound is included, and a new nitrofuran is removed, and the procedure goes on until the last compound is removed. In Table 6 the results obtained with the cross-validation model are summarized.

SIMCA model
The SIMCA model were built with the same descriptors as PCA, HCA, KNN, and SDA models and used two (2) PCs in the modeling of the two classes: more active nitrofurans (4-7, 10-12, 18, and 23) and less active (1-3, 8, 9, 13-17, and 19-22) nitrofurans. In Table 7, the obtained results for the SIMCA model are shown. In this case, the information percentage was also 100%. According to the PCA, HCA, KNN, SDA, and SIMCA models, we can also notice that the QN1, gap energy, Mor05u, and MlogP descriptors are key properties for explaining the anti-T. cruzi activity of the nitrofurans training set (Figure 2).
As QN1, gap energy, Mor05u, and MlogP properties were selected in the chemometric modeling as the most important characteristics to describe the antitrypanosomal activity, some considerations about them may be relevant to the understanding of the behavior of more active nitrofurans. According to classical chemical theory, chemical interactions can be classified in two categories: electrostatic (polar) or orbital (covalent). Electrical charges in the molecule are indubitably the impelling cause of electrostatic interactions. It has been demonstrated that local electron densities or charges are important in many chemical reactions, physicochemical properties, and ligand-receptor interactions [89,90]. Thus, charge-based parameters have been widely employed as chemical reactivity indices or as measures of weak intermolecular interactions. Many quantum-chemical descriptors are derived from the partial charge distribution in a molecule or from the electron densities on particular atoms [91]. From Table 2, we can observe that, in general, QN1 for more active analogues must present lower values than the less active ones. This is an indication that biological processes can occur through electrostatic interactions between the more active nitrofurans and an eventual biological receptor.
Gap energy is an important stability index. A large gap energy implies high stability for the molecule in the sense of its lower reactivity in chemical reactions.   It is an approximation of the lowest excitation energy of the molecule and can be used for the definition of absolute and activation hardness [89,90]. In Table 2, we can observe that, in general, the more active nitrofurans present lower gap energy than the less active ones. This indicates that the more active nitrofurans have a great probability of interacting with the biological receptor through a charge transfer mechanism.
Mor05u is a 3D-MoRSE descriptor based on the idea of obtaining information from 3D atomic coordinates through the transformed used in electrons diffraction studies [91] and is strictly related to the stereochemistry of the compounds [92]. According to Table 2, the more active nitrofurans present lower values of Mor5u. This may be, in general, an indication of the importance of the stereochemical properties of the more active nitrofurans in a possible mechanism of action of its own.
MlogP is an important hydrophobic descriptor in diverse biochemical, pharmacological, and toxicological processes involved in drug absorption [93]. As identified in Table 2, the more active reported nitrofurans exhibit the higher MlogP values. This is an indication that in processes involving nitrofurans and a biological receptor, hydrophobic interactions may be important in the mechanism of action of these compounds.
Knowing the performance of the RP models constructed for the 23 studied nitrofurans, we decided to apply them to a series of eight compounds (Figure 3) designed to maintain the key structural features that are necessary for their biological activities evidenced by the MEP maps of the compounds of the training set. The basic nucleus of these compounds corresponds to that of the most active nitrofurans with double unsaturation, containing vicinal O atom to carbonyl (see compounds 10-12). The eight molecules proposed for the study of prediction of activity were drawn with the help of one of the collaborators of this work, who belong to the research group in organic chemistry of the Federal University of Pará, Brazil, and the most promising syntheses are in progress. In the future, antitrypanosomal tests with the most promising nitrofurans can be used to validate our RP models.
The results obtained of the application of the PR models (PCA, HCA, KNN, SDA, and SIMCA) and the descriptors for the compounds of the prediction set are summarized in Tables 8 and 9, respectively. In Table 8, the compounds 25 and 30 were predicted as more active against T. cruzi with the five models. Only the KNN model predicted compound 26 as the most active. Meanwhile, only the PCA and HCA models predicted compound 31 as the most active. On the other hand, all models, except the SDA model, predicted compounds 24, 27, and 28 as the most active. In turn, the SIMCA model did not classify compounds 29 and 31 into any of the two classes. Thus, we can consider nitrofurans 25 and 30 as potentially more active in a future test against T. cruzi. For the values reported for compounds 25 and 30 (Table 9), it can be shown that in order to design more active nitrofurans we must combine smaller values for the descriptors QN1, gap energy, and Mor05u with higher value for the descriptor MlogP.   Figure 10 shows the MEP maps for the most active nitrofurans in the validation set (25 and 30). Also, in these compounds, as can be seen, raising the carbon chain increases the electron density in the atoms of the nitro group, extending through the O of the furan ring to the O atoms of the ester group accompanying the unsaturated chain. In these compounds, the MEP maps show more negative values between −74.27 and − 1.76 kcal/mol (red and yellow). They exhibit positive MEP in the range + 4.84 to +57.58 kcal/mol (green and blue).

MEP maps for compounds of the prediction set
The negative MEP region of compounds 25 and 30, similar to the more active compounds in the training set, is susceptible to attack in a biological recognition process.

Concluding remarks
MEP and chemometric techniques in the last decades have become efficient tools in the study of the structure-activity relationships of bioactive molecules. The use of such tools has occurred through the inherent principles of each or combining their potentials to more efficiently unravel information about the structure-activity relationships of pharmacologically interesting compounds. This chapter is circumscribed in this second possibility. MEP maps were constructed for 23 nitrofurans with activity against T. cruzi reported in the literature. The key structural features  Table 9.
Values for descriptors for the prediction set.
required for antitrypanosomal activity, along with chemical intuition, allowed the introduction of substituents in one of the most active nitrofurans in the training set to obtain eight new derivatives. PR models (PCA, HCA, KNN, SDA, and SIMCA) were constructed and demonstrated that 23 nitrofurans can be classified into two classes or groups: more active and less active according to their degrees of activity against T. cruzi. The properties QN1, gap energy, Mor05u, and MlogP are responsible for the classification into more active and less active studied nitrofurans. It is interesting to notice that these properties represent three distinct classes of interactions between the nitrofurans and the biological receptor: electronic (QN1 and gap energy), steric (Mor05u), and hydrophobic (MlogP). Here it is important to mention that Paulino et al., studying the influence of molecular parameters on the activity of 5-nitrofurans against T. cruzi, reported the importance of electronic properties and molecular hydrophobicity as well as the variation of the nitrofurans electronic structure to explain the greater activity of these compounds as inhibitors of the growth of this protozoan [94].
The results of the application of PR models on the validation set evidenced two nitrofurans (25 and 30) as more promising for synthesis and biological assays, which in the future can be used to validate our PR models.