Table 1. Average performances of QSAR models on test sets selected with the hierarchical clustering, k-means clustering, MaxMin, representativeness, and Pareto-based methods. Averaging was performed over the two data sets, the five model building algorithms, and the different sizes of the selected test sets.
Modeling complex phenomena in chemoinformatics and material-informatics can often be formulated as single-objective or multi-objective optimization problems (SOOPs or MOOPs). For example, the design of new drugs or new materials is inherently a MOOP since drugs/materials require the simultaneous optimization of multiple parameters.
Keywords: simulated annealing, outlier removal, machine learning
Modeling complex phenomena in chemoinformatics and material-informatics can often be formulated as multi-variable/single-objective or multi-variable/multi-objective optimization problems. Common examples of the former include the sampling of complex energy landscapes using conformational search methods, docking or molecular simulation techniques [2, 3], the derivation of statistical models, namely quantitative structure–activity relationship (QSAR) models [4, 5], and diversity or representativeness analysis [6, 7]. The design of new compounds with pharmaceutical relevance, on the other hand, is inherently a multi-objective optimization problem (MOOP), since drugs require the simultaneous optimization of many parameters and consequently constitute a compromise between often conflicting requirements. In a similar manner, the design of new materials could also be regarded as a MOOP. For example, the design of new photovoltaic cells requires the simultaneous optimization of the current and the voltage.
This chapter focuses on optimization algorithms from three different areas: (1) Representativeness analysis, that is, the selection of a representative subset from within a parent data set. Representativeness analysis has multiple applications in chemoinformatics and material-informatics, for example, for rationally selecting subsets of compounds for experimental analysis and for rationally partitioning a parent data set into a modeling set and a test set. (2) Derivation of predictive, nonlinear machine-learning models correlating activities with descriptors while inherently incorporating feature selection to identify the most relevant descriptors. Such models, the so-called QSAR models, find many uses in chemistry, biology, environmental sciences, and material sciences to predict the activities of new compounds/materials and to provide insight into the factors governing these activities. (3) Outlier removal, to clean a parent data set of objects (e.g., compounds) that may demonstrate abnormal behavior. Outlier removal is a mandatory step prior to the derivation of any statistical model. In all cases, the corresponding problems (i.e., how to select a representative subset from within a parent data set, how to remove outliers from within a parent data set, and how to build predictive QSAR models) were formulated as either single-objective optimization problems (SOOPs) or MOOPs. These problems were then solved using Monte Carlo (MC)/simulated annealing (SA) or genetic algorithms (GAs) as the optimization engines.
Finally, while developed independently and with multiple potential applications in mind, all algorithms could be used as components of machine-learning workflows for data mining and the derivation of QSAR models (see Section 5).
This chapter is organized as follows: We begin with a short introduction to SOOPs and MOOPs (Section 2), followed by a description of two of the most common optimization engines, MC/SA (Section 3.1) and GA (Section 3.2). Section 4 provides a description of the different optimization-based algorithms, and Section 5 lists a few representative examples. Finally, Section 6 concludes the chapter.
2. Single-objective and multi-objective optimization
As noted above, multiple problems in chemoinformatics and material-informatics could be formulated as optimization problems. Such a formulation requires the definition of a target (or objective) function(s) (f) and a set of variables (X1, X2, X3,…,Xn), which are related to the scientific problem of interest, and which together define a complex, multi-dimensional surface with an a priori unknown distribution of optima. The task then is to locate the global optimum or preferably, since phenomena in these fields are rarely governed by a single solution, a set of optima.
Optimization problems could be broadly divided into two categories, namely SOOPs and MOOPs, depending on the number of target functions which should be simultaneously optimized. SOOPs search for the best optimum on a surface defined by a single target function and its variables. While this might be a difficult task, in particular for complex, multi-dimensional surfaces, a solution exists and a thorough enough search of the space, at least in theory, is bound to locate it. This, however, is not the case for MOOPs, which extend optimization theory by permitting several design objectives to be optimized simultaneously. The principle of MOOP was first formalized by Pareto. In MOOP, a single solution that outperforms all other solutions in all objectives does not necessarily exist. Instead, several equally good (termed non-dominated) solutions exist, representing various compromises among the objectives. A solution is said to be non-dominated if no other solution is better than it in all objectives, that is, any solution that improves on it in one objective is worse in at least one other objective. The set of non-dominated solutions represents the Pareto front. This is illustrated in Figure 1, where each circle represents a solution to the problem. The curved line represents the Pareto front. Each solution is designated a Pareto rank that is based on the number of solutions which dominate it. The solid circles are non-dominated solutions, which have a Pareto rank of zero and fall on the Pareto front. Dominated solutions are shown as empty circles, and the number of solutions which dominate them is written in parentheses.
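The Pareto-rank bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the chapter: the function name and the toy two-objective solutions are invented, and both objectives are minimized.

```python
def pareto_ranks(solutions):
    """Pareto rank = number of solutions dominating a given solution
    (both objectives minimized here); rank 0 means non-dominated."""
    def dominates(a, b):
        # a dominates b: no worse in every objective, strictly better in one
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return [sum(dominates(other, s) for other in solutions) for s in solutions]

# four candidate solutions of a two-objective minimization problem
solutions = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)]
ranks = pareto_ranks(solutions)  # the first three lie on the Pareto front
```

The rank-0 solutions form the Pareto front; (3.0, 3.0) is dominated by (2.0, 2.0) and therefore receives rank 1.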
Due to the complex nature of many of the chemoinformatics and material-informatics-related target functions (e.g., nonlinearity, non-continuity, non-derivability), non-derivative-based, stochastic optimization algorithms should be used as the optimization engine. Several global stochastic algorithms including GAs, genetic programming, particle swarm optimization, MC/SA, and the iterative stochastic elimination method are reported in the literature. The algorithms presented in this chapter utilize MC/SA- and GA-based optimizers to solve SOOPs and MOOPs.
3. Optimization engines
3.1. Monte Carlo/simulated annealing
MC methods, and in particular the Metropolis variant, use random moves in order to optimize a multi-dimensional space defined by a dependent target function and a set of independent variables.
Metropolis MC starts from an initial random position. At each iteration, a new position (positioni+1) is randomly generated by a random displacement from the current position (positioni). The “energy” (i.e., the value of the target function) of the resulting new position is computed, and ΔE, the energetic difference between the new and the current positions (ΔE = E(positioni+1) − E(positioni)), is determined (Equation 1).
Thus, if the new position has a lower value of the target function (commonly referred to as lower energy), the transition is accepted. Otherwise, a uniformly distributed random number between 0 and 1 is drawn, and the new position is only accepted if the Boltzmann probability, exp(−ΔE/kT), is higher than or equal to the random number as defined in Equation 3, where kT is the effective temperature.
MC simulations are often coupled to a SA procedure and usually called SA optimization. SA gradually decreases the temperature according to a predefined cooling schedule (Figure 2). This procedure controls the probability of acceptance or rejection of high-energy moves. The SA procedure increases the probability of locating the global minimum.
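The Metropolis acceptance rule and the geometric cooling schedule described above can be sketched as follows. This is a minimal illustration under invented assumptions: the step size, temperature bounds, cooling factor, and the quadratic test function are arbitrary choices, not values from the chapter.

```python
import math
import random

def simulated_annealing(f, x0, step=0.5, t_start=1.0, t_end=1e-3,
                        cooling=0.95, moves_per_temp=50, seed=42):
    """Metropolis MC coupled to a geometric SA cooling schedule (1-D case)."""
    rng = random.Random(seed)
    x, e = x0, f(x0)
    best_x, best_e = x, e
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            x_new = x + rng.uniform(-step, step)       # random displacement
            e_new = f(x_new)
            delta = e_new - e                          # Equation 1
            # accept downhill moves; uphill moves with Boltzmann probability
            if delta <= 0 or math.exp(-delta / t) >= rng.random():
                x, e = x_new, e_new
                if e < best_e:
                    best_x, best_e = x, e
        t *= cooling                                   # cooling schedule
    return best_x, best_e

# minimize a simple quadratic with its minimum at x = 2
x_opt, e_opt = simulated_annealing(lambda x: (x - 2.0) ** 2, x0=10.0)
```

At high temperature most uphill moves are accepted, allowing barrier crossing; as the temperature drops, the search increasingly behaves like a pure downhill walk.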
3.2. Genetic algorithms
GAs are global optimizers designed from evolutionary principles. Such algorithms evolve a set of chromosomes, each corresponding to a unique solution to the optimization problem, using a set of genetic operators such as selection of the fittest, mutation, and crossover. The selection of individuals to be subjected to the genetic operators is governed by their fitness values (calculated by a fitness function related to the specific scientific problem) so that chromosomes which represent a better solution to the optimization problem are given more chances to “reproduce” than those chromosomes which are poorer solutions. This process iterates multiple times (generations) until no improvement in the fitness function is observed or until the predefined number of generations has been exhausted. By considering, at each generation, multiple rather than single chromosomes, GAs provide multiple solutions to the optimization problem.
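A minimal GA illustrating the operators described above (selection of the fittest via tournaments, one-point crossover, and bit-flip mutation). The toy "OneMax" fitness function (count of 1-bits) and all parameter values are illustrative choices, not from the chapter.

```python
import random

def genetic_algorithm(fitness, n_genes=20, pop_size=30, generations=60,
                      mutation_rate=0.02, seed=1):
    """Minimal GA: tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]

    def tournament():
        # fitter chromosomes get more chances to "reproduce"
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_genes)            # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [g ^ 1 if rng.random() < mutation_rate else g
                     for g in child]                    # bit-flip mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# toy fitness: number of 1-bits ("OneMax"); the optimum is the all-ones chromosome
best = genetic_algorithm(fitness=sum)
```

Because a whole population is evolved, the final generation contains many good chromosomes, not just the single best one returned here.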
4. Optimization-based algorithms
4.1. Representativeness and diversity analysis
Advances in combinatorial chemistry and high-throughput screening techniques have greatly expanded the number of drug-like compounds that could be synthesized and tested. However, even then, only a small fraction of the accessible chemistry space could be synthesized and screened. This in turn has led to the development of rational approaches for the design of screening libraries and in particular libraries composed of diverse compounds (i.e., compounds largely differing from one another). Hassan et al. formulated the selection of a diverse subset from within a parent data set as an optimization problem using several diversity functions. In particular, the MaxMin function calculates the square of the minimal distance, dij², over all (i, j) pairs comprising the selected subset according to Equation 4.
where dij is the distance between compounds i and j, and the summation runs over all the descriptors (features). The MaxMin function is optimized (maximized) by means of a SA algorithm to produce the subset with the largest value (i.e., the most diverse subset).
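The MaxMin function (Equation 4) and its stochastic maximization can be sketched as follows. For brevity, this sketch accepts only improving single-compound swaps rather than running a full SA schedule, and the toy two-descriptor "pool" is hypothetical.

```python
import random

def maxmin(subset):
    """Equation 4: minimal squared Euclidean distance over all (i, j) pairs,
    summing over the descriptors of each compound."""
    return min(
        sum((a - b) ** 2 for a, b in zip(x, y))
        for i, x in enumerate(subset) for y in subset[i + 1:]
    )

def select_diverse(pool, k, steps=2000, seed=0):
    """Maximize MaxMin by random single-compound swaps (a crude MC sketch)."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(pool)), k)
    best = maxmin([pool[i] for i in chosen])
    for _ in range(steps):
        trial = chosen[:]
        trial[rng.randrange(k)] = rng.randrange(len(pool))  # swap one member
        if len(set(trial)) < k:                             # skip duplicates
            continue
        score = maxmin([pool[i] for i in trial])
        if score > best:                                    # keep improvements
            chosen, best = trial, score
    return chosen, best

# toy 2-D "descriptor space": a tight cluster plus a few distant points
pool = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 0.0), (0.0, 5.0), (5.0, 5.0)]
chosen, score = select_diverse(pool, k=3)
```

The optimizer avoids picking two compounds from the tight cluster, since any such pair would drag the minimal pairwise distance down.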
Treating the selection of a diverse subset as an optimization problem has the advantage that this objective could be combined with other objectives into a MOOP. For example, it is possible to select a subset that best balances its internal diversity, pharmacological profile, and price.
However, diverse subsets are often biased toward the inclusion of outliers and as a result do not represent the parent data set well. In drug discovery, focusing on outliers may translate into selecting and testing “extreme” compounds, for example, compounds with too many functional groups or compounds with too high molecular weights. Such compounds are likely to be difficult to optimize into drug candidates. Unless such “extreme” compounds could be easily detected and removed (e.g., if their “extremeness” results from a single property), they are likely to remain in the data set.
Thus, instead of selecting diverse subsets, it might be advisable to select representative subsets, which better mirror the distribution of the parent data set. If properly selected, such subsets will include compounds which are different from one another yet are not “bizarre.” However, despite their potential usefulness, representative subsets have gained much less attention than diverse subsets.
Representative subsets could be selected using clustering algorithms such as hierarchical clustering and k-means clustering. Both partition an input data set into a predefined number of clusters. Hierarchical clustering produces a dendrogram of clusters in which the root node contains all the compounds and each leaf node contains a single compound. Divisive hierarchical clustering starts at the root node and iteratively divides clusters until the leaf nodes are reached. Agglomerative hierarchical clustering starts with the leaf nodes and iteratively combines closest-neighbor clusters until the root node is reached. Extracting from the dendrogram a specific number of clusters and selecting a compound from each cluster leads to a representative/diverse subset. k-means clustering operates by first selecting at random a user-defined number of seeds (k) and then assigning each compound to its closest seed. This leads to the formation of k initial clusters. The centroids of all clusters are then calculated, and compounds are re-assigned to their closest centroid. This process is repeated until the clusters are stable, that is, no compounds are re-assigned. Again, selecting a compound from each cluster provides a representative/diverse subset.
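The k-means procedure just described (random seeds, assignment to the closest centroid, centroid recalculation, repeat until the clusters are stable) can be sketched as below; the toy two-cluster point set is hypothetical.

```python
import random

def kmeans(points, k, iterations=100, seed=3):
    """Plain k-means: random seeds, assign to closest centroid, recompute,
    repeat until no compound is re-assigned."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                               # assign to closest centroid
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:                 # clusters are stable
            break
        centroids = new_centroids
    return clusters, centroids

# two well-separated groups; one compound per cluster gives a representative pair
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
clusters, centroids = kmeans(points, k=2)
```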
In 2014, Yosipof and Senderowitz introduced an optimization algorithm for the direct selection of a representative subset from within a data set. The algorithm uses a MC/SA procedure to optimize a representativeness function based on pairwise distances between subset and data set compounds. The algorithm consists of the following steps (Figure 3):
1. Characterize each compound in the data set by a set of descriptors.
2. Normalize the descriptors by converting them into Z-scores according to Equation 5, where μ is the mean and σ is the standard deviation. The resulting scores are then converted to the [0, 1] range by calculating the cumulative probability assuming a normal distribution (μ = 0; σ² = 1).
3. Select a random subset s of size k from within the l compounds comprising the data set.
4. Calculate the Euclidean distances between compound i from the data set and all k compounds comprising subset s.
5. Take the minimum Euclidean distance from step (4) as the score for compound i.
6. Repeat steps (4) and (5) for all (l − k) compounds remaining in the data set.
7. Calculate the average score over all (l − k) compounds. This average, score(s), characterizes subset s.
8. Minimize score(s) through a MC/SA procedure. At each step replace, at random, a single compound from s with a candidate compound from the unselected portion of the data set, calculate a new score, score(s′), and accept or reject the move according to the MC/SA algorithm.
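Steps 3–8 above can be sketched as follows. For simplicity, this sketch accepts only score-lowering swaps instead of a full MC/SA schedule, and the two-cluster toy data are hypothetical.

```python
import math
import random

def representativeness_score(data, subset_idx):
    """Average, over compounds outside the subset, of the minimal Euclidean
    distance to any subset member (steps 4-7; lower is more representative)."""
    subset = [data[i] for i in subset_idx]
    rest = [p for i, p in enumerate(data) if i not in subset_idx]
    return sum(min(math.dist(p, s) for s in subset) for p in rest) / len(rest)

def select_representative(data, k, steps=3000, seed=7):
    """Minimize the score by random single-compound swaps (step 8, simplified
    to accept only downhill moves rather than a full SA schedule)."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(data)), k))
    best = representativeness_score(data, chosen)
    for _ in range(steps):
        out = rng.choice(sorted(chosen))                     # leave the subset
        cand = rng.choice([i for i in range(len(data)) if i not in chosen])
        trial = (chosen - {out}) | {cand}
        score = representativeness_score(data, trial)
        if score < best:
            chosen, best = trial, score
    return chosen, best

# two tight clusters; a representative pair takes one compound from each
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
chosen, score = select_representative(data, k=2)
```

Unlike MaxMin, this objective is minimized, which pulls the selected compounds toward the populated regions of the data set rather than toward its extremes.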
Similar to diversity analysis, a major advantage in treating the selection of a representative subset as an optimization problem is the ability to combine it with additional objectives into a MOOP [20, 21]. This was demonstrated by the Pareto-based optimization of the representativeness and MaxMin functions. The Pareto algorithm evaluates MaxMin (Equation 4) and the representativeness function for a selected subset (termed a solution to the MOOP) and assigns to it a Pareto rank based on the number of solutions dominating it. In this case, solution i dominates solution j if MaxMin(i) < MaxMin(j) and Score(i) < Score(j) (where Score is calculated according to step 7 given above). Under this dominance criterion, the value of MaxMin is minimized rather than maximized, in contrast with the original implementation of this function for diversity selection. MaxMin minimization biases the selected subset toward the more populated regions of the database, allowing the two functions (MaxMin and representativeness) to work in concert. The Pareto rank is then minimized using Metropolis Monte Carlo, and the solutions with rank = 0 (i.e., non-dominated solutions) are kept to construct the Pareto front. Finally, a solution on the Pareto front is randomly selected. Alternatively, all the solutions could be presented to the user for manual inspection, evaluation, and selection.
Representative subsets could be used under two general scenarios: (1) Results obtained for a representative subset are used to infer the properties of the parent data set. For example, the biological evaluation of a representative subset could provide information on the activities of the entire parent data set. Thus, testing only a representative subset will provide a similar amount of information as would have been gained by testing the entire data set. (2) A representative subset is selected from within a parent data set, set aside, and used to validate models generated by machine-learning algorithms. In the area of chemoinformatics, such models are known as quantitative structure–activity relationship (QSAR) models (see Section 4.2).
To evaluate the representativeness algorithms under the first scenario, Yosipof and Senderowitz selected a subset of 200 compounds from the Comprehensive Medicinal Chemistry (CMC) database using five representativeness/diversity algorithms, namely representativeness optimization, Pareto-based optimization, hierarchical clustering, k-means clustering, and MaxMin optimization. The CMC database contains 4,855 pharmaceutical compounds classified into 105 different biological indications. The degree to which each subset is able to represent the parent database (i.e., include a similar distribution of indications) was estimated using the χ2 goodness-of-fit test. In this test, the null hypothesis (H0) states that the distributions of biological indications within the subset and the database are similar. In contrast, the H1 hypothesis states that these distributions are significantly different. The objective is therefore to retain the null hypothesis. The χ2 statistic is defined as follows:
where Oi and Ei represent, respectively, the observed and expected frequencies of biological indication i in the 200-compound subset. Ei is derived from the frequency of indication i in the parent database. The results of this test demonstrated that the distributions of indications within the subsets selected by the representativeness optimization and the Pareto-based optimization are statistically indistinguishable (p-value >0.05) from that in the parent database. In contrast, subsets selected by k-means clustering, hierarchical clustering, and MaxMin display distributions which are markedly different from the database (p-value <0.05).
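As an illustration of the test, here is a hypothetical three-indication example (all counts are invented for the sketch): expected counts come from the database fractions, and the statistic is compared against the 5% critical value of the χ2 distribution with two degrees of freedom.

```python
# Hypothetical database fractions for three biological indications; they give
# the expected counts in a 200-compound subset. Observed counts are invented.
database_fractions = {"analgesic": 0.50, "antibiotic": 0.30, "antiviral": 0.20}
subset_size = 200
observed = {"analgesic": 104, "antibiotic": 57, "antiviral": 39}

# chi2 = sum over indications of (O_i - E_i)^2 / E_i
chi2 = sum(
    (observed[ind] - subset_size * frac) ** 2 / (subset_size * frac)
    for ind, frac in database_fractions.items()
)
# 2 degrees of freedom; a chi2 below the 5% critical value (5.991) retains H0,
# i.e., the subset distribution is indistinguishable from the database's
representative = chi2 < 5.991
```

In practice, libraries such as `scipy.stats.chisquare` also return the p-value directly; the manual version above only makes the arithmetic of the statistic explicit.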
To evaluate the representativeness algorithms under the second scenario (a representative subset is selected, set aside, and used to validate models built on the rest of the data set), the representativeness algorithm was incorporated into a workflow developed for the derivation of predictive QSAR models using machine-learning algorithms. Following the work of Tropsha and others, it is today recognized that the predictive power of such models could only be evaluated from their performances on external test sets. In particular, test sets that uniformly span the chemistry space of the parent data set provide reliable performance estimates for QSAR models operating in this space. It therefore follows that such performance estimates correlate with how well a set of compounds represents the parent data set from which it was selected. Yosipof and Senderowitz used the new representativeness algorithm to rationally select test sets of varying sizes from two data sets of pharmaceutical relevance (logBBB and Plasmodium falciparum inhibition) and estimated the performances of models derived with five classification techniques (decision trees, random forests, ANN, SVM, kNN) on these test sets. Similar test sets were also selected from the Pareto front generated by the simultaneous optimization of representativeness and MaxMin, as well as by k-means clustering, hierarchical clustering, and MaxMin for comparison. Model performances were estimated using the corrected classification rate (CCR; Equation 7),
where TN and TP represent the number of true negative and true positive predictions, respectively, and NN and NP represent the total number of the two activity classes.
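Equation 7 translates directly into code as the mean of the two per-class accuracies; the confusion-matrix counts in the example are hypothetical.

```python
def ccr(tp, tn, np_, nn):
    """Equation 7: corrected classification rate, the mean of the per-class
    accuracies (TP/NP for actives, TN/NN for inactives)."""
    return 0.5 * (tp / np_ + tn / nn)

# hypothetical test set: 40 actives (32 predicted correctly),
# 60 inactives (48 predicted correctly)
score = ccr(tp=32, tn=48, np_=40, nn=60)
```

Unlike plain accuracy, CCR is insensitive to class imbalance: a model that always predicts the majority class scores only 0.5.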
The results (Table 1) indicate that the best performances were obtained with the Pareto method, followed by the representativeness function and k-means clustering. The other two methods, namely hierarchical clustering and the MaxMin function, led to poorer performances. Thus, representativeness-based methods indeed produce subsets which are more representative of the parent data sets.
4.2. Derivation of predictive QSAR models
QSAR (or QSPR, quantitative structure–property relationship) is a general name for a host of methods that attempt to correlate a specific activity for a set of compounds with their structure-derived descriptors (i.e., features) by means of a mathematical model.
QSAR models take the form Ai = f (D1, D2,…, Dn) where Ai is the dependent variable representing the activity (or any other property of interest) for a set of objects (e.g., compounds or materials), and D1, D2,…,Dn are calculated (or experimentally measured) independent variables (i.e., descriptors). f is an empirical mathematical transformation that should be applied to the descriptors in order to calculate the property values for the objects.
QSAR models are typically built according to a basic workflow which involves the following steps: (1) data collection; (2) preprocessing (data set preparation and curation, descriptors calculation, descriptors filtering); (3) model generation; (4) model validation.
Data collection involves the assembly of a data set of compounds/materials with known activities. Once collected, the data should be carefully curated, errors should be corrected (or if not possible, problematic compounds should be removed), and a set of descriptors should be obtained (calculated or measured). Finally, constant or nearly constant descriptors should be removed and usually, correlated descriptors are removed as well.
QSAR models are built using multiple machine-learning approaches. The modeling process begins with a modeling set and proceeds by performing regression-based or classification-based analysis to construct a model of the activity as a function of the descriptors. Machine-learning techniques are mostly used in this area because they can deal with very complex relationships between descriptors and activities.
There are two types of regression-based methods, namely linear and nonlinear. An example of a linear method is provided by multiple linear regression, which is extensively applied in Hansch analysis. Examples of nonlinear methods are the k-nearest neighbors (kNN) and the random forest (RF) methods. Finally, the resulting model should be validated on an external test set. An external set can be obtained by splitting the input and curated data set prior to the model development process or by obtaining additional data [26, 27].
Inherent to most QSAR methods is feature (i.e., descriptor) selection. The purpose of this process is to select from within a large number of calculated/experimental descriptors those that best correlate with the activity. This stage is required since, typically, the number of calculable descriptors by far exceeds that of compounds with measured activity data. Having too many descriptors calculated for too few compounds may lead to overfitting and chance correlation [23, 28]. In some cases, the features selected for QSAR modeling are more important than the predictive model itself. These selected descriptors contain useful information that can help in understanding and rationalizing the data and the results. The selection of a descriptor subset could be treated as an optimization problem whereby a function related to model performances is optimized in the space of the descriptors.
One algorithm that couples a feature selection procedure with a machine-learning algorithm at the model derivation stage is the kNN algorithm. This algorithm assumes that the activity of a compound could be predicted from the average activities of the k compounds most similar to it (its k nearest neighbors). This idea follows directly from the similar property principle, which states that similar compounds have similar properties. The similar property principle is well-validated in pharmaceutical sciences and was recently extended to photovoltaic cells. Since the chemical similarity between two compounds depends on the molecular descriptors used to characterize them, the algorithm searches the space of available descriptor subsets for the subset in terms of which the similar property principle is best satisfied. This is done by optimizing the leave-one-out (LOO) cross-validated value (Q2LOO, Equation 8) in the space of the descriptors (the space of the descriptors is a multi-dimensional space where each dimension corresponds to one descriptor; in this space, compounds are represented by points and the distance between any two points represents the degree of similarity between the corresponding compounds). Q2LOO is given by
where Yexp is the experimental value, YLOO is the LOO-predicted value, and Ȳ is the mean of the experimental values.
The kNN algorithm was first implemented by Zheng and Tropsha using SA as the optimization engine. In this original implementation, only the descriptors space was searched with the SA procedure, while the number of nearest neighbors (k) was evaluated exhaustively (between 1 and 5). In 2015, Yosipof and Senderowitz implemented a kNN algorithm optimizing simultaneously in the space of the descriptors and in the number of nearest neighbors. A schematic representation of the kNN optimization algorithm is provided in Figure 4.
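The quantity being optimized, the LOO cross-validated Q2 of a kNN model (Equation 8), can be sketched as follows. The toy descriptor matrix and activities are invented so that nearby compounds have similar activities, in line with the similar property principle.

```python
import math

def q2_loo(X, y, k):
    """Equation 8: Q2 = 1 - sum((Y_exp - Y_LOO)^2) / sum((Y_exp - Ymean)^2),
    where Y_LOO is the average activity of each compound's k nearest
    neighbors, the compound itself being left out."""
    n = len(X)
    y_mean = sum(y) / n
    press, ss_tot = 0.0, 0.0
    for i in range(n):
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(X[i], X[j])   # Euclidean distance
        )[:k]
        y_pred = sum(y[j] for j in neighbors) / k
        press += (y[i] - y_pred) ** 2
        ss_tot += (y[i] - y_mean) ** 2
    return 1.0 - press / ss_tot

# toy data: activity tracks the first descriptor, so neighbors in descriptor
# space have similar activities and Q2_LOO should be high
X = [(0.0, 1.0), (0.1, 0.9), (0.2, 1.1), (1.0, 1.0), (1.1, 0.9), (1.2, 1.1)]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]
q2 = q2_loo(X, y, k=2)
```

In the full algorithm, an SA (or GA) engine would call a function like this repeatedly, each time with a different candidate descriptor subset and value of k, keeping the combination with the highest Q2LOO.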
The kNN algorithm has been extensively used to build predictive QSAR models in many fields including computer-aided drug design and environmental sciences. Recently, the method has been applied in the field of material-informatics and used to predict the photovoltaic properties of solar cell libraries (see Section 5).
4.3. Outlier removal
Data sets in general, and data sets consisting of molecular compounds in particular, often contain objects (e.g., compounds) that are different in some respect from the rest of the data set. Such compounds are called outliers. The presence of outliers in a data set can affect machine-learning-related activities including model derivation, interpretation, and subsequent decision-making. In particular, outliers can compromise the ability of machine-learning algorithms to develop predictive models, since many algorithms typically used for this purpose will attempt to fit the outliers at the expense of the bulk. Thus, while outliers may point to an interesting behavior that needs to be investigated separately, they should be removed from the data set prior to model construction.
Several methods for outlier removal have been reported in the literature. Statistical estimators could be used to identify outliers if their values follow a well-defined distribution [34, 35]. These methods are called parametric methods. For example, for a Gaussian distribution, outliers could be defined as compounds with descriptor values deviating from the mean by a certain number of standard deviations. For non-well-defined distributions, nonparametric methods should be used. For example, distance-based methods identify outliers by measuring Euclidean distances between objects in a predefined descriptors space. In the basic distance-based method, outliers are defined as compounds having at least p percent (defined by the user) of their distances to the other compounds larger than a user-defined threshold distance. An improvement to this method was proposed by Ramaswamy et al. and termed the K-based method. According to this method, compounds are ranked according to their Euclidean distances to their kth nearest neighbors, and the n compounds with the highest rank (largest distances) are considered outliers. Another method is based on compound clustering and on the subsequent removal of compounds (i.e., outliers) populating small clusters (e.g., singletons). Yet another nonparametric method utilizes a variant of the support vector machine (SVM) algorithm, namely one-class SVM, which isolates the outliers' class from the rest of the compounds.
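The K-based (Ramaswamy-style) ranking described above can be sketched as below; the toy point set, with one compound placed far from a tight cluster, is hypothetical.

```python
import math

def k_distance_outliers(points, k, n):
    """K-based ranking: score each compound by the Euclidean distance to its
    k-th nearest neighbor; the n highest-scoring compounds are the outliers."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))              # distance to k-th NN
    return [i for _, i in sorted(scores, reverse=True)[:n]]

# a tight cluster plus one far-away compound (index 4)
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.0, 0.2), (9.0, 9.0)]
outliers = k_distance_outliers(points, k=2, n=1)
```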
The above-described techniques remove outliers in a single step from one predefined descriptors space and are therefore termed one-pass methods. These methods have several disadvantages. First, outliers are descriptor space-dependent, with different spaces giving rise to different outliers. Second, if several outliers are present in a data set, they may mask each other so that some will not be recognized. Finally, the removal of one outlier based on a specific descriptor space may affect the distribution of the remaining compounds either in the same space or in different spaces, leading to the potential appearance of new outliers. These challenges could be met by removing outliers in an iterative manner.
Recently, Yosipof and Senderowitz presented a new method for the iterative identification and subsequent removal of outliers identified in potentially different descriptors spaces based on the k nearest neighbors optimization algorithm (the kNN-OR algorithm). According to this approach, an outlier is defined as a compound whose distance to its kNNs is too large for its activity to be reliably predicted (see Figure 5). At each iteration, the algorithm builds a kNN model, evaluates it according to the LOO cross-validation metric (Q2LOO; see Equation 8), and removes from the data set the compound whose elimination results in the largest increase in Q2LOO. This procedure is repeated until Q2LOO exceeds a pre-defined threshold.
In the example presented in Figure 5, the removal of compound b from the data set is expected to improve the model and to increase the value of the kNN-derived Q2LOO. More generally, for a set of compounds whose activities are predicted via kNN, outlier(s) removal will lead to an increase in Q2LOO. Thus, the following procedure for outlier removal was developed (Figure 6):
1. For a set of compounds, run kNN to obtain the model with the highest Q2LOO.
2. For each compound, calculate the improvement in Q2LOO upon its removal from the data set.
3. Remove the compound whose elimination from the data set results in the largest increase in Q2LOO. When a compound is removed from the data set, it is also removed from the list of nearest neighbors of all other compounds. In such cases, the removed compound will be replaced by the next-in-line nearest neighbor for the purpose of activity prediction.
4. If no compound could be removed from the data set based on the first model (i.e., for all compounds, their removal from the data set does not lead to an improved Q2LOO), repeat steps 2–3 for the second best model (which is built in a different descriptor space).
5. Repeat steps 1–4 above until Q2LOO is sufficiently high (stopping criterion).
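The greedy loop of steps 1–3 and 5 can be sketched as follows. For brevity, the sketch works in a single, fixed descriptor space (so step 4 is omitted), and the data, the Q2 threshold, and k are toy values chosen for the tiny example rather than values from the original work.

```python
import math

def q2_loo_knn(X, y, k=1):
    """LOO cross-validated Q2 for a kNN model (Equation 8)."""
    n = len(X)
    y_mean = sum(y) / n
    press = ss = 0.0
    for i in range(n):
        nn = sorted((j for j in range(n) if j != i),
                    key=lambda j: math.dist(X[i], X[j]))[:k]
        press += (y[i] - sum(y[j] for j in nn) / k) ** 2
        ss += (y[i] - y_mean) ** 2
    return 1.0 - press / ss

def remove_outliers(X, y, threshold, k=1):
    """Greedy kNN-OR sketch: repeatedly drop the compound whose removal gives
    the largest increase in Q2_LOO, until the threshold is exceeded."""
    X, y = list(X), list(y)
    while q2_loo_knn(X, y, k) < threshold and len(X) > k + 1:
        best_i = max(range(len(X)), key=lambda i: q2_loo_knn(
            X[:i] + X[i + 1:], y[:i] + y[i + 1:], k))
        X.pop(best_i)   # dropping it also drops it from all neighbor lists,
        y.pop(best_i)   # since neighbors are recomputed on the reduced set
    return X, y

# smooth activity trend plus one compound (activity 9.0) that breaks the
# similar property principle; kNN-OR should remove exactly that compound
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (0.25,)]
y = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 9.0]
X_clean, y_clean = remove_outliers(X, y, threshold=0.6)
```

Because the neighbor lists are rebuilt from the reduced set at every iteration, the "next-in-line nearest neighbor" replacement of step 3 happens automatically here.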
The above-described kNN-OR algorithm is "greedy" in nature, removing at each iteration the compound that leads to the largest improvement in Q2LOO without considering the possibility that a sub-optimal improvement at a given iteration may pay off later. Nahum et al. introduced a "look ahead" mechanism into outlier removal by treating it as a multi-objective optimization problem solved using a genetic algorithm (GA-kNN). The new method simultaneously minimizes the number of compounds to be removed and maximizes the kNN-derived Q2LOO. The multi-objective optimization is performed using the strength Pareto evolutionary algorithm 2 (SPEA2), which approximates the Pareto front for MOOPs. SPEA2 uses an external set (archive) for storing primarily non-dominated solutions. At each generation, it combines archive solutions with the current population to form the next archive, which is then used to produce offspring for the next generation. Each individual i in the archive At and the population Pt is assigned a raw fitness value R(i), determined by the number of its dominators in both the archive and the population. R(i) = 0 corresponds to a non-dominated individual, whereas a high R(i) value means that individual i is dominated by many individuals. These raw values are then used to rank the individuals for the purpose of selecting candidates for reproduction. However, the raw fitness value by itself may be insufficient for ranking when most individuals do not dominate each other. Therefore, additional information, based on the kth nearest neighbor density of the individuals, is incorporated to remove rank redundancy. The workflow of the SPEA2 algorithm is described below:
Input: M — offspring population size; T — maximum number of generations.
Output: A* — non-dominated set of solutions to the optimization problem.
Each solution to the multi-objective optimization problem specifies the set of descriptors and the number of neighbors used by the kNN algorithm, as well as the identity of the compounds considered as outliers. This information was coded by a three-component binary array (i.e., chromosome). The first part of the array encoded the number of neighbors using a binary representation. The second part described the descriptors' identities ("1" and "0" representing, respectively, selected and unselected descriptors for the current solution). The third part listed the compounds considered as outliers using the same representation as for the descriptors.
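This encoding can be sketched as follows. The sizes are illustrative assumptions (3 bits for the number of neighbors, 19 descriptor bits as in the DHFR set below, and 10 compound bits for a toy data set), not the values used in the original work.

```python
import random

# Illustrative sizes: 3 bits for k, 19 descriptors, 10 compounds.
K_BITS, N_DESC, N_CPDS = 3, 19, 10

def random_chromosome():
    """One solution as a flat binary array of the three parts."""
    return [random.randint(0, 1) for _ in range(K_BITS + N_DESC + N_CPDS)]

def decode(chrom):
    """Split the binary chromosome into its three components."""
    k = 1 + int("".join(map(str, chrom[:K_BITS])), 2)  # neighbors in 1..2**K_BITS
    descriptors = [i for i, bit in enumerate(chrom[K_BITS:K_BITS + N_DESC]) if bit]
    outliers = [i for i, bit in enumerate(chrom[K_BITS + N_DESC:]) if bit]
    return k, descriptors, outliers
```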
The resulting chromosomes were subjected to a multi-site crossover operator to produce new chromosomes. These contained a new combination of the number of neighbors, descriptors, and outliers. The new chromosomes were further mutated to increase the diversity of the solutions population and to prevent trapping in local minima. Mutations were performed on the entire chromosome and consequently affected the descriptors, the number of nearest neighbors, and the identity of outliers.
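The two genetic operators might look like this; the number of cut points and the mutation rate are illustrative assumptions. Because the operators act on the flat binary array, a single application can simultaneously change the number of neighbors, the descriptor selection, and the outlier set, as described above.

```python
import random

def multisite_crossover(p1, p2, n_sites=3):
    """Exchange segments between two parent chromosomes at n_sites random cuts."""
    assert len(p1) == len(p2)
    cuts = sorted(random.sample(range(1, len(p1)), n_sites))
    child1, child2, swap, prev = [], [], False, 0
    for cut in cuts + [len(p1)]:
        seg1, seg2 = p1[prev:cut], p2[prev:cut]
        child1 += seg2 if swap else seg1
        child2 += seg1 if swap else seg2
        swap, prev = not swap, cut   # alternate which parent feeds which child
    return child1, child2

def mutate(chrom, rate=0.05):
    """Flip each bit with probability `rate` (affects all three chromosome parts)."""
    return [1 - bit if random.random() < rate else bit for bit in chrom]
```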
For each new generation, raw fitness values were calculated for each individual based on the information encoded in its chromosome. This calculation was based on Q2 (which in turn depends on the descriptors selected, the number of neighbors selected, and the identity of outliers removed via the kNN algorithm) and on the number of outliers removed. The process was repeated until the termination criteria were met.
The performances of the two algorithms (kNN-OR and GA-kNN) were tested by removing outliers from three data sets of pharmaceutical relevance (logBBB, Factor 7 inhibitors, and dihydrofolate reductase [DHFR] inhibitors) and using the remaining compounds for two purposes: (1) to compare the internal diversities of the filtered data sets with those of the parent sets; (2) to build and validate QSAR models with two machine-learning methods, kNN and random forest. The results clearly demonstrated that data sets rationally filtered with the two new outlier removal algorithms were more internally diverse and supported QSAR models with better prediction statistics than data sets filtered by removing the same number of compounds using five other outlier removal methods (distance-based, distance k-based, one-class SVM, statistics-based, and random removal).
As an example, the results for the DHFR data set are presented below. This data set contained 673 compounds with known activity data. Each compound was characterized by 19 descriptors as the independent variables and by its biological activity in the form of an IC50 value as the dependent variable (Y). The compounds were subjected to the kNN-OR and GA-kNN algorithms using a predefined Q2 stopping criterion. For kNN-OR, this criterion was met after the removal of 87 compounds, leaving a total of 586 compounds for the subsequent diversity analysis and QSAR modeling. For the GA-kNN algorithm, this criterion was met after the removal of only 75 compounds. For comparison, 87 compounds were also removed using the GA-kNN algorithm and the other six methods considered in that work; for GA-kNN, the removal of 87 compounds led to a model with a Q2 value above the stopping criterion.
The internal diversity of the data set prior to and following compound removal was evaluated by calculating pairwise Euclidean distances between all compounds in the original descriptor space. The results are presented in Figure 7 and demonstrate that the removal of either 75 compounds (11.1% of the data set) by the GA-kNN algorithm or 87 compounds (12.9% of the data set) by the GA-kNN and kNN-OR algorithms did not change the distribution of distances, suggesting that the coverage of chemistry space was largely unaffected by the removal of outliers. This in turn implies that the applicability domain of models derived from the filtered data set will not be reduced. In contrast, all other methods (except for random removal) demonstrated truncation at long distances, implying reduced internal diversity.
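The distance distributions compared in Figure 7 can be computed as follows (a minimal stdlib-only sketch; the function names are ours):

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """All pairwise Euclidean distances between compounds in descriptor space."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]

def diversity_summary(points):
    """Minimum, mean, and maximum pairwise distance: a truncated maximum
    after filtering would indicate reduced internal diversity."""
    d = pairwise_distances(points)
    return min(d), sum(d) / len(d), max(d)
```

Comparing such summaries (or full histograms) before and after filtering shows whether outlier removal truncated the long-distance tail of the distribution.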
Compounds surviving the filtration process were divided into a modeling set (469 compounds) and a test set (117 compounds) and subjected to QSAR modeling using kNN and RF. The resulting models were evaluated by standard parameters. Modeling sets subjected to kNN were evaluated by the leave-one-out (LOO) cross-validated Q2LOO (Equation 8), while modeling sets subjected to RF were evaluated by the determination coefficient R2OOB calculated over the out-of-bag set (Equation 9). Test sets (external validation sets) were evaluated by the external explained variance Q2ext (Equation 10).
where Yexp is the experimental value, YOOB and Ypred are the predicted values (over the out-of-bag and predicted sets, respectively), and Ȳ is the mean of the experimental results over the modeling set (training set) compounds.
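Equations 8–10 share a common form, which can be sketched as follows (`explained_variance` is our hypothetical name for this shared computation):

```python
def explained_variance(y_exp, y_pred, y_train_mean):
    """Shared form of Equations 8-10: 1 - Σ(Yexp - Ypred)² / Σ(Yexp - Ȳ)².

    Fed with leave-one-out predictions it yields Q2LOO, with out-of-bag
    predictions R2OOB, and with external test-set predictions Q2ext.
    y_train_mean is the mean experimental value over the modeling set.
    """
    press = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))  # prediction error sum
    ss = sum((e - y_train_mean) ** 2 for e in y_exp)          # total sum of squares
    return 1.0 - press / ss
```

Perfect predictions give a value of 1, while predicting the training mean for every compound gives 0.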
The results are presented in Table 2 and demonstrate that the rational removal of outliers using the kNN-OR and the GA-kNN algorithms led to filtered data sets which produced both kNN-based and RF-based QSAR models with the best prediction statistics.
|Q2LOO (kNN)||Q2ext (kNN)||R2OOB (RF)||Q2ext (RF)|
While developed independently, all algorithms, together with additional tools, were incorporated into a machine-learning workflow for data mining and were used for the derivation of multiple QSAR models. The resulting workflow is depicted in Figure 8.
Below, we provide two examples taken from the fields of pharmaceutical and material sciences.
5.1. Blood–brain barrier permeation (logBBB) model
Blood–brain barrier permeability is an important parameter in drug design. Drugs targeting the central nervous system (CNS) are required to permeate through the blood–brain barrier. Conversely, drugs that do not affect the CNS should not penetrate the barrier due to potential side effects. Blood–brain barrier permeability is typically expressed as logBBB, with positive and negative values indicating, respectively, permeating and non-permeating compounds. Since the experimental determination of logBBB values is resource-consuming, multiple QSAR models have been developed to predict this property [19, 31, 41].
A data set of 152 compounds with known logBBB values was compiled from the literature, manually curated, and characterized by 15 descriptors [19, 31]. This data set was subjected to the kNN-OR algorithm (Section 4.3) using a predefined Q2 stopping criterion. The application of this criterion led to the removal of 19 compounds, leaving 133 compounds for QSAR modeling. The stepwise Q2 values upon compound removal are presented in Figure 9.
Compounds surviving the filtration procedure were divided into a modeling set (106 compounds) and a validation (test) set (27 compounds) using the representativeness function described in Section 4.1. QSAR models were built on the modeling sets using either the kNN (Section 4.2) or the random forest (RF) algorithm, and the resulting models were tested on the validation sets. Model performances were evaluated with Equations 8–10; for both kNN and RF, the resulting values were similar to those obtained by other logBBB QSAR models.
Similar results were obtained when this data set was filtered using the GA-kNN algorithm. In this case, the stopping criterion was met after the removal of only 13 compounds. As before, compounds surviving the filtration procedure were divided into a modeling set (111 compounds) and a test set (28 compounds) and subjected to QSAR modeling, leading to comparable statistics for kNN and to R2OOB = 0.67 and Q2ext = 0.65 for RF.
5.2. Photovoltaic activities of solar cells
The rise in demand for clean energy is likely to increase the importance of solar cells as future energy sources. In particular, cells entirely made of metal oxides have the potential to provide clean and affordable energy if their power conversion efficiencies are improved. Designing solar cells with improved photovoltaic properties could be greatly assisted by the application of optimization algorithms as described in this chapter.
Yosipof et al. used the workflow described in Figure 8 to build predictive kNN-based models for a library of 169 solar cells made of a combination of titanium and copper oxides (TiO2|Cu2O). In this case, the dependent variables which had to be predicted were the open-circuit voltage (VOC), the short-circuit current (JSC), and the internal quantum efficiency (IQE), and the independent variables (descriptors) consisted of the thicknesses of the copper oxide and titanium oxide layers, the ratio between the two layers' thicknesses, the experimentally measured bandgap, and the maximum theoretically calculated photocurrent. In this case, subjecting the cells to the kNN-OR algorithm did not lead to the removal of any outliers and consequently, the entire library could be used for QSAR modeling. The library was divided into a modeling (training) set and a validation test set using the algorithm described in Section 4.1, and the predictive QSAR models were derived with the kNN algorithm described in Section 4.2. The results are presented in Table 3.
|End point||Q2LOO||Q2ext||Descriptors selected|
|JSC||0.92||0.92||The thickness of the titanium oxide layer and the thickness of the copper oxide layer|
|VOC||0.78||0.89||The thickness of the titanium oxide layer and the thickness of the copper oxide layer|
|IQE||0.91||0.87||The thickness of the titanium oxide layer and the thickness of the copper oxide layer|
Overall, the resulting models demonstrated good prediction statistics suggesting that they are likely to be useful for the design of new and improved solar cells. In addition, the feature selection procedure inherent to the kNN algorithm highlighted the importance of the metal oxide layer thickness in controlling the photovoltaic properties (last column in Table 3).
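The kNN prediction step underlying these models (Section 4.2) can be sketched as follows. This is a minimal unweighted variant over the selected descriptors; an actual implementation might weight neighbours by distance.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the property of `query` as the mean over its k nearest
    neighbours in the selected-descriptor space (simple unweighted kNN)."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    neighbours = [y for _, y in dists[:k]]
    return sum(neighbours) / len(neighbours)
```

In the full workflow, the descriptor subset fed to this predictor is itself chosen by the stochastic optimizer so that neighbouring cells in that space have similar measured properties (the similar properties principle).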
In this chapter, we introduced several tools based on global stochastic optimization for chemoinformatics and material-informatics applications in three areas. To select representative subsets from within parent data sets, we described a new representativeness function and its optimization either alone or simultaneously with the MaxMin function. The two resulting algorithms were found to outperform previously reported subset selection methods.
For the derivation of predictive, nonlinear machine-learning models, we reviewed the kNN algorithm, which searches the space of available descriptor combinations to identify those that best satisfy the similar properties principle. Descriptor spaces compatible with this principle give rise to models with good prediction statistics.
Finally, we introduced two new algorithms for the identification and subsequent removal of outliers based on the kNN method. The kNN-OR algorithm iteratively removes from the parent data set the compounds (outliers) whose removal best improves model performance. The GA-kNN algorithm simultaneously optimizes model performance together with the number of outliers. This algorithm is usually able to remove a smaller number of outliers while still maintaining good model performance. Retaining in the data set submitted to QSAR modeling as many compounds as possible is likely to increase the applicability domain of the resulting model. The new algorithms were found to outperform other outlier removal methods when tested on three data sets.
Moreover, the new algorithms, together with additional tools, were combined into a machine-learning workflow and used for the derivation of predictive QSAR models.
The algorithms presented in this chapter are likely to be useful for multiple applications in the fields of chemoinformatics and material-informatics.