Alkylated phenols used in this study and their experimental linear retention indices.
In this study, 29 volatile alkylated phenols were subjected to quantitative structure retention relationship (QSRR) studies. Two- and three-dimensional models (2D- and 3D-QSRR) were developed for this series: a 2D-QSRR analysis of the retention property using stepwise multiple linear regression (MLR) and a 3D-QSRR analysis using partial least squares (PLS). To construct the 2D-QSRR model, 28 descriptors were calculated for the 29 molecules using the ChemOffice and ChemSketch software. The 3D-QSRR models were constructed using the comparative molecular field analysis (CoMFA) method. The models were used to predict the linear retention indices of the test set compounds, and agreement between the experimental and predicted values was verified. The statistical results indicate that the predicted values are in good agreement with the experimental results (r2 = 0.980, r2CV = 0.977 for MLR and r2 = 0.998, r2CV = 0.959 for CoMFA). To validate the predictive power of the resulting models, the external validation correlation coefficient was calculated; besides measuring predictive performance, this coefficient indicates favorable stability for the two methods (rtest = 0.938 for MLR and rtest = 0.955 for CoMFA).
- quantitative structure retention relationship
- linear retention indices
- multiple linear regression
- molecular field analysis
- external validation
- alkylated phenols
Phenols are widely present in the environment as building blocks for plants. They are formed naturally from the decomposition of leaves and wood as well as through human activity such as water purification processes. Alkylphenols are a family of organic compounds obtained by the alkylation of phenols. The term is usually reserved for major industrial compounds such as propylphenol, amylphenol, heptylphenol, octylphenol, nonylphenol, dodecylphenol, and other long-chain carbon compounds. Methylphenols and dimethylphenols are also alkylphenols, but are more often referred to by their specific names, cresols and xylenols, respectively. The alkylated phenols adsorb readily on solid materials, and some are toxic to fish and other forms of aquatic life. Very low concentrations of these molecules have unfavorable effects on the taste and odor of water and fish.
All phenolic compounds can be considered as important parameters of the organoleptic (color, flavor, and aroma) and nutritional qualities of food products. The phenolic compounds which participate in the vegetable aroma are relatively simple volatile compounds whose odors can be pleasant or unpleasant. Vanilla, for example, is the most popular aroma in the world, and its production is estimated at 1500 tons per year. Approximately 250 compounds are responsible for vanilla aroma and among these are about 20 phenolic compounds, the most abundant of which are vanillin, p-hydroxybenzaldehyde, and vanillic acid. The spices we use to enhance taste and flavor of food contain volatile compounds characterized by the presence of a methoxyl group. 4-vinyl guaiacol is responsible for the pleasant odors that occur during the manufacture and storage of citrus juices (orange and grapefruit in particular). This compound is formed from the degradation of ferulic acid, and the quality of the orange juice aroma is directly related to changes in free ferulic acid and 4-vinyl guaiacol contents. These two compounds are also produced during the thermal degradation of lignin. With their derivatives (4-methyl guaiacol, 4-ethyl guaiacol, vanillin, vanillic acid, etc.), they are at the origin of the aroma developed by the smoking techniques used in meat and fish conservation.
Some alkylated phenols represent another group of compounds with a constantly weak odor. In addition, some individual odorants in this group have been described in several studies as having various sensory properties. Because of their obviously high odor potency, the odor thresholds of the alkylated phenols have been extensively evaluated.
The multidimensional quantitative structure-activity/property relationship (multidimensional-QSAR/QSPR) analysis is a computational method used to predict biological activities or chemical properties of existing or hypothetical chemical compounds. With continued development, multidimensional-QSAR/QSPR analyses have made notable achievements in diverse fields, such as toxicology and medicinal chemistry [8, 9]. Thanks to the fast progress of computer science and theoretical study, molecular information (chemical descriptors) of compounds can now be obtained quickly and accurately by computation. The chemical descriptors used in the construction of QSAR/QSPR models can increase interpretability and can be used to predict the activity/property of new molecules.
The release of odorant molecules from a solid or liquid medium and their passage into the vapor phase is the first step before any perception, which arises from the activation of the olfactory receptors present in the nasal cavity followed by a series of complex neurophysiological reactions that encode a particular smell. For this reason, in this study, a series of 29 volatile alkylated phenols, including monoalkylated phenols and di- and trimethylphenols, were subjected to quantitative structure retention relationship (QSRR) studies, and two- and three-dimensional models (2D- and 3D-QSRR) were developed for these phenol-based odorants. We constructed the 2D-QSRR model using 28 descriptors. The 3D-QSRR models were constructed using comparative molecular field analysis (CoMFA), a tool that collects and interprets complex data from a series of bioactive molecules to construct computational models that correlate chemical properties with biological activity/property. Through this approach, molecular features responsible for the retention property of the investigated compounds (alkylated phenols) were identified using the CoMFA contour plots. Furthermore, the statistical consistency of the developed models was evaluated on the basis of their correlation ability for the training set, as well as their predictive power for an external test set. We accordingly propose quantitative models, using stepwise multiple linear regression (MLR) for the 2D-QSRR analysis and partial least squares (PLS) for the 3D-QSRR model, and we try to interpret the retention property of the compounds relying on the multidimensional-QSRR analyses.
2. Material and methods
2.1 2D-QSRR study
2.1.1 Data set
The reliability of a 2D-QSRR analysis depends on the available data set, the method of analysis, and the validation. In the present analysis, a series of 29 alkylated phenols with measured linear retention indices was taken from the literature. As reported there, high-resolution GC/O (HRGC/O) analyses were performed with a type 5160 gas chromatograph (Carlo Erba) using a DB-1701 column, as demonstrated by Czerny et al. To carry out the 2D-QSRR analysis, 24 molecules were selected to build the quantitative model (training set), and 5 compounds, selected randomly and not used in the training set, served to test the performance of the proposed model (test set). Table 1 shows the studied compounds and their experimental linear retention indices (LRI).
2.1.2 Molecular descriptors generation
Twenty-eight molecular descriptors were calculated using the ACD/ChemSketch and ChemOffice programs [15, 16] to examine the correlation between these descriptors and the retention property of the studied compounds and to develop a linear model. The descriptors used in this study are displayed in Table 2.
|ChemOffice||Melting point T (Kelvin); molecular weight MW (g/mol); critical temperature CT (Kelvin); heat of formation H° (kJ mol−1); boiling point TB (Kelvin); Gibbs free energy G (kJ mol−1); critical pressure CP (bar); Connolly solvent-excluded volume V (Å³); shape coefficient I; total connectivity TC; Log P; number of rotatable bonds NRB; Wiener index W; number of H-bond acceptors NHA; molecular topological index MTI; number of H-bond donors NHD; partition coefficient PC; Balaban index J; Henry’s law constant KH; polar surface area PSA (Å²); total valence connectivity TVC; sum of valence degrees SVD|
|ChemSketch||Percent ratios of hydrogen, oxygen, and carbon atoms (H%; O%; C%); surface tension γ (dyne/cm); index of refraction n; density d|
2.1.3 Statistical analysis
To explain the structure-property relationship, 28 descriptors were calculated for the 29 molecules using the ChemOffice and ChemSketch software and subjected to stepwise multiple linear regression (MLR) as available in the SPSS software. The stepwise MLR model was generated to predict the retention property values Log(LRI). The equation was assessed by the correlation coefficient (r), the root mean square of the errors (RMSE), Fisher's F-statistic (F), and the significance level (P-value).
The final stage of this 2D-QSRR analysis consists of statistical validation in order to assess the significance of the model and hence its ability to predict the property of other compounds. In this chapter, the model was validated internally by cross-validation, a statistical technique in which different portions of the chemicals are iteratively held out from the training set used for model development. The leave-one-out procedure is used here: it sequentially removes one compound from the training set of 24 compounds, builds a 2D-QSRR model on the remaining 23 molecules, and predicts the removed molecule with that model. This process is repeated 24 times so that the retention property of every training compound is predicted once.
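The leave-one-out loop described above can be sketched in a few lines. This is a minimal illustration and not the SPSS procedure used in the chapter: it assumes a single-descriptor linear model, the descriptor and Log(LRI) values are synthetic placeholders, and the function name `loo_q2` is hypothetical.

```python
import numpy as np

def loo_q2(x, y):
    """Leave-one-out cross-validated r2 for the model y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    press = 0.0  # predictive residual sum of squares
    for i in range(n):
        keep = np.arange(n) != i                 # hold out compound i
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        press += (y[i] - (intercept + slope * x[i])) ** 2
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot

# Synthetic placeholder data (NOT the values from Table 1 or Table S1).
rng = np.random.default_rng(0)
volume = np.linspace(90.0, 160.0, 24)            # hypothetical descriptor V
log_lri = 2.9 + 0.002 * volume + rng.normal(0.0, 0.005, 24)

print(f"r2CV (LOO) = {loo_q2(volume, log_lri):.3f}")
```

Each pass refits the model on 23 compounds, so the held-out compound never influences its own prediction.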
2.2 3D-QSRR study
2.2.1 Minimization and alignment
Chemical structures of the studied compounds were sketched with the sketch module in SYBYL and minimized using the Tripos force field with Gasteiger-Hückel charges and the conjugate gradient method, with a gradient convergence criterion of 0.01 kcal/mol. Simulated annealing on the energy-minimized structures was performed with 20 cycles.
Molecular alignment is one of the most sensitive parameters in 3D-QSRR methods. In this work, all studied compounds were aligned on the common core (compound no. 1), using the simple alignment method in SYBYL. Compound no. 18, which has the highest Log(LRI), was used as template (Figure 1).
2.2.2 CoMFA studies
Based on the molecular alignment, CoMFA studies were performed to analyze the specific contributions of steric and electrostatic effects. These interactions were calculated using the Tripos force field with a distance-dependent dielectric constant at all intersections of a regularly spaced (2 Å) grid, taking an sp3 carbon atom as the steric probe and a +1 charge as the electrostatic probe. The cutoff was set to 30 kcal/mol. With standard options for scaling of variables, the regression analysis was carried out using the fully cross-validated partial least squares (PLS) method (leave one out). The final, non-cross-validated conventional model was then developed with the optimum number of components to yield the non-cross-validated r2 value.
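The field calculation described above can be illustrated with a simplified sketch. It is not the SYBYL implementation: the Lennard-Jones parameters `r_min` and `eps`, the 4 Å grid margin, and the symmetric clipping of both fields are assumptions made for illustration, and the two-atom "molecule" is a placeholder. Only the 2 Å spacing, the +1 probe charge, the distance-dependent dielectric, and the 30 kcal/mol cutoff come from the text.

```python
import numpy as np

COULOMB_K = 332.06   # kcal*Å/(mol*e^2), standard electrostatic conversion factor
CUTOFF = 30.0        # kcal/mol, energy cutoff stated in the text

def make_grid(coords, spacing=2.0, margin=4.0):
    """Regular lattice enclosing the molecule (margin is an assumed value)."""
    lo = coords.min(axis=0) - margin
    hi = coords.max(axis=0) + margin
    axes = [np.arange(a, b + 1e-9, spacing) for a, b in zip(lo, hi)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    return np.column_stack([gx.ravel(), gy.ravel(), gz.ravel()])

def comfa_like_fields(coords, charges, grid,
                      r_min=3.4, eps=0.1, probe_charge=1.0):
    """Steric (6-12 Lennard-Jones) and electrostatic energies at each grid point.

    r_min and eps are hypothetical probe-atom parameters, not Tripos values.
    """
    d = np.linalg.norm(grid[:, None, :] - coords[None, :, :], axis=2)
    d = np.maximum(d, 1e-6)                      # avoid division by zero
    ratio = r_min / d
    steric = (eps * (ratio ** 12 - 2.0 * ratio ** 6)).sum(axis=1)
    # Distance-dependent dielectric (epsilon = r) turns 1/r into 1/r^2.
    electro = (COULOMB_K * probe_charge * charges / d ** 2).sum(axis=1)
    return (np.clip(steric, -CUTOFF, CUTOFF),
            np.clip(electro, -CUTOFF, CUTOFF))

# Placeholder two-atom fragment with hypothetical partial charges.
coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
charges = np.array([-0.2, 0.2])
grid = make_grid(coords)
steric, electro = comfa_like_fields(coords, charges, grid)
print(grid.shape, steric.max(), electro.min())
```

In a CoMFA table, the steric and electrostatic values at every grid point become the columns of the descriptor matrix, one row per aligned molecule.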
2.2.3 Partial least squares analysis (PLS) and validation
The 3D-QSRR models were generated using a training set of 24 molecules. The predictive power of the resulting models was evaluated using a test set of five molecules (Table 1), selected randomly. The PLS analysis used to construct the 3D-QSRR models is an extension of multiple regression analysis in which the initial variables are replaced by an optimum number of components formed from their linear combinations. PLS with the leave-one-out (LOO) cross-validation procedure was used in this work to determine the optimal number of components, based on the cross-validated coefficient rCV for the training set of 24 molecules. The final (non-cross-validated) analysis was carried out using the optimum number of components obtained from the cross-validation analysis to get the correlation coefficient r2 [27, 28].
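As an illustration of how the optimum number of components can be chosen by LOO, the sketch below implements a basic single-response PLS (NIPALS PLS1) with NumPy. It is an assumption-laden stand-in for the SYBYL PLS module: the function names are hypothetical, and the small random matrix replaces the real CoMFA field table, which has many more columns than rows.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; returns (coefficients, column means, response mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector
        t = Xr @ w                      # scores
        tt = t @ t
        p = Xr.T @ t / tt               # X loadings
        c = (yr @ t) / tt               # y loading
        Xr = Xr - np.outer(t, p)        # deflate X
        yr = yr - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

def pls_loo_q2(X, y, n_components):
    """Leave-one-out r2CV for a given number of PLS components."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        model = pls1_fit(X[keep], y[keep], n_components)
        press += (y[i] - pls1_predict(model, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic placeholder data: 24 "molecules", 3 "field" columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(24, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.01, 24)

scores = {a: pls_loo_q2(X, y, a) for a in (1, 2, 3)}
best = max(scores, key=scores.get)
print("optimum number of components:", best)
```

The final model would then be refitted on all 24 training compounds with `best` components to obtain the non-cross-validated r2.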
3. Results and discussions
3.1 2D-QSRR study
3.1.1 Data set for analysis
A 2D-QSRR study was carried out for a series of 29 alkylated phenols, as indicated above, to determine a quantitative relationship between the structure and the retention property. The values of the 28 descriptors are shown in Table S1 (in Supplementary Material).
3.1.2 Stepwise multiple linear regression MLR
The stepwise multiple linear regression (MLR) procedure, based on the forward selection and backward elimination method (with a critical probability of P-value < 0.05 for all descriptors and for the complete model), was employed to determine the best regression model.
The 2D-QSRR model built using stepwise MLR is represented by the following equation:
N = 24; r = 0.990; r2 = 0.980; RMSE = 0.008; F = 1085.981; P < 0.0001.
In this equation, V is the Connolly solvent-excluded volume, N is the number of compounds, r is the correlation coefficient, r2 is the coefficient of determination, RMSE is the root mean square of the errors, F is the Fisher’s criterion, and P is the significance level.
It is observed that the coefficient of correlation r is high, and RMSE is low, which makes it possible to indicate that the model is reliable. A P value much smaller than 0.05 indicates that the regression equation is statistically significant; thus, we can conclude, with confidence, that the model provides a significant amount of information [29, 30].
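As a consistency check on the reported statistics, the F value of a linear regression can be recovered from r2, the number of compounds, and the number of descriptors. The sketch below assumes the usual definition F = (r2/p) / ((1 − r2)/(n − p − 1)); the function name `f_from_r2` is hypothetical.

```python
def f_from_r2(r2, n, p):
    """Fisher F-statistic of a linear model with p descriptors and n compounds."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# Rounded values reported for the stepwise MLR model: r2 = 0.980, N = 24, one descriptor (V).
print(round(f_from_r2(0.980, 24, 1), 1))
```

With the rounded r2 = 0.980 this gives about 1078, close to the reported F = 1085.981; the small difference is what one would expect from rounding r2 to three decimals.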
3.1.3 Internal validation (cross-validation)
The 2D-QSRR model expressed by the stepwise MLR equation is validated by the r2CV value obtained using the leave-one-out (LOO) procedure. An r2CV value greater than 0.5 is the basic condition for qualifying a 2D-QSRR model as valid. The model performed well, with an r2CV value of 0.977 using the descriptor (V) selected by the stepwise MLR.
3.1.4 External validation
The model built on the training set of alkylated phenols was used to predict the retention property values (Log(LRI)) of the remaining five molecules. The results obtained with the stepwise MLR model are sufficient to confirm its performance, as shown by the test carried out with the five compounds (rtest = 0.938; r2test = 0.880).
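The external validation coefficient can be computed as the Pearson correlation between observed and predicted test-set values. The sketch below is a minimal illustration; the five value pairs are hypothetical placeholders, not the test-set data of this chapter, and `external_r` is an assumed helper name.

```python
import numpy as np

def external_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(observed, predicted)[0, 1])

# Hypothetical Log(LRI) values for five test compounds.
obs = [3.10, 3.15, 3.20, 3.22, 3.30]
pred = [3.11, 3.14, 3.21, 3.24, 3.28]

r_test = external_r(obs, pred)
print(f"rtest = {r_test:.3f}; r2test = {r_test ** 2:.3f}")
```

The same relation links the two reported numbers: squaring rtest = 0.938 gives approximately 0.880.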
3.1.5 Domain of applicability
Evaluating the applicability domain of the 2D-QSRR model is an important step in establishing that the model is reliable for predictions within the chemical space for which it was developed. In this chapter, we used the leverage approach. The leverage h_i of a given chemical compound is defined as follows:
$$h_i = x_i \,(X^{T}X)^{-1}\, x_i^{T}$$
where x_i is the descriptor row vector of the query compound and X is the descriptor matrix of the training set compounds used to develop the model. As a prediction tool, the warning leverage h* is defined as follows:
$$h^{*} = \frac{3\,(P+1)}{n}$$
where n is the number of training compounds and P is the number of descriptors in the model. With n = 24 and P = 1, this gives h* = 0.250.
From the Williams plot (Figure 3), it is clear that all the compounds in the data set are within the applicability domain of the model (the warning leverage limit is 0.250) except one training compound (no. 18). This compound has a leverage value greater than the warning h* value and could be a high-leverage compound influencing the performance of the model. However, its standardized residual is very low and within the established limit. As a result, this compound can be considered influential in fitting the model but not necessarily an outlier to be deleted from the training data set, and thus the model can be applied with confidence within the defined applicability domain.
For all the compounds in the training and test sets, the standardized residuals are smaller than three standard deviation units (±3σ) except for one test compound (no. 29). Thus, compound no. 29 can be regarded as an outlier. Because it belongs to the test set, there is no need to remove it from the data set.
Therefore, the linear retention index values (Log(LRI)) predicted by the developed stepwise MLR model are reliable.
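The leverage check behind a Williams plot can be sketched as follows. This is an illustrative computation under stated assumptions: an intercept column is added to the descriptor matrix (which is what makes the P + 1 term in the warning leverage appropriate), the descriptor values are synthetic, and the function names are hypothetical.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h_i = x_i (X^T X)^-1 x_i^T for each row of X_query."""
    A = np.column_stack([np.ones(len(X_train)), X_train])   # assumed intercept column
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.inv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, xtx_inv, Q)

def warning_leverage(n, p):
    """Warning limit h* = 3(P + 1)/n."""
    return 3.0 * (p + 1) / n

# Synthetic single-descriptor training set (placeholder for V).
V_train = np.linspace(90.0, 160.0, 24).reshape(-1, 1)
h = leverages(V_train, V_train)
h_star = warning_leverage(n=24, p=1)

outside = np.where(h > h_star)[0]
print("h* =", h_star, "| compounds above h*:", outside)
```

A compound is flagged only when its leverage exceeds h*; whether it is also an outlier is then judged separately from its standardized residual, as discussed above.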
3.2 3D-QSRR study
3.2.1 Molecular alignment
All compounds were aligned on the basis of the common structure (compound no. 1). The alignment of the training and test set compounds using the Distill module is shown in Figure 4.
3.2.2 CoMFA result
The 3D-QSRR model was obtained from the CoMFA analysis, and its statistical parameters are listed in Table 3. The Log(LRI) values predicted by the CoMFA model and the observed values are given in Table 4. The correlation between predicted and observed Log(LRI) values is illustrated in Figure 5.
|No||Log(LRI) (obs.)||Log(LRI) (calc.)|
We use cross-validation as an internal test of the quality of the PLS models. To evaluate the predictive power of the QSRR model (external test), the Log(LRI) values of the remaining five molecules (test set) are predicted with the CoMFA model built on the 24 training compounds (Table 3).
where r2CV is the square of the LOO cross-validation (CV) coefficient; r2 is the square of the non-CV coefficient; SE is the standard error of prediction; F is the F-test value; N is the optimum number of components; and r2test is the external validation correlation coefficient for test set compounds.
The 3D-QSRR model gave good statistical results, with r2 = 0.998 for the CoMFA model. This approach also shows good predictive capability (r2CV = 0.959). The model was able to establish a satisfactory relationship between the molecular descriptors and the linear retention indices of the studied compounds. The results obtained by the CoMFA analysis are sufficient to confirm the performance of the model, as shown by the test carried out with the five compounds (Table 3).
3.3 Comparison between 2D- and 3D-QSRR results and design of novel alkylated phenols
Aiming to provide a comparison among the stepwise MLR and CoMFA models, Table 5 lists the main statistical indicators for 2D- and 3D-QSRR models.
|Model||Training set||Cross-validation (LOO)||Test set|
|Stepwise MLR||r = 0.990; r2 = 0.980||rCV = 0.988; r2CV = 0.977||rtest = 0.938; r2test = 0.880|
|CoMFA||r = 0.999; r2 = 0.998||rCV = 0.979; r2CV = 0.959||rtest = 0.955; r2test = 0.913|
A comparison of the quality of the stepwise MLR and CoMFA models (Table 5) shows that both approaches have good predictive capability and give good results. The stepwise MLR and CoMFA models were able to establish a suitable relationship between the chemical descriptors and the linear retention indices of the studied molecules.
Multidimensional-QSRR correlates the retention property with the physicochemical and structural descriptors of a series of molecules. It is routinely used to predict the retention of new molecules and to propose molecules with preferred properties. The constructed models can be used to design new alkylated phenols with higher or lower property values (Log(LRI)).
In this way, we can design new compounds by adding suitable substituents and calculate their property using the stepwise MLR equation. The stepwise MLR equation indicates a positive correlation with the Connolly solvent-excluded volume (V).
The obtained results show that, to increase the property of alkylated phenols, the Connolly solvent-excluded volume (V) of these molecules should be increased; conversely, to decrease the property, the value of the descriptor (V) should be decreased. In both cases, this is done by adding suitable substituents and calculating the property with the regression equation. This study is a first step toward encoding the particular odor of this group of molecules; it is to be followed by a molecular docking study to understand how this type of molecule activates the olfactory receptors present in the nasal cavity.
We can also use the results of the 3D-QSRR analysis to design new alkylated phenols with higher or lower retention property values (Log(LRI)). The CoMFA contour plots identified the molecular fragments, functional groups, and physicochemical properties that correlate strongly with the linear retention indices of this series. The CoMFA steric and electrostatic contours are shown in Figure 6.
The steric interaction is represented by green and yellow contours, while the electrostatic interaction is denoted by red and blue contours. The green region around the 2, 3, 4, 5, and 6 positions (the carbon to which the -OH group is bonded is counted as the first position) (Figure 6a) indicates that bulky groups are favored and might increase the property. This explains why the property of the alkylated phenols bearing a group bigger than an Et group (compounds 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, and 18) is higher than that of the other compounds. It also explains why, for the alkylated phenols bearing a group smaller than a Pr group, the property of the dimethylated phenols is higher than that of the monoalkylated phenols, and the property of the trimethylated phenols is higher than that of both the dimethylated and the monoalkylated phenols. The green region is larger around position 4 than around the other positions, suggesting that sterically bulky groups are required at this position to reach the green area and thereby increase the property. This can further explain why compounds 14, 15, 16, 17, and 18 have a higher property than all other compounds.
The CoMFA electrostatic contour plot is displayed in Figure 6b. A blue contour indicates that substituents should be electron deficient, and a red contour indicates that substituents should be electron rich. The blue contour near the 2, 3, 4, and 5 positions (Figure 6b) indicates that electron-donating substituents (such as alkyl groups) are beneficial for the property in this area. The electrostatic contour map displays a region of red contours next to the 1 and 6 positions, indicating that groups with negative charges may increase the property.
All these findings may be used to design improved compounds with higher or lower retention property, as observed in the CoMFA maps, by adding suitable substituents.
In this study, 2D- and 3D-QSRR analyses were used to predict the linear retention indices of a set of alkylated phenols. The multidimensional-QSRR models gave good statistical results in terms of rCV and r values. The stepwise MLR and CoMFA models showed high internal and external consistency, as verified using different validation methods to evaluate their statistical quality. External validation using a test series verified the capacity of these models to estimate the linear retention indices of alkylated phenols with appropriate precision. In addition, the stepwise MLR equation and the CoMFA contour plots identify the physicochemical properties, functional groups, and molecular fragments that correlate strongly with the linear retention indices of the studied compounds. The highlighted features are important information for delineating the chemical space, which can be used to design new volatile alkylated phenols. This study is a first step toward encoding the particular odor of this group of molecules; it is to be followed by a molecular docking study to understand how this kind of chemical compound activates the olfactory receptors present in the nasal cavity.
We are grateful to the “Association Marocaine des Chimistes Théoriciens” (AMCT) for its pertinent help concerning the programs.
Conflict of interest
“The authors declare that they have no competing interests.”
Table S1. Molecular descriptors computed by ChemOffice and ChemSketch software.