Summary of liquefaction-induced settlement database.
Assessing liquefaction-induced settlement is considered one of the major challenges in geotechnical earthquake engineering. This paper presents random forest (RF) and reduced error pruning tree (REP Tree) models for predicting settlement caused by liquefaction. Standard penetration test (SPT) data were obtained for five separate borehole sites near the Pohang earthquake epicenter. The data used in this study comprise four features, namely depth, unit weight, corrected SPT blow count and cyclic stress ratio. The available data are divided into two parts: a training set (80%) and a test set (20%). The output of the RF and REP Tree models is evaluated using statistical parameters including the coefficient of correlation (r), mean absolute error (MAE), and root mean squared error (RMSE). The applications of these two approaches to predicting liquefaction-induced settlement are compared and discussed. The statistical metrics for the liquefaction-induced settlement dataset demonstrate that the RF model achieved comparatively better and more reliable results.
- random forest
- REP tree
The evaluation of liquefaction-induced settlements has become an extremely significant issue for the foundations of buildings, nuclear power plants, and earth dams on sandy soil deposits. When saturated sand deposits are shaken during an earthquake, pore water pressures are known to develop, contributing to liquefaction or loss of shear strength. The pore water pressure then begins to dissipate, primarily towards the ground surface, followed by a change in the volume of the soil deposits, which is manifested on the ground surface as settlement. Settlements caused by liquefaction are conventionally predicted using analytical or numerical methods.
Tokimatsu and Seed developed a technique for predicting post-liquefaction ground settlements of completely liquefied saturated sands based on relationships between volumetric strain, SPT N-value and cyclic stress ratio (CSR), derived from an experimental relationship between relative density, volumetric strain, and maximum shear strain. Ishihara and Yoshimine used an alternative approach to estimate ground settlements based on the safety factor, by means of the maximum shear strain, which is an essential factor affecting the post-liquefaction volumetric strain. The liquefaction-induced settlement during an earthquake can be identified if the safety factor and relative density are established. However, this simplified method relies only on a relation between relative density, the factor of safety against liquefaction (FS) and volumetric strain (εv) to quantify the settlement of a site. Because the factor of safety is obtained by combining earthquake intensity and SPT N-value through empirical equations, measurement errors can propagate and lead to significant prediction error.
The analytical method used to assess liquefaction-induced settlements is based on an effective stress analysis of the dynamic response, which accounts for the generation and dissipation of excess pore water pressures. When it is used to evaluate post-liquefaction settlements in saturated sand deposits, the volume compressibility coefficient of the sand is required, which is very difficult to determine for the liquefied sand layer. Shamoto et al. suggested a simplified approach for estimating liquefaction-induced settlements of saturated sand deposits, based on the experimental evidence that there is an almost linear relationship between a function of the void ratio and the logarithm of the maximum shear strain induced during cyclic loading.
In numerical analysis, earthquake-induced liquefaction in the free-field may be interpreted as a 1D phenomenon occurring along a vertical soil column in which seismic-induced cyclic shear and compressive forces increase the pore pressure and hence cause a reduction in the transient soil strength and stiffness. Reconsolidation arises in the soil after liquefaction due to the dissipation of the excess pore pressure (∆u) by means of water flow, resulting in the vertical settlement of the ground surface .
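As a rough illustration of this 1D view, the surface settlement can be approximated by summing the post-liquefaction volumetric strain of each layer multiplied by its thickness. The sketch below assumes a layered column with known strains; the function name and the layer values are hypothetical and are not taken from the Pohang database.

```python
def reconsolidation_settlement(layers):
    """Sum volumetric strain times thickness over a 1D soil column.

    layers: iterable of (thickness_m, volumetric_strain_percent) tuples.
    Returns the surface settlement in millimetres.
    """
    total_m = sum(h * (eps_v / 100.0) for h, eps_v in layers)
    return total_m * 1000.0


# Hypothetical three-layer column (illustrative values only).
column = [(2.0, 1.5), (3.0, 2.0), (1.0, 0.5)]
settlement_mm = reconsolidation_settlement(column)
print(f"Estimated settlement: {settlement_mm:.1f} mm")
```

Layers that do not liquefy would simply contribute a volumetric strain of zero in such a calculation.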
Park et al. established a simple and sustainable method for predicting liquefaction-induced settlement using an artificial neural network (ANN). Tang et al. found that the predictions of ANN and Bayesian belief network (BBN) models are better than those of the Ishihara and Yoshimine simplified approach.
The Pohang earthquake (Mw = 5.4), which hit the Heunghae Basin around Pohang city, caused liquefaction-induced damage in the form of settlement and lateral displacement. In this study, liquefaction-induced settlement is considered as an illustrative case. Several efforts have been made since the event to evaluate the post-earthquake damage [7, 8, 9, 10, 11]. Nevertheless, the liquefaction-induced settlement has received little attention. Settlement caused by liquefaction is commonly calculated by taking into account various factors and following several sophisticated analytical and numerical procedures. In most cases, however, some of the required parameters cannot be obtained in the field. The main purpose of this study is to evaluate liquefaction-induced settlement based on a database of field observations. To achieve this purpose, the random forest and REP Tree techniques are used to develop two new models for the evaluation of liquefaction-induced settlement. Although these techniques have been successfully applied in many domains, the literature indicates that their application in geotechnical earthquake engineering is limited.
The remainder of this chapter is organized as follows: Section 2 briefly describes the data acquisition for the liquefaction-induced settlement calculation. Section 3 presents the methodology used to evaluate settlement caused by earthquakes, with an overview of the random forest and REP Tree techniques. Section 4 presents the development of the liquefaction-induced settlement models. Detailed results of the proposed models, assessed with performance evaluation measures, are discussed in Section 5, followed by conclusions in Section 6.
2. Data acquisition
In this study, the database collected by Park et al. from the Integrated DB Centre of National Geotechnical Information, Korea, together with the UBCSAND constitutive effective stress model, was used to develop the predictive models. SPT data were obtained for five different borehole sites near the epicenter of the Pohang earthquake. The input parameters for the RF and REP Tree models are depth (m), unit weight (kN/m3), corrected SPT blow count (N1(60)) and cyclic stress ratio (CSR), and the output is the observed settlement (mm). For details about the database, readers can refer to Park et al. The database comprises 100 data points (20 for each borehole) along with the corresponding settlement values; a summary is shown in Table 1.
|Borehole||Depth (m)||Unit Weight (kN/m3)||N1(60)||CSR||Settlement (mm)|
3.1 Random forest
Random forest (RF) is an ensemble machine learning technique, proposed by Leo Breiman, that builds a large number of decision trees. Unlike a single decision tree (DT), which uses all the features to construct a tree-like graph, RF uses an “efficient bagging” learning algorithm that integrates random selection of features with bagging. At each split only a random subset of the features is considered, so that one or a few very strong predictors of the target do not dominate every tree. Each tree is trained on a sample drawn with replacement from the training set; this type of sample is known as a bootstrap sample. Using bagging, the individual trees are fitted to these bootstrap samples and then combined by voting (classification) or averaging (regression). RF improves reliability and precision, reduces uncertainty and helps avoid overfitting.
An appropriate number of trees is determined through bootstrap aggregation (bagging), depending on the size and nature of the training set. The RF prediction is obtained by averaging the predictions from the individual regression trees, and can be expressed as:

$$\hat{y}(x) = \frac{1}{N}\sum_{i=1}^{N} \hat{y}_i(x) \tag{1}$$

where $\hat{y}(x)$ represents the RF prediction from the total of N trees, and $\hat{y}_i(x)$ denotes the prediction of each individual tree with the input x. In addition, an approximation of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the trees, which can be expressed as:

$$\sigma(x) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\hat{y}_i(x) - \hat{y}(x)\right)^{2}} \tag{2}$$
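The averaging and uncertainty estimate described above can be sketched in a few lines. This is a minimal illustration, assuming the per-tree predictions are already available; the function names and the five tree outputs are hypothetical, and the sample (N − 1) form of the standard deviation is an assumption.

```python
import random
import statistics


def bootstrap_sample(records, rng):
    """Draw a sample of the same size as `records`, with replacement."""
    return rng.choices(records, k=len(records))


def ensemble_prediction(tree_predictions):
    """Return (mean, standard deviation) of the per-tree predictions."""
    mean = statistics.fmean(tree_predictions)
    # Sample standard deviation (N - 1 in the denominator): an assumption.
    spread = statistics.stdev(tree_predictions)
    return mean, spread


# Hypothetical settlement predictions (mm) from five trees for one input x.
per_tree = [40.0, 44.0, 42.0, 46.0, 38.0]
rf_mean, rf_std = ensemble_prediction(per_tree)

# Each tree would be fitted to its own bootstrap sample of the training set.
sample = bootstrap_sample(list(range(10)), random.Random(1))
```

A wide spread among the trees suggests that the input lies in a sparsely sampled region of the training data.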
Figure 1 illustrates how an RF with N trees performs classification. Starting from the root node (νn), each sample is compared with a threshold value and moved to the right node (νR) or the left node (νL). This partitioning is repeated until a terminal node is reached and a class label is assigned (in this case, class A or B). For a classification task, the ensemble prediction is obtained by applying the majority voting rule to the results of the individual trees.
3.2 REP tree
The reduced error pruning tree (REP Tree) combines the decision tree (DT) and reduced error pruning (REP) algorithms, and is equally good for classification and regression problems. The REP Tree algorithm generates a decision or regression tree by splitting on the attribute with the highest information gain ratio (IGR) and then pruning the tree. The IGR values are determined via Eq. (3) based on the entropy (E) function:

$$\mathrm{IGR}(x, S) = \frac{E(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\, E(S_i)}{-\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}} \tag{3}$$

where x is a predictor of liquefaction-induced settlement and S_i (i = 1, 2, ..., n) are the subsets of the training dataset (S) produced by splitting on x. Since complex decision trees can result in a model that is overfitted and less interpretable, REP reduces complexity by removing leaves and branches of the DT structure in successive pruning steps [16, 18, 19, 20].
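To make the IGR criterion concrete, the following sketch computes the entropy of a set of labels and the gain ratio of a candidate split. It assumes categorical labels (for example, discretized settlement classes), which is an illustrative simplification; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain_ratio(labels, subsets):
    """Information gain of a split divided by its split information."""
    n = len(labels)
    parts = [s for s in subsets if s]  # ignore empty subsets
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in parts)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in parts)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical settlement classes for four samples.
labels = ["low", "low", "high", "high"]
perfect = information_gain_ratio(labels, [["low", "low"], ["high", "high"]])
useless = information_gain_ratio(labels, [["low", "high"], ["low", "high"]])
```

A split that separates the classes completely scores highest, while a split that leaves the class mix unchanged scores zero and would not be selected.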
4. Liquefaction-induced settlement model development
4.1 Preparing training and testing datasets
The manner in which data are divided into training and test datasets in data mining procedures has a substantial effect on the results [21, 22, 23]. The statistical parameters of the input variables, namely the minimum, maximum, mean and standard deviation for the training and test datasets, are shown in Table 2. The dataset was split to assess the generalization and predictive ability of the developed models. Comparable performance on the training and testing datasets suggests that the developed models can be applied within the ranges on which they were trained. The ranges of the input and output parameters in the testing dataset largely fall within those of the training dataset, as shown in Table 2. This statistical consistency between the training and testing datasets enhances the performance of the developed models and helps to assess them properly.
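The 80/20 split and the summary statistics reported for each subset can be sketched as follows. This is a minimal example under stated assumptions: the placeholder values stand in for the 100 records, and the function names, the shuffling strategy and the seed are hypothetical rather than the procedure used in the study.

```python
import random
import statistics


def split_train_test(records, test_fraction=0.2, seed=1):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def summarize(values):
    """Minimum, maximum, mean and sample standard deviation of a variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
    }


# Placeholder settlement values (mm); not the actual database.
settlements = [float(v) for v in range(10, 110)]
train, test = split_train_test(settlements)
train_stats, test_stats = summarize(train), summarize(test)

# Check whether the test range falls inside the training range.
test_within_train = (train_stats["min"] <= test_stats["min"]
                     and test_stats["max"] <= train_stats["max"])
```

If `test_within_train` is false for some variable, the model is being asked to extrapolate, and a different split may be preferable.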
|Dataset||Statistical parameter||Depth (m)||Unit Weight (kN/m3)||N1(60)||CSR||Settlement (mm)|
To ensure comparability, the RF and REP Tree models are developed using the same training and test datasets. Liquefaction-induced settlements are predicted with both models, and their detailed performance is then analyzed to identify the better model. If the performance of that model on the training and test datasets is adequate, it can be adopted.
4.2 Evaluation measures
In this study, three evaluation measures, the mean absolute error (MAE), the root mean square error (RMSE), and the correlation coefficient (r), are used to evaluate and compare the performance of the models. The MAE is the average of the absolute differences between the values predicted by a model and the actual values, the RMSE is the square root of the mean of the squared differences, and the correlation coefficient (r) measures the strength of the linear relationship between the observed and predicted values. Their expressions are as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|O_i - P_i\right| \tag{4}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i - P_i\right)^{2}} \tag{5}$$

$$r = \frac{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)}{\sqrt{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^{2}\,\sum_{i=1}^{n}\left(P_i - \bar{P}\right)^{2}}} \tag{6}$$

where $O_i$ and $P_i$ are the observed and predicted values of the ith sample of the data, respectively, $\bar{O}$ and $\bar{P}$ are the mean values of the observed and predicted values, respectively, and n is the total number of samples. The MAE can be regarded as a more natural and unambiguous index than the RMSE for quantifying errors between the estimated and actual observed values [25, 26]. The RMSE is used as a standard statistical metric to assess the output of a model. A larger correlation coefficient (r) and lower MAE and RMSE values indicate higher accuracy of the predicted results.
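The three measures can be computed directly from paired observed and predicted values. The sketch below is a plain-Python illustration; the function names and the four sample values are hypothetical and are not results from this study.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Hypothetical observed and predicted settlements (mm).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
scores = {
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "r": pearson_r(observed, predicted),
}
```

Because the RMSE squares each error before averaging, it is never smaller than the MAE on the same data, and the gap widens when a few predictions are far off.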
5. Results and discussion
In principle, a well-performing model is obtained when the model parameters are correctly selected and tuned. Here the optimum values were obtained by trial and error over the parameter settings. The optimum value for each machine learning parameter is given in Table 3. In the proposed RF and REP Tree models, the most significant parameters during the modeling process are the seed number and the minimum total weight of instances in a leaf.
|RF||Minimum total weight of instances in a leaf: 1; minimum portion of the variance of all the data to be present in a node to be split in regression trees: 0.001; random number seed used to pick attributes: 1; K value: 0|
|REP Tree||Maximum tree depth: −1; minimum total instance weight in the leaf: 2; minimum likelihood of variance: 0.001; fold number: 3; seed number: 1|
The predictive results of the RF and REP Tree models were obtained for the training and testing datasets. The MAE, RMSE and correlation coefficient (r) were subsequently determined using Eqs. (4)–(6). Figure 2 presents bar graphs comparing the MAE, RMSE and r of both models for the training and test datasets. For the RF model, the prediction accuracy on the training data is higher than on the test data: the r values for the training and testing data are 0.9935 and 0.8833, respectively. For the REP Tree model, the r value for the training data (0.9405) is also better than that for the testing data (0.777). The RF model therefore performs better than the REP Tree model on both the training and testing datasets. The MAE weights all errors equally and is thus less sensitive to large errors, whereas the RMSE gives more weight to large errors than to small ones. The RF model has lower MAE and RMSE values and a higher r value, showing that it provides an adequate prediction of liquefaction-induced settlement on both the training and testing datasets. Additionally, the training and testing results are shown in Figures 3 and 4, in which the predicted settlements are plotted against the actual data. Settlements were predicted more accurately by the RF model than by the REP Tree model, and the REP Tree model underpredicts a few settlement cases relative to the RF model.
This paper explores the potential of RF and REP Tree models for predicting liquefaction-induced settlement using field data. The models were trained and tested on the Pohang city liquefaction-induced settlement database. Both models assess liquefaction-induced settlement from substantial contributing factors, namely depth, unit weight, corrected SPT blow count and cyclic stress ratio. The performance of the models is measured using statistical parameters: the correlation coefficient (r), MAE, and RMSE. The RF model performs better on both the training and testing datasets. From this analysis it can be inferred that the RF model predicts liquefaction-induced settlement well in comparison with the REP Tree model. However, artificial intelligence-based approaches are data-dependent, and their output can vary depending on the dataset, the quality and number of training data, and the size of the experiments. Finally, the proposed models are open to further development, and the accumulation of more data will allow a much better evaluation of liquefaction-induced settlements.
The work presented in this paper was part of the research sponsored by the Key Program of National Natural Science Foundation of China under Grant No. 51639002 and National Key Research and Development Plan of China under Grant No. 2018YFC1505300-5.3.
Conflict of interest
The authors declare no conflict of interest.