Sources Identification of Water Inrush in Coal Mines Using Technique of Multiple Non-Linear Machine Learning Modelling
DOI: http://dx.doi.org/10.5772/intechopen.94288

Water inrush is a major threat to working safety in the coal mines of the Northern China coal district. The inrush pattern, threat level, and geochemical characteristics vary with the water source. Identifying the water source correctly is therefore an important task in predicting and controlling water inrush accidents. This chapter reviews the algorithms used, and the attempts made, to identify water inrush sources, especially in the Northern China coal mine district. Geochemical analysis and machine learning algorithms are the two main methods of identifying water inrush sources. Machine learning (ML) modelling involves four main steps: data processing, feature selection, model training, and evaluation. In a calculation instance, most of the major ions and some trace elements, such as Ti, Sr, and Zn, were identified as important by both geochemical analysis and machine learning modelling. ML algorithms such as random forest (RF), support vector machine (SVM), and logistic regression (LR) perform well in identifying the sources of coal mine water inrush.


Introduction
Water inrush is one of the most severe hazards to coal mines in China. According to statistics, more than 25 billion tons of coal resources in China are at risk of water inrush. From 2000 to 2015, 1162 water inrush accidents were reported, causing 4676 deaths. These accounted for 3.3% of all coal mine accidents and 7.8% of all coal mine deaths, respectively. Despite the low proportion, major accidents were frequent, leading to severe loss of property and life.
The Northern China district is an important coal base, holding nearly 40% of the country's reserves. The prevention of water inrush accidents is therefore a key issue for mining safety. The water inrush threats to the working face can be grouped into four main types: surface water, coal roof aquifer water, coal floor aquifer water, and goaf water. Coal roof water is usually related to the coal seam sandstone aquifers and is sometimes associated with Quaternary aquifers. Goaf water forms when a working face is closed and groundwater fills the space. Coal floor water is usually related to the limestone aquifers of the Ordovician system and of the Taiyuan Formation in the Carboniferous system.
The different types of water inrush threat differ in their precursors, bursting behaviour, and hazard rating, so a corresponding treatment technology is essential for each. Techniques to predict and evaluate the accident potential, forecast the occurrence of accidents, and identify water inrush sources are therefore key to preventing accidents and disasters and to protecting working safety, human health, and lives.
This chapter illustrates the main techniques used to identify water inrush sources and their applications, focusing mainly on the Northern China district.

Methods and their applications for source identification
A basic strategy for identifying the source of a water inrush is based on geochemical characteristics. Some researchers have compared the concentrations of the major ions, including K⁺, Na⁺, Ca²⁺, Mg²⁺, Cl⁻, SO₄²⁻, CO₃²⁻, and HCO₃⁻, as well as total dissolved solids (TDS), between different aquifers to determine water sources.
In each aquifer, the water composition reflects its original characteristics and the water-rock interaction process. In Northern China, the two main groups of aquifers are the coal-bearing strata aquifers and the limestone aquifers. Geochemists look for key ions in the water, sometimes with the help of geochemical diagrams, to determine the water sources.
While the geochemical strategy relies on a few distinctive ions and parameters in a low dimension, the other strategy, machine learning (ML), is based on multivariate analysis; it comprises several specific methods and provides more quantitative and reliable results.

Geochemical methods
The geochemical method is a popular technique in water inrush identification, mainly for two reasons. First, some coal mines, especially those of large companies, have their own laboratories for testing water geochemistry, so data are easy to obtain. Second, experienced technicians are familiar with water geochemical data, especially the major ions and important parameters. Researchers usually start from conventional water geochemistry: they investigate the water characteristics of every aquifer, set up an identification model to distinguish one water type from the others, and work out the water-rock interaction mechanism behind the water composition.
An easy-to-handle way to identify a water source is to analyse the major ion characteristics. Cheng et al. [1] analysed the water geochemistry of the Quaternary, magmatic, and limestone aquifers in the Huaibei coal mine district, Anhui province. The data were grouped into different chemical types, which can serve as a database for water source identification. Chen and Gui [2] discussed the water geochemistry of the Wanbei coal mine district in Anhui province. Zhang and Cao [3] analysed the groundwater in the Hancheng coal mine district in Shaanxi province and found that the potential water burst point was related to the limestone aquifer. Dai et al. [4] discussed the water characteristics of the Xiangshan coal mine in Shaanxi province; the data were grouped using SPSS to set up a database for further coal mine monitoring and forecasting. The author's group has collected and analysed more than 30 water samples in the Lu'an coal mine district in Shanxi province. The underground flow pattern and the water characteristics of every aquifer were summarised, and some important ions, trace elements, and parameters were identified and used to distinguish the water sources from one another.
A geochemical chart, the Piper diagram, is commonly used to analyse water samples and sort them into groups by plotting the data as points in two triangles and a diamond. Zhang and Cao [3] and Dai et al. [4] applied this technique to identify water sources. The author's group collected samples in 2019; Table 1 shows part of the data, and Figure 1 shows the water geochemistry in a Piper diagram.
As Figure 1 shows, the water in the coal-bearing seams has similar characteristics: Na⁺ and K⁺ account for more than 80%, and up to more than 95%, of all cations. The TDS of most of these samples was less than 1000 mg/L. The limestone water is more widely scattered in the diagram. Its TDS also covered a much wider range, from less than 500 mg/L to more than 3000 mg/L. In the limestone aquifer, the water volume is larger and the water-rock interaction is stronger than in the coal-bearing seams, which may explain the water characteristics of the limestone aquifers.
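The cation shares plotted in a Piper diagram are conventionally computed on a milliequivalent basis. The following minimal sketch shows that conversion; it assumes concentrations in mg/L, and the sample values and all function names are hypothetical and are not taken from Table 1.

```python
# Equivalent weights (mg per meq) = molar mass / ionic charge.
EQ_WEIGHT = {
    "Na": 22.99, "K": 39.10, "Ca": 40.08 / 2, "Mg": 24.305 / 2,
    "Cl": 35.45, "SO4": 96.06 / 2, "HCO3": 61.02, "CO3": 60.01 / 2,
}
CATIONS = ("Na", "K", "Ca", "Mg")
ANIONS = ("Cl", "SO4", "HCO3", "CO3")


def to_meq(sample_mg_l):
    """Convert ion concentrations from mg/L to meq/L."""
    return {ion: conc / EQ_WEIGHT[ion] for ion, conc in sample_mg_l.items()}


def percent_shares(sample_mg_l, ions):
    """Share (%) of each ion within its group, on a meq/L basis."""
    meq = to_meq(sample_mg_l)
    total = sum(meq[ion] for ion in ions)
    return {ion: 100.0 * meq[ion] / total for ion in ions}


# Hypothetical sample (mg/L), invented for illustration only.
sample = {"Na": 300.0, "K": 5.0, "Ca": 10.0, "Mg": 3.0,
          "Cl": 40.0, "SO4": 60.0, "HCO3": 650.0, "CO3": 0.0}

cation_pct = percent_shares(sample, CATIONS)
anion_pct = percent_shares(sample, ANIONS)
na_k_pct = cation_pct["Na"] + cation_pct["K"]
print(f"Na+K share of cations: {na_k_pct:.1f}%")
```

The two sets of percentages give the coordinates of a sample in the cation and anion triangles; the position in the diamond follows from the same values.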
Source identification by basic geochemical techniques is a qualitative or semi-quantitative method that depends largely on the researcher's experience. If distinct differences between aquifers can be observed in low dimensions, the basic geochemical technique is useful and easy to apply. However, when the differences only appear in a higher dimension, i.e. in the overall ion composition, this method may give confusing results.
Not only the major ions but also trace element concentrations and isotope values can be used to distinguish one source from the others. Some researchers have used trace elements to distinguish water samples or to set up discriminant models. Feng and Han [5] analysed the concentration and occurrence of trace elements and modelled their formation using PHREEQC. Chen et al. [6] collected 24 samples from the Quaternary aquifers, coal seam sandstone aquifers, and limestone aquifers in the Wanbei coal mine district in Anhui province and tested 24 trace elements, including Be, B, Sc, V, and Cr. The samples and trace elements were clustered, and eight trace elements (Be, Zn, Ga, Sr, U, Zr, Cs, and Ba) were found to be the key parameters for a discriminant model. A Bayes discriminant model trained on these key trace elements performed well.
Isotopes are also used in studies of water inrush in coal mines. The most commonly used are δD and δ³⁴S of water. In recent years, such studies have been carried out in the Wanbei coal mine district [7-9] and the Fushun coal mine district [10], among others. In the author's research in the Lu'an coal mine district in Shanxi province, the major ions and trace elements were treated together, and SO₄²⁻, Ti, Sr, Mg, K + Na, Zn, and Cl⁻ were chosen as the typical ions or elements for model training.
Furthermore, groundwater forms at the scale of a whole water unit, so the analysis should be carried out at that scale rather than at a single point. In the Northern China area, several groundwater units can be distinguished, within which the water-rock interactions show similar patterns across different coal mine districts. Analysis at the coal mine district scale and comparison between districts are therefore important for summarising the common mechanisms of water-rock interaction and the discrimination models.

Machine learning methods
The geochemical method is effective only if the water samples can be separated clearly by one or very few parameters. In most scenarios, distinguishing sources by individual ions is confusing and inaccurate. The differences between water samples are embedded in a high dimension, i.e. in the combination of major ions, trace elements, and other parameters, and the separating pattern is hard to find by observation or simple plotting. Thanks to the development of data science and technology, environmental and geological problems, including groundwater problems, can now be described and classified by ML methods.
ML algorithms can be broadly divided into supervised, unsupervised, and semi-supervised ones, depending on how the target variables are labelled. For some environmental and geological problems the target variables cannot be labelled, and unsupervised ML algorithms such as principal component analysis (PCA) are applied. For example, Shan et al. [11] applied PCA to analyse the occurrence and leaching mechanism in coal and host rock, and Pumure et al. [12] successfully determined the occurrence of As and Se in coal host rock. The self-organising map (SOM) is a kind of unsupervised artificial neural network (ANN) used where large amounts of data are available [13-14]. For water inrush, the PCA algorithm is used only if the target variables cannot be labelled [15][16][17].
When researchers carry out such studies, discriminant models have to be trained. The training data come from samples collected from every aquifer, and at this stage the data are usually clearly labelled. The target variables are therefore available in most research cases, and supervised ML algorithms can be used, which give higher precision and accuracy than unsupervised ones. Several algorithms are suitable for model training, such as artificial neural networks (ANN), support vector machines (SVM), discriminant analysis (DA), decision trees (DT), random forests (RF), boosting, and regression.
In Northern China, supervised ML algorithms have been used in several coal mine districts. Table 2 lists some of the research cases from the area in recent years. The table shows that DA criteria are the most widely implemented, while other methods, such as SVM and ANN, are also used.

Supervised machine learning algorithm
To date, DA is the most popular method for identifying the sources of water inrush in the Northern China district. Two criteria are commonly used: the Fisher criterion and the Bayes criterion. In Fisher-criterion-based DA, the high-dimensional data are projected onto a one-dimensional space, and a discriminant criterion is obtained that maximises the variance between the two groups while minimising the variance within each group. Because this method handles a two-group problem, several rounds of calculation are needed for a multiple-group problem. Bayes-criterion-based DA calculates the posterior probability that a sample belongs to each group and assigns the sample to the group with the highest posterior probability. Compared with the Fisher criterion, the Bayes criterion is used more frequently.
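The two criteria can be illustrated with a minimal NumPy sketch for a two-group problem. The data are synthetic, the three features are hypothetical, and the Bayes step assumes Gaussian group densities with a shared variance on the projected axis; none of this reproduces the models of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic samples from two hypothetical aquifers, three features each.
X0 = rng.normal(loc=[1.0, 2.0, 0.5], scale=0.5, size=(40, 3))
X1 = rng.normal(loc=[3.0, 1.0, 1.5], scale=0.5, size=(40, 3))


def fisher_direction(X0, X1):
    """Direction maximising between-group over within-group variance."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-group scatter matrix.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)


w = fisher_direction(X0, X1)
z0, z1 = X0 @ w, X1 @ w                      # one-dimensional projections
threshold = 0.5 * (z0.mean() + z1.mean())    # Fisher rule: midpoint of means


def bayes_posterior(z, z0, z1, prior1=0.5):
    """Posterior P(group 1 | z), assuming equal-variance Gaussian groups."""
    var = 0.5 * (z0.var(ddof=1) + z1.var(ddof=1))
    like0 = np.exp(-(z - z0.mean()) ** 2 / (2.0 * var))
    like1 = np.exp(-(z - z1.mean()) ** 2 / (2.0 * var))
    return prior1 * like1 / (prior1 * like1 + (1.0 - prior1) * like0)


z_all = np.concatenate([z0, z1])
truth = np.array([0] * len(z0) + [1] * len(z1))
pred_fisher = (z_all > threshold).astype(int)
pred_bayes = (bayes_posterior(z_all, z0, z1) > 0.5).astype(int)
accuracy = float((pred_fisher == truth).mean())
print(f"training accuracy of the Fisher rule: {accuracy:.2f}")
```

With equal priors and a shared variance the two rules coincide; they diverge once the priors or the group variances differ, which is one reason the Bayes criterion extends more naturally to several groups.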
DA is a linear algorithm. With the development of ML technology, non-linear modelling has become widely used in research, including in the geological, environmental, and engineering fields. For surface water and groundwater problems, SVM has been applied to predict water quality and water level [23,24], and ANN and DT have been used to predict [NO₃⁻] in groundwater [25] and to set up a water quality monitoring system [26]. Boosting trees have also been used to classify distributed water and groundwater.
However, non-linear ML methods have been applied relatively rarely to water inrush problems in coal mines, although they may achieve higher accuracy than linear algorithms. According to a literature survey, ANN [27] and SVM [20] have been implemented in this area. ANN is a very popular technique in many fields, including image and voice recognition and autonomous driving. However, an ANN usually needs a large amount of training data to keep over-fitting under control, whereas data in the environmental and geological fields, including water inrush analysis, are usually structured and limited in quantity. As a result, over-fitting is hard to control: an ANN model may achieve high accuracy on the training data yet is expected to show low prediction accuracy on the testing data. SVM controls over-fitting better. Besides SVM, DT, boosting trees, and Bayes networks (BN) also have good prospects, given that coal mine groundwater data are structured and small in quantity.

Data selection and feature engineering
The measured groundwater data are the material for model training, and data preparation is essential to ensure or enhance model quality. Data preparation mainly comprises data selection and feature engineering. A very simple model performs poorly, which is why non-linear models are used. As model complexity increases, the prediction error on the training samples decreases steadily, whereas the prediction error on the testing samples first decreases and then rises again (Figure 2), which indicates over-fitting of the ML model. The features therefore have to be processed if a well-performing model is to be obtained.
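The relation between model complexity and the two prediction errors can be reproduced on toy data. The sketch below is only an illustration under stated assumptions: it uses the polynomial degree as a stand-in for model complexity and a synthetic one-dimensional data set, not the groundwater data.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)


def make_data(n):
    """Synthetic noisy samples of a smooth curve (illustration only)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.3, n)
    return x, y


x_train, y_train = make_data(20)     # small training set, as in mine data
x_test, y_test = make_data(200)

degrees = [1, 3, 5, 9, 12]           # increasing model complexity
train_mse, test_mse = [], []
for d in degrees:
    coeffs = P.polyfit(x_train, y_train, d)
    train_mse.append(float(np.mean((P.polyval(x_train, coeffs) - y_train) ** 2)))
    test_mse.append(float(np.mean((P.polyval(x_test, coeffs) - y_test) ** 2)))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The training error can only fall as the degree grows, because each model contains the previous one. The testing error typically falls at first and rises again for the highest degrees, although the exact turning point depends on the random sample.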
Feature engineering includes feature fusion and feature selection. A common feature fusion method is PCA, which reduces the dimension of the data so that a few features in a lower-dimensional space represent most of the information. Because the new features are combinations of the original ones, they do not reflect the data characteristics directly. When researchers want to analyse the importance of the parameters in the original data, feature selection should be used instead.
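Feature fusion by PCA can be sketched in a few lines of NumPy. The example assumes standardised input (a common choice when the ions are on very different concentration scales) and uses synthetic data with six hypothetical parameters; the function name `pca` is illustrative.

```python
import numpy as np


def pca(X, n_components):
    """PCA via SVD of the standardised data matrix.

    Returns the scores (fused features), the loadings (weights of the
    original features in each component) and the explained variance ratio.
    """
    Xc = X - X.mean(axis=0)
    Xs = Xc / Xc.std(axis=0, ddof=1)
    _, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    scores = Xs @ Vt[:n_components].T
    return scores, Vt[:n_components], explained[:n_components]


rng = np.random.default_rng(2)
# Synthetic data: 50 samples, 6 correlated parameters driven by 2 factors.
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(50, 6))

scores, loadings, ratio = pca(X, n_components=2)
print("explained variance ratio:", np.round(ratio, 3))
```

Each row of `loadings` mixes all six original parameters, which is why the fused features cannot be read as the importance of any single ion.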
Popular feature selection methods include RF and Lasso regression. RF-based feature selection proceeds in the following steps.
1. Let the data set X contain N samples. Samples are drawn randomly from X using the bootstrap resampling method. The resampling is carried out k times to construct k regression trees. In this process, the probability that a given sample is never drawn is p = (1 − 1/N)^N, which tends to 1/e ≈ 0.37 as N tends to infinity. This means that about 37% of the samples in X are not drawn in a given bootstrap sample; these samples are not used to train that tree and are called out-of-bag (OOB) data. The OOB data are used to test the regression trees.
2. For the k bootstrap samples, k unpruned regression trees are grown. During training, at each internal node, m attributes are randomly selected from the total M attributes as candidate split variables. The optimal attribute among these m is then selected as the split variable to grow the branches, according to the minimum Gini index principle.
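3. The k decision trees make up a random forest. The model quality can be evaluated using two indices, the mean square error of the OOB predictions ($\mathrm{MSE}_{\mathrm{OOB}}$) and the coefficient of determination ($R^2_{\mathrm{RF}}$); a low $\mathrm{MSE}_{\mathrm{OOB}}$ and a high $R^2_{\mathrm{RF}}$ indicate a good model:
$$\mathrm{MSE}_{\mathrm{OOB}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2_{\mathrm{RF}} = 1 - \frac{\mathrm{MSE}_{\mathrm{OOB}}}{\hat{\sigma}_y^2}$$
where $n$ is the total number of samples, $\hat{y}_i$ is the output predicted by the generated RF regression model, $y_i$ is the observed output value, and $\hat{\sigma}_y^2$ is the predicted variance of the OOB output.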
4. The RF model provides two measures of the importance of each variable: the mean decrease in Gini index and the mean decrease in accuracy. In a regression model, the mean decrease in Gini is usually used, while the mean decrease in accuracy is applied more often to classification problems. Water inrush source identification is a classification problem, so the mean decrease in accuracy is selected.
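The out-of-bag fraction in step 1 and the two quality indices in step 3 can be checked numerically. The sketch below uses only the Python standard library; the function names are hypothetical, the example values are invented, and the variance term is taken here as the population variance of the observed outputs, which is an assumption about how the OOB variance is estimated.

```python
import math
import random


def oob_probability(n):
    """Probability that one sample is never drawn in n bootstrap draws."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, seed=0):
    """Fraction of samples left out of one simulated bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


def mse_oob(y_true, y_pred):
    """Mean square error of the OOB predictions."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)


def r2_rf(y_true, y_pred):
    """Coefficient of determination, 1 - MSE_OOB / variance of the output."""
    mean = sum(y_true) / len(y_true)
    variance = sum((yt - mean) ** 2 for yt in y_true) / len(y_true)
    return 1.0 - mse_oob(y_true, y_pred) / variance


for n in (10, 100, 10_000):
    print(n, round(oob_probability(n), 4), round(simulated_oob_fraction(n), 4))
print("limit 1/e =", round(math.exp(-1.0), 4))

# Invented observed and OOB-predicted values, for illustration only.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print("MSE_OOB =", mse_oob(y_obs, y_hat), " R2_RF =", r2_rf(y_obs, y_hat))
```

For small N the out-of-bag probability is slightly below 1/e, so the 37% figure is a limit rather than an exact value for the sample sizes typical of mine water data sets.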
In inrush source identification, the attributes that can be used in the model include major ions, trace elements, important parameters, and isotopes. Of these, the major ions and important parameters are the easiest to obtain. On the other hand, adding trace elements and isotopes may enhance model performance, because these parameters carry a lot of information about the water samples. For ease of use, only the major ions and important parameters are used; where model accuracy matters more, further parameters can be added. This trade-off has to be considered when building models.
In our previous study in the Lu'an coal mine district in Shanxi Province, all of the parameters described above were tested. In the first step, feature selection was applied to the major ions and important parameters.
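The idea behind ranking features by the mean decrease in accuracy can be sketched with permutation importance: one feature is shuffled in held-out data and the resulting loss of accuracy is recorded. To keep the example short, a nearest-centroid classifier is swapped in for the random forest, and the data are synthetic with one informative and one uninformative feature; this is not the procedure or the data of the Lu'an study.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
labels = rng.integers(0, 2, n)            # two hypothetical water sources
# Feature 0 depends on the source; feature 1 is pure noise.
X = np.column_stack([labels * 3.0 + rng.normal(0.0, 0.5, n),
                     rng.normal(0.0, 1.0, n)])
train, test = np.arange(0, 150), np.arange(150, n)


def fit_centroids(X, y):
    """Stand-in classifier: one centroid per class."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])


def predict(centroids, X):
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def accuracy(centroids, X, y):
    return float((predict(centroids, X) == y).mean())


centroids = fit_centroids(X[train], labels[train])
baseline = accuracy(centroids, X[test], labels[test])


def mean_decrease_in_accuracy(j, repeats=20):
    """Average accuracy lost when feature j is shuffled in the test data."""
    drops = []
    for _ in range(repeats):
        X_perm = X[test].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - accuracy(centroids, X_perm, labels[test]))
    return float(np.mean(drops))


importance = [mean_decrease_in_accuracy(j) for j in range(X.shape[1])]
print("baseline accuracy:", baseline)
print("mean decrease in accuracy per feature:", np.round(importance, 3))
```

In a random forest the same calculation is carried out tree by tree on the OOB samples, so no separate test set is needed.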

Figure 1. Piper diagram of the groundwater (squares: surface water; triangles: Quaternary water; circles: coal-bearing seam water; stars: limestone aquifer water).
In a broad sense, data selection includes data cleaning, i.e. the treatment of units and missing data. The data are then screened to decide which samples are used in model training. Data selection is applied at two stages, before and after model training. Before training, selecting suitable data means making sure that all samples are labelled correctly; incorrectly labelled data inevitably lead to a wrong model, whatever the quality of the algorithm. After training, the training data should be checked again, very carefully, to find misclassified samples. If a sample is found to have been wrongly pre-labelled, it should be deleted and a new model trained. The other important task before model training is feature engineering, whose basic purpose is to achieve the best model performance. Figure 2 shows the idea. The number of features, or parameters, determines the model complexity: more features lead to a more complex model. As Figure 2 shows, prediction performance is related to model complexity, and a very simple model leads to very poor model performance.
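The cleaning stage described above can be sketched as follows. The records, field names, and unit table are hypothetical, and filling a missing value with the median of the same aquifer is only one possible treatment, chosen here as an assumption; dropping the sample is an equally valid option.

```python
from statistics import median

# Hypothetical raw records; values are invented for illustration only.
raw = [
    {"aquifer": "limestone", "unit": "mg/L", "Ca": 120.0, "Cl": 35.0},
    {"aquifer": "limestone", "unit": "g/L", "Ca": 0.15, "Cl": 0.04},
    {"aquifer": "sandstone", "unit": "mg/L", "Ca": 12.0, "Cl": None},
    {"aquifer": "sandstone", "unit": "mg/L", "Ca": 9.0, "Cl": 60.0},
    {"aquifer": None, "unit": "mg/L", "Ca": 30.0, "Cl": 20.0},
]
IONS = ("Ca", "Cl")
TO_MG_L = {"mg/L": 1.0, "g/L": 1000.0}


def clean(records):
    """Drop unlabelled samples, unify units to mg/L, fill missing values."""
    rows = []
    for rec in records:
        if rec["aquifer"] is None:        # unlabelled: unusable for training
            continue
        factor = TO_MG_L[rec["unit"]]
        row = {"aquifer": rec["aquifer"]}
        for ion in IONS:
            row[ion] = None if rec[ion] is None else rec[ion] * factor
        rows.append(row)
    # Fill each gap with the median of the known values in the same aquifer.
    # statistics.median raises an error if no value is known for that aquifer.
    for ion in IONS:
        for row in rows:
            if row[ion] is None:
                known = [other[ion] for other in rows
                         if other["aquifer"] == row["aquifer"]
                         and other[ion] is not None]
                row[ion] = median(known)
    return rows


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

The second stage, the check after training, reuses the same table: samples that the trained model misclassifies are listed for manual review of their labels before any retraining.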

Figure 2. Correlation between model complexity and prediction error.

Figure 3. Feature selection result of major ions and important parameters using RF algorithm.

Figure 4. Feature selection result of all data using RF algorithm.

Table 1. Major ion data in ground aquifer (mg/L).