Data Mining for Source Apportionment of Trace Elements in Water and Solid Matrix

Trace elements migrate among environmental compartments through natural geochemical reactions and are also affected by industrial, agricultural, and civil activities. High loads of trace elements in water, river and lake sediment, soil, and airborne particles pose potential risks to human health and ecosystems. To control these environmental impacts, source apportionment is a meaningful but challenging task. Traditional source apportionment methods are usually based on geochemical techniques or univariate analysis. In recent years, multivariate analysis and the related fields of data mining, machine learning, and big data have developed rapidly, providing a novel route that combines geochemical and data mining techniques. These methods have proved successful in source apportionment. This chapter reviews the data mining methods used on this topic and their implementations in recent years. The basic methods include principal component analysis, factor analysis, clustering analysis, positive matrix factorization, decision trees, Bayesian networks, and artificial neural networks. Source apportionment of trace elements in surface water, groundwater, river and lake sediment, soil, airborne particles, and dust is discussed.


Introduction
In the context of environmental contamination, trace elements are the elements that occur at lower concentrations than the major elements (O, H, Si, Al, Fe, Ca, Mg, Na, K, and Ti) and that usually account for no more than 1% of rocks and minerals. Trace elements have attracted wide research attention because of their high potential for environmental contamination and health impacts. In some articles, the phrase "heavy metals" is frequently used for elements that have a high density or are toxic at low concentrations. From the viewpoint of environmental impact, the phrases "trace elements" and "heavy metals" refer to similar research objects: both are used as group names for metals and metalloids that have been associated with contamination of water, river sediment, soil, and airborne particles and with potential toxicity and ecotoxicity. In this chapter, the phrase trace elements (TEs) is used for the elements that may cause contamination and health problems, and

Methods for data mining
To investigate the time series and spatial distribution of trace element concentrations, together with their migration sources and reaction pathways, data mining techniques are used. In a narrow sense, data mining refers to using multivariate analysis and machine learning methods to find distribution or change patterns in large data sets. In a broader sense, data mining may include more techniques, such as geochemical, isotopic, and univariate analysis. This chapter emphasizes multivariate analysis and machine learning because of their increasing application and effectiveness in source apportionment and reaction path analysis. Table 1 lists applications of data mining methods to trace element migration. In the table, PCA stands for principal component analysis, FA for factor analysis, CA for clustering analysis, PMF for positive matrix factorization, ANN for artificial neural network, and MLR for multivariate linear regression.

Table 1.
Matrix | Location | Trace elements | Method | Reference
… | … | … | PCA | [48]
Soil (dust) | China | Pb, Cd | PCA/ANN | [49]
Soil | India | Ni, Co | PCA | [50]
Soil (city topsoil) | Armenia | Pb, Zn, Cu, Mo | PCA/CA | [51]
Soil (atmospheric deposition) | China | As, Hg, Cu, Cd, Mo, S, Zn, Cr, Ni, Pb, Se | PCA/CA | [52]
Soil | Spain | Pb, Tl, As, Sb, Cd, Cr, Ni, Be, V, Co | FA | [53]
Soil | Pakistan | Ni, Cr, Zn, Cu, Pb, Cd, Co | PCA/FA/CA | [54]
Soil (agriculture) | Greece | Cu, Pb, Zn, As, Cd, P, K | PCA/CA | [55]
Soil | Italy | Ni, Cr, Pb, Zn | PCA/FA-MLR | [56]
Soil | China | Cd, Hg, Pb, Zn | PCA/CA | [57]
Soil | Iran | - | Semi-supervised ML | [58]
Soil | USA | - | (Six models) | [59]
Particle | India | Ni, Cu, Pb, Cd, Cr | PCA/CA | [60]
Particle (PM2.5) | Canada | - | PMF | [61]
Particle (PM2.5) | China | Cr, Mn, Fe, Cu, Zn, As, Pb, Ba | Regression/Monte Carlo | [62]
Particle (PM2.5) | …

Geochemical methods
Mass balance calculations, Piper diagrams, and Gibbs diagrams [28] are usually applied to analyze geochemical properties and reaction mechanisms. The Piper diagram shows the major element composition of water and classifies water into different types. Several software packages, such as PHREEQC, MINTEQ, and The Geochemist's Workbench, can be used to calculate mass balances and saturation indices, model reaction paths, and draw Piper diagrams. Through water-rock interaction, major elements and TEs may be released and migrate to other water bodies, so major and trace elements with distinguishing features can be used for source apportionment [69,70]. However, TE concentrations in water change through adsorption, desorption, mineralization, and dissolution, so univariate analysis of concentrations alone is neither reliable nor robust. By comparison, isotopic analysis, both stable and radiogenic [30], may be used as a univariate method or combined with other indices. The widely used isotopic tracers are δ¹⁸O and δD in water, ⁸⁷Sr/⁸⁶Sr, δ³⁴S and δ¹⁸O in sulfate [71,72], and δ¹⁵N and δ¹⁸O in nitrate [29]. The δ¹⁸O and δD values of water are used to identify the relations between precipitation and surface water or groundwater. The strontium isotope ratio of water is strictly controlled by water-rock interaction. For water from a single source, the ⁸⁷Sr/⁸⁶Sr ratio reflects the mineralogy of the rocks with which the water has been in contact and does not change along the flow path. Differences in the strontium isotope ratio and strontium concentration are therefore caused by the mixing of waters of various origins, each with specific chemical characteristics and isotopic values. The strontium isotope ratio is thus an ideal tracer for element sources, groundwater movement, and water-rock interaction [70,73-75].
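The mixing argument for strontium can be made concrete with a two-end-member calculation. The sketch below assumes conservative mixing and the usual approximation that the ⁸⁷Sr/⁸⁶Sr ratio of the mixture is the average of the end-member ratios weighted by the strontium each one supplies. The function name `mix_sr` and all numerical values are hypothetical and are not taken from the cited studies.

```python
def mix_sr(f_a, c_a, r_a, c_b, r_b):
    """Return (Sr concentration, 87Sr/86Sr) of a binary water mixture.

    f_a      : fraction of end member A in the mixture (0..1)
    c_a, c_b : Sr concentrations of end members A and B (same units)
    r_a, r_b : 87Sr/86Sr ratios of end members A and B
    """
    if not 0.0 <= f_a <= 1.0:
        raise ValueError("f_a must lie between 0 and 1")
    # Concentration mixes linearly with the mixing fraction.
    c_mix = f_a * c_a + (1.0 - f_a) * c_b
    # The isotope ratio is weighted by the Sr mass from each end member.
    r_mix = (f_a * c_a * r_a + (1.0 - f_a) * c_b * r_b) / c_mix
    return c_mix, r_mix


# Hypothetical end members: a dilute, radiogenic water (A)
# and a Sr-rich, less radiogenic water (B), mixed 1:1.
c_mix, r_mix = mix_sr(0.5, 0.2, 0.7200, 1.0, 0.7080)
print(round(c_mix, 3), round(r_mix, 4))
```

Because the Sr-rich end member dominates the strontium budget, the mixed ratio sits closer to end member B than the 1:1 mixing fraction alone would suggest, which is why ratio and concentration are interpreted together.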
The use of δ³⁴S and δ¹⁸O in sulfate is increasing because sulfate from multiple sources covers a wide range of stable isotope compositions, and the δ³⁴S value of the sulfate stays very close to that of its precursor sulfide mineral. A common anthropogenic source of sulfate is coal and metal mining, where the rocks are rich in pyrite and other sulfide minerals. In active and abandoned coal mines, the sulfide minerals may be oxidized and dissolved, releasing trace elements. Sulfate isotopes have shown that groundwater can be contaminated by water-rock interaction in coal mines. Besides natural isotopes, tracers such as isotopes and stable organic compounds are injected into groundwater to find the flow pathway [71,72,76,77].
To evaluate TE contamination and find the source of pollution in water and solid matrices, several indices are calculated. The enrichment factor (EF) expresses the enrichment level of a given TE in the environment, as shown in Eq. (1):

$$ EF = \frac{c_i / c_{ref}}{B_i / B_{ref}} \quad (1) $$

where c_i is the measured concentration of the TE in the sample, c_ref is the measured concentration of the reference element, and B_i and B_ref are the background levels of the TE and of the reference element in the same region [41]. An EF value close to 1 suggests a weathering origin, while a value higher than 1 means TE enrichment in soil, which is probably caused by human activities. An EF value between 2 and 5 indicates moderate contamination, and a value higher than 5 indicates heavy TE pollution [49]. The EF is frequently used in combination with data mining methods, or as a check when tracing contamination sources.
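A short calculation illustrates how the EF is applied. The sketch below assumes the usual double-ratio form, EF = (c_i / c_ref) / (B_i / B_ref), and uses the thresholds quoted in the text. The function names and the concentrations (Pb normalized to Al) are hypothetical illustrations.

```python
def enrichment_factor(c_i, c_ref, b_i, b_ref):
    """EF = (c_i / c_ref) / (B_i / B_ref), all values in consistent units."""
    return (c_i / c_ref) / (b_i / b_ref)


def classify_ef(ef):
    """Map an EF value to the classes described in the text."""
    if ef > 5:
        return "heavy contamination"
    if ef >= 2:
        return "moderate contamination"
    return "near background or slight enrichment"


# Hypothetical soil sample: Pb with Al as the reference element (mg/kg).
ef_pb = enrichment_factor(c_i=60.0, c_ref=50000.0, b_i=20.0, b_ref=60000.0)
print(round(ef_pb, 2), classify_ef(ef_pb))
```

Normalizing to a conservative reference element removes much of the grain-size and dilution effect, so samples with different bulk compositions can be compared on the same scale.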
The geoaccumulation index (I_geo) is defined by Eq. (2):

$$ I_{geo} = \log_2 \frac{c_i}{1.5 B_i} \quad (2) $$

where c_i is the measured concentration of the TE and B_i is the background concentration of that TE [41]. The factor 1.5 allows for natural fluctuations of the background value.
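As a worked example, the index can be computed for a few elements at once. The sketch assumes the standard form I_geo = log2(c_i / (1.5 B_i)). The element list and the measured and background concentrations are hypothetical.

```python
import math


def igeo(c_i, b_i, k=1.5):
    """Geoaccumulation index: log2 of concentration over k times background."""
    return math.log2(c_i / (k * b_i))


# Hypothetical measured and background concentrations (mg/kg).
measured = {"Pb": 60.0, "Cd": 0.9, "Zn": 75.0}
background = {"Pb": 20.0, "Cd": 0.1, "Zn": 100.0}

igeo_values = {el: igeo(measured[el], background[el]) for el in measured}
for el, value in igeo_values.items():
    print(el, round(value, 2))
```

A negative value, as for Zn here, means the measured concentration is below 1.5 times the background, while each unit increase corresponds to a doubling of the concentration.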
The potential ecological risk index (RI) was proposed by Hakanson and can be used to evaluate the potential ecological risk posed by TEs in water and solid matrices. This comprehensive method considers four factors: the concentration, the type of pollutant, the toxicity level, and the sensitivity of the water body to metal contamination [78,79].

Machine learning
Data sets from studies of environmental processes exhibit spatial variation. The ability to derive risk predictions from field data is a critical step toward understanding the data and applying the information to land and resource management. Multivariate analysis and machine learning (ML) methods are precise and robust, and they can give insight into the phenomena and help reveal mechanisms. In addition, environmental data are usually organized as matrices of observations and variates. Machine learning methods are therefore well suited to environmental and geochemical problems. However, the calculations behind machine learning and multivariate analysis are complex, which may hinder their implementation. Thanks to recent advances in predictive modeling, in software (open-source tools such as R and Python, and commercial packages such as SPSS, SAS, and Minitab), and in computing, the power to do this is within grasp.
The basic principle of ML is to train a model on the available data and then apply the model to the target problem. ML methods can be divided into three types: supervised, unsupervised, and semi-supervised learning. When the training data are labeled, the method is supervised, while unsupervised ML works without labeled data. Semi-supervised ML adds labels to the data during model training. Generally, supervised ML is more precise and robust than the other types. However, geochemical and environmental data are usually unlabeled. For example, when researchers try to identify the source of water, or of TEs in water and solid matrices, the true sources are usually not known with certainty. Therefore, up to the present, unsupervised ML has been implemented more widely than supervised and semi-supervised ML.

Unsupervised ML
Unsupervised ML is undoubtedly the most widely used approach in this area. Common unsupervised techniques include principal component analysis (PCA), factor analysis (FA), clustering analysis (CA), and positive matrix factorization (PMF).
As a machine learning algorithm, PCA is a tool for reducing a high-dimensional matrix to a lower-dimensional one, usually with two to five dimensions. The dimension reduction is accomplished by transforming the data to a new set of variables (principal components), which are linear combinations of the original variables and are ordered in such a way that the first principal components account for most of the variation in all of the original variables [80].
A matrix M, with m observations (rows) and n variates (columns), is processed in five steps to form a new matrix with fewer variates.
Step 1: the raw data in M are standardized.
Step 2: the covariance matrix of the standardized M is calculated.
Step 3: the eigenvalues and eigenvectors of the covariance matrix are calculated.
Step 4: the contribution ratio and cumulative contribution of each eigenvalue are calculated, and the principal components are then selected according to mathematical and project criteria.
Step 5: the loadings of every principal component and the scores of every observation are calculated.
Theoretically, the number of new variates is equal to the number of variates in the original matrix. However, the new variates explain different proportions of the total variance, so the principal components are selected based on the proportion explained. Different criteria are used: some researchers keep components with an eigenvalue larger than 1, and others keep components up to a cumulative contribution of, say, 80%.
After the PCA calculation, some variates have high loadings on specific principal components, while other variates have high loadings on other PCs. It is then inferred that variates with a similar pattern in the matrix may share a similar pattern in the real world, i.e., the same source, migration behavior, or reaction pathway. In principle, PCA is similar to clustering analysis, but PCA is not constrained to two dimensions, which allows researchers to mine the inner relationships in the matrix and understand the real world more precisely.
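The five steps can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `pca`, the synthetic data set, and the variate names (Pb, Zn, Fe, Mn) are assumptions made for the example, with two hidden "sources" each driving a pair of variates.

```python
import numpy as np


def pca(M, cumulative_target=0.80):
    """PCA of an (observations x variates) matrix following the five steps."""
    M = np.asarray(M, dtype=float)
    # Step 1: standardize each variate (zero mean, unit variance).
    Z = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized data.
    C = np.cov(Z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors, sorted from largest to smallest.
    eigval, eigvec = np.linalg.eigh(C)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    # Step 4: contribution ratio, cumulative contribution, selection criteria.
    ratio = eigval / eigval.sum()
    cumulative = np.cumsum(ratio)
    n_kaiser = int(np.sum(eigval > 1.0))  # eigenvalue > 1 criterion
    n_cumulative = min(
        int(np.searchsorted(cumulative, cumulative_target)) + 1, len(eigval)
    )
    # Step 5: loadings of each component and scores of each observation.
    loadings = eigvec * np.sqrt(np.clip(eigval, 0.0, None))
    scores = Z @ eigvec
    return {
        "eigenvalues": eigval,
        "ratio": ratio,
        "n_kaiser": n_kaiser,
        "n_cumulative": n_cumulative,
        "loadings": loadings,
        "scores": scores,
    }


# Synthetic example: Pb and Zn follow source A, Fe and Mn follow source B.
rng = np.random.default_rng(0)
source_a = rng.normal(size=60)
source_b = rng.normal(size=60)
noise = rng.normal(scale=0.1, size=(60, 4))
M = np.column_stack([source_a, source_a, source_b, source_b]) + noise

result = pca(M)
for name, row in zip(["Pb", "Zn", "Fe", "Mn"], result["loadings"][:, :2]):
    print(name, np.round(row, 2))
```

In the printed loadings, variates that share a source load together on the retained components, which is the pattern that is interpreted as a common origin in source apportionment studies.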
Factor analysis (FA) is based on PCA, follows a similar principle, and aims at a similar result, but it is applied less often than PCA.
The data for every variate should be normally distributed. The Kaiser-Meyer-Olkin (KMO) test and Bartlett's sphericity test are usually used to check whether the data are suitable for PCA/FA.
For source apportionment of particulate matter and of the trace elements carried by suspended particles, positive matrix factorization (PMF) is usually used. A matrix M with f observations and n variates can be decomposed as in Eq. (3):

$$ M_{f \times n} \approx W_{f \times k} H_{k \times n} \quad (3) $$

In a source apportionment problem, W stands for the source contributions, H stands for the source profiles, and k is the number of possible sources. Minimizing the loss function determines the proper value of k and the matrices W and H, from which the number of sources and their contribution ratios can be inferred. The matrix W needs to be normalized by the average value of each column across all samples, as shown in Eq. (4):

$$ \tilde{w}_{ik} = \frac{w_{ik}}{\frac{1}{f} \sum_{j=1}^{f} w_{jk}} \quad (4) $$

where w_ik is the element of W for observation i and source k. PMF is usually used in source apportionment for particles, such as PM10 and PM2.5 [61,66-68], but is seldom used for other environmental media.
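The factorization and the normalization step can be sketched as follows. Note the assumption: this is plain non-negative matrix factorization with multiplicative updates and a least-squares loss, not the full PMF algorithm, which additionally weights each residual by its measurement uncertainty. The function names, the synthetic source profiles, and the rescaling of H (chosen so that the product W H is unchanged) are illustrative choices.

```python
import numpy as np


def nmf(M, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix M (f x n) into W (f x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    f, n = M.shape
    W = rng.random((f, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H


def normalize_contributions(W, H):
    """Divide each column of W by its mean over all samples.

    H is rescaled by the same factors so that W @ H stays the same.
    """
    mean_w = W.mean(axis=0)
    return W / mean_w, H * mean_w[:, None]


# Synthetic data: two hypothetical source profiles over five elements.
rng = np.random.default_rng(1)
H_true = np.array([[1.0, 0.8, 0.0, 0.1, 0.0],
                   [0.0, 0.1, 1.0, 0.9, 0.5]])
W_true = rng.random((30, 2))
M = W_true @ H_true

W, H = nmf(M, k=2)
W_norm, H_scaled = normalize_contributions(W, H)
rel_err = np.linalg.norm(M - W @ H) / np.linalg.norm(M)
print("relative reconstruction error:", round(float(rel_err), 4))
```

After normalization, each column of W has a mean of 1, so a value above 1 marks a sample in which that source contributes more than on average.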

Bayesian network
A Bayesian network model has been implemented to estimate TE source contributions and evaluate contamination levels [26,29,84-86]. The R package SIAR (Stable Isotope Analysis in R) can be run to calculate the isotope mixing model based on the Bayesian network. The mixing model can be expressed as the set of equations in Eq. (5):

$$ X_{ij} = \sum_{k=1}^{K} p_k (S_{jk} + c_{jk}) + \varepsilon_{ij}, \quad S_{jk} \sim N(\mu_{jk}, \omega_{jk}^2), \quad c_{jk} \sim N(\lambda_{jk}, \tau_{jk}^2), \quad \varepsilon_{ij} \sim N(0, \sigma_j^2) \quad (5) $$

where X_ij is the value of isotope j in mixture i, with i = 1, 2, 3, …, N and j = 1, 2, 3, …, J; S_jk is the value of isotope j in source k (k = 1, 2, 3, …, K); c_jk is the fractionation factor for isotope j on source k; and p_k is the proportion of source k, which is estimated by the SIAR model. S_jk is normally distributed with mean μ_jk and standard deviation ω_jk, and c_jk is normally distributed with mean λ_jk and standard deviation τ_jk. ε_ij is the residual error, representing additional unquantified variation between individual mixtures, and is normally distributed with mean 0 and standard deviation σ_j. A Monte Carlo algorithm is usually used to solve the equations of the Bayesian network.
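The way Monte Carlo sampling yields the source proportions can be illustrated with a deliberately simplified version of the mixing model. The sketch below is not the SIAR algorithm: it assumes a single isotope, a flat Dirichlet prior on the proportions, independent source and fractionation draws for each mixture (so they can be integrated out analytically), and plain importance sampling from the prior. All isotope values are hypothetical.

```python
import math
import random


def log_likelihood(p, x_obs, mu, omega, lam, tau, sigma):
    """Log-likelihood of the mixtures for given proportions p (one isotope)."""
    mean = sum(pk * (m + l) for pk, m, l in zip(p, mu, lam))
    var = sum(pk ** 2 * (w ** 2 + t ** 2) for pk, w, t in zip(p, omega, tau))
    var += sigma ** 2
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for x in x_obs
    )


def posterior_mean_proportions(x_obs, mu, omega, lam, tau, sigma,
                               n_draws=5000, seed=1):
    """Estimate the posterior mean of p by weighting prior draws."""
    rng = random.Random(seed)
    k = len(mu)
    draws, log_w = [], []
    for _ in range(n_draws):
        # A flat Dirichlet draw: normalized Gamma(1, 1) variates.
        g = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
        total = sum(g)
        p = [v / total for v in g]
        draws.append(p)
        log_w.append(log_likelihood(p, x_obs, mu, omega, lam, tau, sigma))
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    w_sum = sum(weights)
    return [
        sum(w * p[j] for w, p in zip(weights, draws)) / w_sum for j in range(k)
    ]


# Hypothetical example: one isotope value measured in five mixtures,
# two sources ("manure" near 10 per mil, "fertilizer" near 0 per mil).
x_obs = [6.8, 7.1, 7.3, 6.9, 7.0]
p_hat = posterior_mean_proportions(
    x_obs, mu=[10.0, 0.0], omega=[1.0, 1.0],
    lam=[0.0, 0.0], tau=[0.2, 0.2], sigma=0.5,
)
print([round(v, 2) for v in p_hat])
```

With more than one isotope, the log-likelihoods of the isotopes are summed, and in practice a Markov chain sampler replaces the prior-weighting step because it scales better when there are many sources.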

Decision tree
The decision tree is a kind of supervised ML that covers a series of techniques for dividing samples into categories, such as the algorithms ID3, C4.5, C5.0, and CART. The algorithms follow the same principle: the observations are divided at breakpoints on a variate. The selected variates and breakpoints form the basis of each decision, and all of the decisions together make up a tree that users can apply as a decision system for a project. In the CART algorithm, for example, the sample space is split at variate breakpoints. Different split strategies give different decision efficiency, and the Gini index is used to compare them: the algorithm calculates the Gini index for every candidate split and keeps the split with the lowest value. The advantages of the decision tree are that it is easy to carry out and easy to explain. Some researchers have introduced the method to trace the sources of nitrogen and TE contamination [38].
Methods derived from the decision tree, including random forests and boosting, are widely used in data mining [59]. However, applications to source apportionment are rare. The obstacle to implementing decision tree methods may be the difficulty of acquiring labeled data.
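The split selection described for CART can be sketched in a few lines. The example searches one variate for the breakpoint with the lowest weighted Gini index. The function names, the variate (an isotope value), and the labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (breakpoint, weighted Gini) of the best split on one variate."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no breakpoint between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical labeled samples: an isotope value and the known source.
values = [2.0, 3.0, 4.0, 9.0, 10.0, 12.0]
labels = ["fertilizer"] * 3 + ["manure"] * 3
threshold, score = best_split(values, labels)
print(threshold, score)
```

A full tree repeats this search over every variate and then recurses into the two resulting groups until a stopping rule is met.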

Artificial neural network
The artificial neural network (ANN) is recognized as a powerful supervised ML method and is applied across a wide range of engineering and research fields. ANN research is one of the most active areas of ML and has many branches. It is also the basis of deep learning, which is used in image and speech recognition.
In environmental science and geochemistry, ANN is not as popular as unsupervised ML. The most important reason is that it is a supervised data mining method: in the data preparation step the observations need to be labeled, while environmental data are usually difficult or impossible to label. Nevertheless, some studies have used ANN to predict contamination potential, and Mclean et al. reviewed the application of ANN to ambient air pollution [87]. The second obstacle is that ANN usually needs a larger amount of data than PCA, decision trees, and some other methods. A basic ANN model has an input layer, one or several hidden layers, and an output layer. The variates of one observation enter through the input layer, the output layer corresponds to the labeled data, and the hidden layers compute the mapping from input to output. The relationships between the layers are trained by feeding the model the variate data together with the labels. A trained model is then used for prediction when new input data are obtained. Once labeled data can be obtained, the ANN model is useful for predicting, discriminating, and classifying samples in geochemical and environmental engineering and research.
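The layer structure can be made concrete with a forward pass through a very small network. This sketch only shows how one observation flows from the input layer through one hidden layer to the output. The weights are arbitrary, untrained placeholders, and the three input variates and the single output (read as a probability of "contaminated") are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, params):
    """Input layer -> hidden layer (tanh) -> output layer (sigmoid)."""
    hidden = dense(x, params["W1"], params["b1"], math.tanh)
    return dense(hidden, params["W2"], params["b2"], sigmoid)


# Placeholder parameters: 3 inputs, 2 hidden neurons, 1 output.
params = {
    "W1": [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]],
    "b1": [0.0, 0.1],
    "W2": [[1.2, -0.7]],
    "b2": [0.05],
}

# One observation with three standardized variates.
output = forward([0.4, -1.1, 0.7], params)
print(output)
```

Training consists of adjusting `W1`, `b1`, `W2`, and `b2` so that the outputs match the labels of the training observations, which is why labeled data are a precondition for the method.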

Discriminant analysis
Discriminant analysis (DA) is a kind of supervised ML, because the data set is labeled in the model training step. DA is conceptually similar to principal component analysis. While PCA tries to find new axes (principal components) that capture the most variance, DA tries to find axes along which the labeled groups are best separated, so that variates and observations can be divided into groups. Before DA is applied, all of the variables are standardized to eliminate scale differences between them. The absolute discriminant weights then rank the variables by their discriminating power, i.e., the variables with large weights contribute most to differentiating the groups.
If the origin of the samples can be identified for the model training step, DA can be used to identify the source of new samples. For example, water intruding into coal mines may come from different groundwater aquifers, and each aquifer has specific geochemical characteristics and a specific hazard level. For water hazard control, discriminant models can be set up by training on labeled data, where the labeled data are water samples collected from the different aquifers. Once water intrusion happens, the water characteristics are compared with the model to identify the source of the water [27,33].
The criteria applied in discriminant analysis are mainly distance based or Bayes rule based. When the distance rule is used, the Manhattan distance from a sample to each group is calculated, and the sample is assigned to the group with the smallest distance. The distances have to be calculated pairwise, which limits the efficiency of model training and implementation. Another popular method is Bayes discrimination. A reasonable way to assign a sample to a group is to compare the conditional probabilities of the sample falling into the different categories; the class with the highest conditional probability is the final category of the sample. Theoretically, Bayes DA is more efficient and accurate than distance-based DA.
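The two criteria can be compared on a toy data set. The sketch assumes that the variates are already standardized, uses the Manhattan distance to group centroids for the distance rule, and, as a simplification, models each group with independent normal distributions per variate for the Bayes rule. The aquifer names and values are hypothetical.

```python
import math
from collections import defaultdict


def group_rows(samples, labels):
    """Collect the samples of each labeled group."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups


def classify_by_distance(x, samples, labels):
    """Assign x to the group whose centroid has the smallest Manhattan distance."""
    best, best_dist = None, float("inf")
    for y, rows in group_rows(samples, labels).items():
        centroid = [sum(col) / len(col) for col in zip(*rows)]
        dist = sum(abs(a - b) for a, b in zip(x, centroid))
        if dist < best_dist:
            best, best_dist = y, dist
    return best


def classify_by_bayes(x, samples, labels):
    """Assign x to the group with the highest posterior probability."""
    best, best_log_post = None, -float("inf")
    for y, rows in group_rows(samples, labels).items():
        log_post = math.log(len(rows) / len(samples))  # prior from group size
        for value, col in zip(x, zip(*rows)):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / (len(col) - 1)
            log_post += -0.5 * math.log(2 * math.pi * var)
            log_post -= (value - mean) ** 2 / (2 * var)
        if log_post > best_log_post:
            best, best_log_post = y, log_post
    return best


# Hypothetical standardized water chemistry from two labeled aquifers.
samples = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3],
           [-0.8, -0.5], [-1.0, -0.3], [-1.1, -0.6]]
labels = ["limestone"] * 3 + ["sandstone"] * 3

new_sample = [0.8, 0.0]
by_distance = classify_by_distance(new_sample, samples, labels)
by_bayes = classify_by_bayes(new_sample, samples, labels)
print(by_distance, by_bayes)
```

The Bayes rule also accounts for how widely each group scatters and how common it is, which the plain distance rule ignores.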

Semi-supervised machine learning
Unsupervised ML is easy to carry out but low in accuracy, robustness, reliability, and reproducibility, while the reverse holds for supervised ML. Semi-supervised ML is an option that improves on both. When labeled data are not easy to acquire and unsupervised ML has to be done first, a semi-supervised algorithm may be applied to add labels to the data while the model is being trained. Vesselinov et al. used non-negative matrix factorization for blind source separation in the first step, then used a semi-supervised clustering algorithm to predict the sources of contaminants [37]. Fatehi and Asadi used a hybrid method combining hierarchical clustering and fuzzy c-means clustering to classify soil types [58]. At present, this approach is rarely used on this topic.

Regression
Regression is easy to use, explain, and understand, and it is a big toolbox in the machine learning workshop. The most popular method is multivariate linear regression; logistic regression, lasso regression, ridge regression, and elastic net regression are also used. Regression can be combined with other machine learning techniques, such as decision trees [56] and support vector machines. However, regression has distinct shortcomings. First, the data are fitted to a linear or curved function, which may not match the real situation. Second, regression is prone to overfitting: a model that performs well on the training data may give disastrous results when applied in a real environment. Lasso, ridge, and elastic net regression are applied to address this problem. Besides these two issues, another problem may hamper the application of regression: the data are usually difficult or impossible to label. In this situation, unsupervised techniques should be used. Once labeled data are acquired, regression methods can be applied [31,56,59].
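The effect of the penalty that ridge regression adds can be shown with a single predictor. The sketch fits a straight line by least squares and, when `ridge_lambda` is greater than zero, shrinks the slope toward zero. The function name and the data (a predictor x and a response y) are hypothetical.

```python
def fit_line(x, y, ridge_lambda=0.0):
    """Fit y = slope * x + intercept; ridge_lambda > 0 penalizes the slope."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    # Ridge adds the penalty to the denominator of the least-squares slope.
    slope = sxy / (sxx + ridge_lambda)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical predictor and response values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ols_slope, ols_intercept = fit_line(x, y)
ridge_slope, ridge_intercept = fit_line(x, y, ridge_lambda=10.0)
print(round(ols_slope, 3), round(ridge_slope, 3))
```

With many correlated predictors the same shrinkage stabilizes the coefficients, and the penalty strength is usually chosen by cross-validation.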

Artificial tracers
Artificial tracers are also used to find the relationships between groundwater and surface water and between different groundwater bodies. The chemical tracers sodium chloride, eosine, uranine, and pyranine were used to analyze the relationship between springs and groundwater. A conductivity meter and a thermometer were installed to monitor electrical conductivity (EC), and a field fluorimeter was used for tracer detection [31].

Other methods
In a study from Alaska, USA, six models were set up to predict soil contamination: random forests, generalized boosted regression, elastic net regression, multivariate adaptive regression splines, a generalized linear model with stepwise selection using Akaike's information criterion, and partial least squares regression. Although the models had similar overall explanatory power, the machine learning models performed much better than the linear models in predictive accuracy and were better able to identify variables of interest and describe non-linear relationships. For understanding the mechanisms behind the fate and transport of trace element pollutants, the machine learning techniques are preferable to the linear models, and they were also less vulnerable to errors of omission [59].

Implementation of the data mining of TE source apportionment
In this chapter, the environmental media that may be contaminated by trace elements are grouped into four types: water, sediment, soil, and particles. In every case, the probable sources of trace elements are listed in order of importance. The main methods used are listed in Table 1.

Contamination sources of surface water
TEs migrate from rock and coal to water through water-rock interaction, so surface water and groundwater may be contaminated.
In Turkey, the TE sources in a large reservoir were identified. PCA showed that PC1, PC2, and PC3 comprised Co/Cr/Fe, Cu/Pb/Zn, and As/Cd, respectively. Combined with correlation analysis, the three PCs were attributed to a natural source, bedrock weathering, and bedrock weathering, respectively [25]. Another study used PCA to show that mineral pollution, nutrient pollution, and organic pollution are the major latent factors influencing the water quality of the Asi River [88].
Vandalism of pipelines may contaminate soil and water. PCA results showed that the first source, characterized by Cd, Cr, Pb, and Mn, was associated with anthropogenic inputs such as vehicular emissions. The second source, comprising Cu and Zn, was related to a natural geological origin, and Ni and V were released from natural sources together with the petroleum contamination [12].
In Ethiopia, water samples were divided into four categories by clustering analysis: a natural cluster, a mixed cluster, an agriculture cluster, and an urban cluster. In the agriculture cluster, VF1 had strong loadings on TN, NO₃⁻, salinity, Fe, NH₃, hardness, and Mn, which originate from cultivation; VF2 was associated with turbidity, Chl-a, and Cu, which may come from farming and from the excavation sites of quarrying activities. Mg and K loaded mainly on VF3 and VF4, respectively. K is mainly spread when potash fertilizer is used [24].
In the USA, a supervised ML technique, discriminant analysis, was applied together with clustering analysis to classify samples and find the spatiotemporal distribution of trace elements in surface water. The sources of salt ions (magnesium, chloride, and sodium) vary from natural sources (oceans, atmospheric deposition, weathering of common rocks, minerals and soils, and salt deposits and brines) to anthropogenic sources (landfills, wastewater and water treatment, agriculture, and application of deicing salts) [27].
A Bayesian isotope mixing model was used to estimate the proportional contributions of multiple nitrate sources in surface water in Belgium. The results showed that "manure and sewage" contributed the most, "soil N", "NO₃⁻ fertilizer", and "NH₄⁺ fertilizer and rain" made intermediate contributions, and "NO₃⁻ in precipitation" contributed the least [26].

Contamination sources of ground water
In southern India, the potential TE sources in groundwater were analyzed. It was concluded that Fe and Mn were of natural origin, while Cr, Cu, Pb, and Ni may come from mixed sources: natural sources and flow contaminated with fertilizers and pesticides. In another study, in northern India, the sources of TEs in groundwater were identified as anthropogenic inputs via agrochemical and industrial wastes (As, Cd, Co, Pb, and V), parent material from an adjacent area (U and Sr), lithogenic origin (Fe, Mn, Zn), and background-level elements (Mo and Se), respectively [36].
In Greece, Matiatos et al. [28-30] investigated surface water and groundwater by combining geochemical, isotopic, and multivariate statistical analysis, including PCA and a Bayesian isotope mixing model. According to the PCA results, EC, Na, K, Cl, and Mg originated from seawater intrusion; Fe/Zn and Ca/hardness came from water-silicate rock interaction and from the dissolution of limestone, respectively; NO₃ stood for nitrogen pollution; and SO₄²⁻ and Mn came from the dedolomitization process and increased agricultural input.
A semi-supervised ML technique was used to trace contaminants' source in the USA. Vesselinov et al. [37] proposed a contaminant source identification approach that performed decomposition of the observation mixtures based on non-negative matrix factorization (NMF) method for blind source separation (BSS), coupled with a custom semi-supervised clustering algorithm. As a result, the mixing coefficients of all the groundwater types (contaminant sources) for each observation well (samples) were obtained.
As a supervised ML technique, a decision tree was combined with isotope methods in a study to determine the nitrogen source in groundwater. The decision tree achieved a 97.5% success rate in the water quality analysis. However, concentration data alone could not identify the dominant NO₃⁻ sources of groundwater contamination. The authors suggest an integrated approach that combines the N and O isotopes of NO₃⁻ with land use and physical-chemical properties, especially in areas with specific activities [38].

Contamination by coal mine water
One of the most studied issues in surface water and groundwater contamination is acid mine drainage (AMD). AMD forms when pyrite and other sulfide minerals are oxidized and dissolved during coal and metal mining, highway construction, and other large-scale excavation [13]. In an anaerobic environment the sulfide minerals are stable, but when they are exposed to water and oxygen, and with other accelerating factors such as bacteria, they are oxidized to form sulfuric acid, accompanied by the release of trace elements to surrounding water bodies [89]. Some coal mine drainage is alkaline, and its leaching behavior and the TE composition of the leachate are different [14].
The mobility of TEs in AMD depends on several conditions: first, the occurrence and abundance of TEs in the potential AMD source; second, where and how adsorption-desorption and dissolution-precipitation take place during water-rock interaction; third, the main flow path, and the geochemistry of the rivers and lakes where the TEs may be adsorbed or released again.
If the flow path is known, the source and reaction rates of specific trace elements can be estimated by mass balance calculation. The post-dissolution behavior of TEs is controlled by the solution composition, the pH and Eh of the water, the temperature, and the contact time with mineral surfaces. For example, metal elements show little attenuation in the solid phases and have a high potential for mobility in water. The reverse behavior can be observed for the metalloid elements. As the geochemical characteristics, pH, and Eh of the water change along the flow path, the TEs may undergo very complex reaction processes, leading to the redistribution of TEs among surface water, groundwater, and the sediment in the water bodies. Therefore, identifying the source of TEs is important and challenging work.
The pH of AMD ranges from 2 to 8. In an acid environment, metal elements such as Pb, Cd, Cu, and Ni have high mobility, while some metalloid elements, such as As and Se, tend to migrate in an alkaline environment. Damaging effects of AMD have been reported in Asia [34,90-93], North America [72,94-96], Europe [83,97], and South America [47,98,99]. When AMD enters surface water bodies, the effects include biotic impacts on stream and lake organisms through direct toxicity, habitat alteration by metal precipitates, visual changes from orange or yellow staining of stream sediments, nutrient cycle disruptions, or other mechanisms, and the water often becomes unsuitable for domestic, agricultural, and industrial uses. Gammons et al. used isotope analysis to identify groundwater contamination from abandoned coal mines [71,72].
The TEs in coal migrate not only during mining and through the dumping of gangue and the deposition of dust [100], but also spread with smoke, fly ash, and bottom ash during combustion [101,102]. The trace elements As, Cu, and Se are found concentrated in fly ash, which indicates an impact on water and soil quality [103,104].

Source apportionment of water inrush in coal mines
TE source apportionment technology is used in coal mines to determine the source of water inrush [32]. In coal mines, water inrush constantly threatens production and human health and causes financial losses. Water inrushes are categorized into four sources: the Quaternary sand-gravel pore aquifer, the Dyas sandstone aquifer, the Ordovician and Carboniferous limestone aquifers, and abandoned coal mine districts. The sources show different features and need different treatment strategies. The main purpose of water inrush analysis is to find the category of the source aquifer. Huang et al. [32] proposed a technology system, the Piper-PCA-Bayes-LOOCV discrimination model, to determine water inrush types in coal mines. The Piper diagram is a geochemical technique for showing water characteristics, and in this research it was used to screen out abnormal samples. PCA was used to reduce the dimension of the sample matrix so that fewer variates stand for all of the original variates. The supervised ML model, Bayes DA, was then used to train and implement a model for water source discrimination. LOOCV stands for leave-one-out cross-validation, which validates and improves the quality of the model. Wang et al. used discriminant analysis to determine the source of water bursting in coal mines [33].
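Leave-one-out cross-validation can be sketched independently of the classifier it evaluates. In the example below a simple nearest-centroid rule stands in for the Bayes discriminant model of the cited study, which is an assumption made to keep the sketch short. The function names, aquifer labels, and standardized water chemistry values are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(samples, labels, predict):
    """Hold out each sample once, train on the rest, and count correct labels."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            hits += 1
    return hits / len(samples)


# Hypothetical standardized water chemistry from two labeled aquifers.
samples = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3],
           [-0.8, -0.5], [-1.0, -0.3], [-1.1, -0.6]]
labels = ["limestone"] * 3 + ["sandstone"] * 3

accuracy = loocv_accuracy(samples, labels, nearest_centroid_predict)
print(accuracy)
```

Because every sample is predicted by a model that never saw it, the resulting accuracy is a fairer estimate of performance on new inrush samples than the accuracy on the training data, which matters when only a few labeled samples per aquifer are available.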

TE occurrence and reaction pathway
PCA has also been used to investigate the occurrence of trace elements in rock and coal, which may be the source of TEs with contamination potential for surrounding water bodies, and their reaction pathways. Shan et al. [34] found that in the host rock of the coal seam, Se/Cd/Hg/As occurred in sulfide minerals, Be and V occurred in carbonate minerals, and Cr and Pb occurred in clay minerals, while in the coal seam, Se/Cr/Pb occurred in clay minerals and As and Hg occurred in sulfide minerals. In the host rock, Se, As, and Hg migrated through the dissolution of sulfide minerals, and Cr migrated through the transformation of clay minerals. In the coal seam, Se, Pb, and Cr migrated through the transformation of clay minerals, and As and Hg migrated through the dissolution of sulfide minerals. Pumure et al. [105] investigated the occurrence of selenium and arsenic in coal by two-step PCA, finding that ultrasound-leachable selenium concentrations were associated with 14 Å d-spacing phyllosilicate clays (chlorite, montmorillonite, and vermiculite, all 2:1 layered clays), whilst ultrasound-leachable arsenic concentrations were closely related to the concentration of illite, another 2:1 phyllosilicate clay.

TE apportionment in sediment
Surface water and sediment compose a reaction system: trace elements in water may be adsorbed by sediment while trace elements in sediment are released into water. The sediment may therefore be either a sink or a source of trace elements. Because of the complex reaction pathways, environmental persistence and biological accumulation, trace elements in aquatic environments have drawn special attention [106].
In southwest China, lake sediment was analyzed [42]. The PCA result showed that Cd/Hg/Pb/Zn and As (loaded on PC2 and PC3, respectively) were mainly from non-point anthropogenic sources, especially atmospheric emissions from non-ferrous metal smelting and coal consumption [107].
In Jiangxi, China, river sediment was investigated. Because metal mines are being excavated in the study area, metal contamination was found. The PCA result suggested that PC1, PC2 and PC3 were probably associated with coal and gold mining, copper mining and refining, and Zn/Pb deposits and agricultural activities, respectively. PC1 was highly loaded with Ni, Hg and Cr, PC2 with Cu, and PC3 with Cd, Pb, Zn and As [39]. A study of lake sediment in Jiangxi, China showed that Cr, Pb and Zn may be derived from both lithogenic sources and human activities, such as atmospheric and river inflow transport, whereas Cu and Cd may be contributed mainly by anthropogenic sources, such as mining activities and fertilizer application [43]. In northern China, Cd and Zn were found to originate from agricultural sources, while Cu, Cr and Ni were of natural origin [40]. In northwest China, Zn, Cu, Ni and As were highly loaded on PC1 and were of natural origin, while Cr and Cd/Pb/Hg were highly loaded on PC2 and PC3, which were attributed to township and silicon chemical factories, and to agriculture and urban construction, respectively [41].
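To make the terms "loading" and "explained variance" used in these studies concrete, the following is a minimal sketch of PCA on standardized concentrations. It is written in plain Python with power iteration and deflation rather than a statistics package, which is a choice made here for self-containment and not the procedure of the cited studies. The concentration table is invented for illustration, and all function names are hypothetical.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit sample standard deviation."""
    columns = []
    for col in zip(*rows):
        m = sum(col) / len(col)
        s = math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
        columns.append([(v - m) / s for v in col])
    return [list(r) for r in zip(*columns)]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(a, iters=1000):
    """Largest eigenvalue and its unit eigenvector by power iteration."""
    p = len(a)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * a[i][j] * v[j] for i in range(p) for j in range(p))
    return lam, v


def pca(rows, n_components=2):
    """Return explained-variance ratios and loadings of the first components."""
    a = correlation_matrix(standardize(rows))
    p = len(a)
    explained, loadings = [], []
    for _ in range(n_components):
        lam, v = leading_eigenpair(a)
        explained.append(lam / p)  # the trace of a correlation matrix equals p
        loadings.append([x * math.sqrt(lam) for x in v])
        # Deflation: remove the component just found before the next search.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return explained, loadings


# Invented sediment concentrations (mg/kg); columns are Cd, Pb, Zn, Cr.
elements = ["Cd", "Pb", "Zn", "Cr"]
sediment = [[0.2, 20, 60, 50], [0.4, 35, 90, 42], [0.6, 48, 130, 55],
            [0.8, 66, 160, 47], [1.0, 80, 205, 52], [1.2, 95, 230, 45]]

explained, loadings = pca(sediment, n_components=2)
for k, (ratio, load) in enumerate(zip(explained, loadings), start=1):
    pairs = ", ".join(f"{e}={l:+.2f}" for e, l in zip(elements, load))
    print(f"PC{k}: {ratio:.1%} of variance; loadings {pairs}")
```

In this invented table Cd, Pb and Zn vary together and therefore load on the same component, which is the pattern that the reviewed studies interpret as a common source; Cr varies independently and falls on a separate component.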

TE apportionment in soil
Research has focused on distinguishing natural from anthropogenic [108,109] sources of TE contaminants in soil. The major natural contribution of heavy metals comes from the parent materials from which the soils developed. Anthropogenic sources of heavy metals in soils include acid mine drainage [110], agricultural and industrial waste discharges [111], atmospheric deposition [112], and fertilizers and pesticides [113], which contribute significantly to the levels of heavy elements in soils. PCA is now a popular technique to trace the sources of TEs in soil, and enrichment factors are then usually used to verify the sources. To investigate both the sources and the spatial distribution of TEs, a combination of geochemical, multivariate and geostatistical analysis is used. GIS and multivariate analysis of soil contamination have been reviewed in detail [114]. Understanding the sources of heavy metals in surface soils is imperative for deciding on strategies to protect food safety, human health and ecosystem sustainability.
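The enrichment factor (EF) mentioned above is commonly computed as the ratio of the element-to-reference-element concentration in the sample to the same ratio in the background material. A minimal sketch follows, assuming Al as the reference element; the concentrations are invented, and the cut-off of 2 used to flag enrichment is an illustrative assumption rather than a value taken from the cited studies.

```python
def enrichment_factor(c_sample, ref_sample, c_background, ref_background):
    """EF = (C / C_ref) in the sample divided by (C / C_ref) in the background."""
    return (c_sample / ref_sample) / (c_background / ref_background)


# Hypothetical concentrations in mg/kg; Al is the assumed reference element.
background = {"Pb": 20.0, "Zn": 70.0, "Al": 80000.0}
topsoil = {"Pb": 60.0, "Zn": 84.0, "Al": 60000.0}

ENRICHED_ABOVE = 2.0  # illustrative cut-off (an assumption, not from the text)

ef = {
    element: enrichment_factor(topsoil[element], topsoil["Al"],
                               background[element], background["Al"])
    for element in ("Pb", "Zn")
}
flagged = [element for element, value in ef.items() if value > ENRICHED_ABOVE]

for element, value in ef.items():
    print(f"EF({element}) = {value:.2f}")
print("Flagged as enriched:", flagged)
```

An element that PCA assigns to an anthropogenic component would be expected to show an EF well above 1, whereas an element of natural origin should stay close to 1; agreement between the two lines of evidence is what makes the verification meaningful.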

TE apportionment in agricultural soil
In Greece, two main sources explained 74.8% of the total variance in an analysis of agricultural soil contamination. Cu, Pb, Zn, As, Cd, P and K were identified as anthropogenically influenced, while Ni, Co, Fe and Cr were recognized as being of natural origin [55].
In soil samples from hills in India, four principal components were determined by PCA, with high loadings of Mn/Zn, Cr, Ni and Co, respectively. PC1 and PC2 were inferred to be natural sources, PC3 represented fossil fuel burning, which contributed most of the Ni in soil, and PC4 represented irrigation sources [50].
In Shanxi province, China, soil samples were collected over an area of 25,000 km². The PCA result showed that Co, Cr, Cu, Mn, Ni, Se, V and Zn originated mainly from natural sources, while Cd and Pb were heavily affected by anthropogenic pollution. When combined with spatial data, Pb was strongly associated with road traffic and Cd was linked to industrial activities. An ANN model was applied to predict Pb and Cd concentrations in the following years [49].
In Beijing, China, the TEs could be represented by two PCs. Co, Ni, Cr and V were probably released from the parent material of the soil, while Cd, Cu and Zn were primarily from agricultural cultivation. Hg may have originated from coal combustion or mineral fertilizers [45].
In Jilin, China, Al, Fe, Mn, Zn, Cr, Ni, As, Cu and Pb were highly loaded on PC1, which accounted for 55.16% of the total variance and was identified as a natural source. N, OC, P, Cd and Hg had high loadings on PC2, which accounted for 16.75% of the total variance. PC2 was also regarded as a natural source, and the close relationship between Hg and Cd was attributed to their high organic affinity [46].
In Iran, a semi-supervised ML method was applied. The study area was located around a Cu-Au porphyry deposit, so the soil geochemistry may be associated with the deposit. Initially, eleven soil geochemical variables were selected using hierarchical clustering analysis and expert knowledge. Then the semi-supervised fuzzy c-means clustering method (ssFCM) was used to separate multivariate soil geochemical anomalies from background to guide further drilling [58].
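The clustering idea behind this approach can be illustrated with plain fuzzy c-means, the unsupervised core that ssFCM extends. The sketch below is not the ssFCM formulation of [58]: it uses no label information, the two-variable samples are invented, and the initial centers are simply taken from the first and last sample. It shows how each sample receives a graded membership in a "background" and an "anomaly" cluster instead of a hard assignment.

```python
import math


def memberships(points, centers, m):
    """u[k][i]: degree to which point k belongs to cluster i (each row sums to 1)."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if min(d) == 0.0:
            # The point coincides with a center: assign it fully to that center.
            row = [1.0 if di == 0.0 else 0.0 for di in d]
            total = sum(row)
            row = [r / total for r in row]
        else:
            row = [1.0 / sum((di / dj) ** exponent for dj in d) for di in d]
        u.append(row)
    return u


def fuzzy_c_means(points, centers, m=2.0, iters=100, tol=1e-6):
    """Alternate membership and center updates until the centers stop moving."""
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for i in range(len(centers)):
            w = [row[i] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wk * x[j] for wk, x in zip(w, points)) / total
                 for j in range(dim)])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)


# Invented, already scaled values of two geochemical variables per soil sample.
samples = [[1.0, 1.2], [0.8, 0.9], [1.3, 1.1], [0.9, 1.4],   # background-like
           [5.0, 6.1], [5.4, 5.8], [4.8, 6.3]]               # anomaly-like

centers, u = fuzzy_c_means(samples, centers=[samples[0], samples[-1]])
for x, row in zip(samples, u):
    print(x, [round(v, 3) for v in row])
```

In a semi-supervised variant, samples whose class is already known (for example from drilling) would additionally constrain the memberships; that step is deliberately left out here.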

TE apportionment in urban and industrial top soil
The impact of an ore deposit on the surrounding soil was investigated in Beijing, China. Frequent mining activities produce dust and acidic drainage from the oxides and mill tailings. Cu, Co, Zn, Cd and V were found to come from mixed sources; Be, Pb and As came from natural sources and were mainly affected by weathering and erosion of the parent rock material; Cr, Ni and Ba were contributed by fine particles and by industrial and mining activities; transportation and soil minerals were the common sources of Cr and Ni; Hg came from anthropogenic sources, mainly mining, beneficiation, smelting and acid mine drainage waste [44].
Large urban and industrial areas along the coastline in Italy were investigated. Pb and Zn contamination was attributed to heavy traffic and alloy production, and some Cr and Ni contamination was traced to releases from the tannery industry. Zn and Pb enrichment was mainly related to the large volcanic complexes, and Cr and Ni were enriched in the siliciclastic deposits [56]. Another large-scale investigation was carried out in the Yangtze River Delta, China, where industrialization leads to a high contamination potential. Four PCs were selected to represent the sources of trace elements. As, Hg, Cu, Cd, Mo, S and Zn were recognized as being of traffic origin, and Fe and Mn were from natural sources. PC3, comprising Cr and Ni, pointed to pyrometallurgical processes, especially non-ferrous metal industries. PC4, composed of Pb and Se, was inferred to originate from coal combustion [52]. Another study in this area showed that Cr, Ni, Co, Mn, Cu and As came mainly from natural sources, while Cd/Hg and Pb/Zn originated from anthropogenic sources in two different groups [57].
In Shaanxi province, China, roadway dust was analyzed. Zn, Mn, Ni and As had the highest variance; Zn was released mainly from the wear of vehicle tires and the corrosion of galvanized automobile parts. Cu, Pb and Cr were inferred to be of traffic origin. The third source was dominated by Co and Ni, which were released from a machine manufacturing plant [48]. In a study from Alaska, USA, soil contamination was found to be controlled mainly by distance to road, traffic category (highway or refuge road), whether the road was paved, land cover category, traffic loading and other parameters, in descending order [59].
In Pakistan, four factors were identified by factor analysis to trace surface soil contamination at an industrial site. VF1 contained Ni, Cr, Zn and Cu, which originated from vehicular emissions and industrial activities. VF2, composed of Pb, Cd and Co, originated from anthropogenic activities such as automobiles. Fe and Mn, representing VF3 and VF4, were of natural origin [54]. In Armenia, Ti, V, Mn, Fe and Co were identified as being of natural origin. PC2 included two distinct negative groups, As/Hg and Pb/Zn. PC3 was composed mainly of Cu and Mo and was recognized as being of anthropogenic origin [51]. In Spain, the first source, including Pb, Tl, As, Sb and Cd, pointed to coal combustion. The second source was traffic-related air pollution, which released Cr, Ni, Be, V and Co. The third and fourth factors explained a very low proportion of the variance and were considered secondary. These factors included Cu, Zn and Sn, which showed mixed behavior with regard to the first two factors [53].
In Nigeria, soil and water may be contaminated because of pipeline vandalism. The first three PCs gave a cumulative contribution of 78.68% of the total variance. The PCA result for soil was similar to that for water [12].

TE apportionment in soil to trace past human activities
In Spain, a core was obtained from a peat bog to evaluate trace element distribution and the impact of human activity over the past 8000 years. Al, Ba, Cr, Ga, K, Na, Sr, Ti, V, Y and Zr were found to be lithogenic and supplied by atmospheric soil dust, while Cd, Pb, P and Zn were recognized as anthropogenic, related especially to ore exploration. The depth of the samples recorded the degree of human influence over time [47]. The EF profile showed that atmospheric Pb, Zn and Hg peaked in the Roman age and in the nineteenth to twentieth centuries.
From recent research, it is concluded that Pb is an important element of anthropogenic origin. Some reports argue that vehicle emissions, brake linings, coal burning, plastics and rubber production, and car barriers are potential sources of Pb. Meanwhile, Cu might come from vehicle brake linings and Zn from vehicle tires [51]. In agricultural soil, Cu is usually accumulated through the application of commercial fertilizers and Cu-based pesticides and fungicides [115]. Cd is related to the use of phosphate fertilizers [116]. Mineral fertilizers and animal manure may raise Zn and Cu levels in soil.

TE apportionment in air and particles
TEs usually spread through the air as particles. Particles smaller than 10 μm are called PM10, while PM2.5 stands for particles smaller than 2.5 μm. The haze-day rate has clearly increased in the past decade, and several researchers have reported the characteristics, composition and sources of PM10 and PM2.5 in some Chinese cities [17,64,117]. PM10 and PM2.5 in megacities around the world have also been investigated [118-121]. TEs such as Cu, Zn, Pb, Cd and Cr associated with PM2.5 and PM10 have deleterious effects on human health. Based on epidemiological and toxicological studies [122,123], the TEs in ambient PM2.5 influence the severity of allergic respiratory disease and pose a high cancer risk to exposed populations [81,82].
In source apportionment analysis, six main types of source of ambient particulate matter are commonly found: soil dust and sea salt (the two natural sources), domestic fuel burning, industry, traffic, and unspecified sources of human origin. Soil dust refers to particles lifted from bare soils by local wind. Sea salt particles can be found close to the coast. Domestic fuel burning includes coal, gas fuel and wood for cooking and heating. Traffic is a complex source of PM and TEs: the burning of fuel and diesel and the wear of brake linings, clutches and tires are all sources of TEs [124]. The "unspecified sources of human origin" category mainly includes secondary particles formed from unspecified pollution sources of human origin. The reasons for the second outbreak of PM2.5 are complex and include some chemical reactions. The reasons given for fog and haze are: (1) the accumulation of fog and haze particles, namely the results of combustion, automobile exhaust and dust; (2) the upward momentum of the fog and haze particles, from the upward movement of hot air and from wireless communication, namely the movement of the electromagnetic wave network; (3) no sustained wind. These three conditions are all indispensable for persistent fog and haze weather, and the second outbreak of PM2.5 results from the three conditions together. In a review of source apportionment studies, 87% of the records reported a traffic origin, 66% an industry origin, 45% domestic fuel burning, 100% unspecified sources of human origin, and 89% natural sources [125].
In southern China, TE sources in PM2.5 were identified using PCA, with three sampling sites analyzed separately. At the YL sampling site, PC1, with Zn and Pb, was identified as the traffic source, while Cu and Cd, highly loaded on PC2, originated from coal and other fossil fuels. At the KF sampling site, Zn, Cd and Pb were from vehicle emissions and the abrasion of automobile tires, and Cu, highly loaded on PC2, was a tracer of fossil and other fuel combustion. At the YH site, Cu, Cd and Pb were associated with domestic fossil fuel burning, and Zn represented brake and tire wear and other transportation processes [17]. During the Chinese Spring Festival, haze may occur more frequently and the PM2.5 level can be elevated. In Henan province, China, sources of PM2.5 were identified using PCA, and a multivariate linear regression model was set up to predict PM2.5 concentrations. The most important source was burning, including coal combustion, fireworks, firecrackers and biomass burning, which contributed 61% of the PM2.5. The second, third and fourth sources were vehicle emissions (27%), soil (8%) and road dust (3.28%), respectively [63].
In Costa Rica, eight important sources of PM2.5, PM10 and TEs were identified using PMF. The first three sources were vehicle exhaust, containing EC, OC, SO₄²⁻ and a certain amount of Fe; residual oil combustion, contributing Ni and V; and fresh sea salt, including Cl⁻, Na and Mg. The others were crustal or dust aerosols, organic carbon and sulfate, secondary sulfate, secondary nitrate, and heavy fuels [66].
In Canada, a 6-year investigation examined PM2.5 levels, sources and potential human risk. Secondary organic aerosol, secondary nitrate, secondary sulfate, transportation and biomass burning, in descending order of importance, together contributed more than 85% of PM2.5 [61].
The impact of coal mining on air pollution, including suspended particles, was investigated in India. The PCA and CA results suggested that PC1 represented PM10, SO₂, PM2.5, PM1.0, Ni and Cu, which originated from coal burning and active mine fires. PC2 was highly loaded with NO₂, Pb, Cd and Cr, and originated from crude oil combustion and vehicular emissions. PC3, including Fe and Mn, was mainly contributed by the earth's crust, wind-blown soil and coal fly ash [60].
In the USA, brake wear, tire wear, fertilized soil and resuspended soil were found, using positive matrix factorization, to be important sources of copper, zinc, phosphorus and silicon, respectively. Zn was strongly related to tire wear but also contributed to the Pb-rich feature and to soil. The Pb-rich contributions were highly correlated with tire wear, and elevated P contributions occurred in the fertilized soil as well as in the Pb-rich feature [68].
Brinkman et al. compared the performance of PCA and PMF for source apportionment of particulate matter. Most of the PCA factors were easily distinguishable from one another by sharp differences in the factor loadings, and for many individual compounds the variance was explained primarily by a single factor. In contrast, the factors obtained with PMF were more difficult to distinguish because anticipated tracer compounds for certain sources appeared in multiple PMF factors [65].
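The factorization that PMF performs can be illustrated with a simplified relative, non-negative matrix factorization (NMF) with multiplicative updates. The sketch below decomposes a samples-by-species matrix X into non-negative source contributions G and source profiles F so that X is approximately GF. It is not PMF itself: PMF additionally weights each residual by the measurement uncertainty, which is omitted here, and the two "true" sources, their profiles and all numbers are invented for illustration.

```python
import random


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(r) for r in zip(*a)]


def squared_error(x, g, f):
    """Sum of squared residuals between x and the product g f."""
    gf = matmul(g, f)
    return sum((x[i][j] - gf[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, k, iters=500, seed=0, eps=1e-12):
    """Unweighted NMF by multiplicative updates; returns g, f and both errors."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    g = [[rng.uniform(0.1, 1.1) for _ in range(k)] for _ in range(n)]
    f = [[rng.uniform(0.1, 1.1) for _ in range(m)] for _ in range(k)]
    initial_error = squared_error(x, g, f)
    for _ in range(iters):
        # Update the profiles f with the contributions g held fixed.
        gt = transpose(g)
        num, den = matmul(gt, x), matmul(matmul(gt, g), f)
        f = [[f[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update the contributions g with the profiles f held fixed.
        ft = transpose(f)
        num, den = matmul(x, ft), matmul(g, matmul(f, ft))
        g = [[g[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return g, f, initial_error, squared_error(x, g, f)


# Invented source profiles (columns: Cu, Zn, Pb, Si) and sample contributions.
true_profiles = [[8.0, 10.0, 1.0, 0.5],    # hypothetical traffic-like source
                 [0.5, 1.0, 0.5, 12.0]]    # hypothetical soil-like source
true_contrib = [[1.0, 0.2], [0.8, 0.5], [0.3, 1.0],
                [0.1, 1.2], [0.6, 0.6], [1.2, 0.1]]
X = matmul(true_contrib, true_profiles)

G, F, initial_error, final_error = nmf(X, k=2)
print(f"squared error: {initial_error:.1f} -> {final_error:.4f}")
```

Because every entry of G and F stays non-negative, a tracer species is free to appear in more than one factor, which is consistent with the observation above that PMF factors can be harder to tell apart than PCA factors.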

Summary of methods used to identify sources of contaminants
Applications of multivariate analysis and data mining, combined with geochemical methods, to the source apportionment of trace element contaminants in environmental media are increasing with the development of big data, machine learning and computer software. Four environmental media are discussed: water, sediment, soil and particles.
Four types of application can be identified for water contamination: tracing the source of TEs, evaluating the quality of surface water and ground water, identifying water intrusion in coal mines and other scenarios, and finding and quantifying the relationship between different water bodies, such as the surface water and ground water relationship. Sediment and water compose a reaction system: the sediment can be a source or a sink of trace elements in water, or first a sink and later a source. The system should therefore be analyzed as a whole. There are fewer studies on sediment than on water, and most articles on this topic are from China.
The most used method for the source apportionment of TEs in water and sediment is principal component analysis (PCA), probably because it is easy to use and to interpret. With the development of data mining algorithms and computing software, applying PCA has become easier and more efficient. A similar method, factor analysis (FA), is also used. PCA and FA are both unsupervised ML methods. Although less accurate than supervised methods, they are suitable for this topic.
Supervised ML methods are also used in this area, though much less than unsupervised ML methods, and their scope of application is different. For example, a decision tree was used to classify sample types [38]. Discriminant analysis is also a supervised method, and it has been applied especially to identifying the water inrush source in coal mines, where labeled data can be obtained [32,33]. In this sense, other supervised machine learning methods, such as ANN, support vector machine (SVM) and decision tree, could also be used to identify the water inrush source. ANN usually needs more data than SVM and decision tree to reach good prediction quality.
To combine the advantages of unsupervised and supervised machine learning, semi-supervised methods have been introduced and implemented on this topic [52]. At present, related studies are rare, but promising reports are expected.
From the reviewed reports, it is concluded that surface water is more contaminated by major elements and by nitrogen, which may stand for organic contamination, whereas ground water is more contaminated by trace elements such as As, Cr, Cd, Pb, Hg and Se. Surface water may be impacted by civil and industrial activities, and ground water by water-rock interaction in the rock seam. The most important anthropogenic sources of trace elements in ground water are coal and metal mines. These mines contain high contents of toxic trace elements, which are stable in an anaerobic environment; once the rock and coal are excavated, the trace elements are released. The lower trace element contamination of the investigated surface water is not proof that the surface water is safe. Studies of river and lake sediment have found high contents of trace elements of anthropogenic origin, including As, Cr, Cd, Pb, Hg, Se, Cu, Zn and Ni. The sediment and water in rivers and lakes compose a reactive system in which the sediment is both a sink and a source of trace elements. The sources and reaction pathways in this system therefore need thorough research and regulation.
Studies on soil can roughly be divided into two large groups, agricultural soil and urban/industrial soil. Unsurprisingly, the primary TE source in agricultural soil is natural, and the primary TE source in urban/industrial soil is anthropogenic. With the growing impact of industrial development on the environment, studies on urban/industrial soil are increasing and are carried out at wider scales. Studies on particles have become popular because air is easily impacted by human activities, and in some countries haze has become an important problem. As the main component of haze, suspended particles in air, especially PM2.5, are important media for transporting and spreading contaminants. Studies on PM2.5 are being carried out all around the world, in both developed and developing countries.
The most popular methods are PCA, FA and positive matrix factorization (PMF). PMF is frequently used in particle studies, but less in water and soil studies. In the study of soil and particles, semi-supervised ML techniques have also been implemented [38]. Some studies combine a machine learning method with a geochemical method, or combine two or more machine learning methods. For example, Petrik et al. [56] combined factor analysis and multivariate linear regression. ANN is a tool to predict air quality from historical data, and related studies are abundant; Mclean et al. have thoroughly reviewed this topic [87]. However, very little work has been carried out to identify TE sources using ANN.
From the reviewed reports, anthropogenic trace elements in soil and particles mainly include the metals Zn, Mn, Ni and Cu and some other toxic elements such as As, Cd, Cr, Hg and Pb. Soil and particles have similar TE compositions, and more metal TEs are found in soil and particles than in ground water.

Conclusions
Data mining techniques are widely used to trace the sources of TEs in water and solid matrices.
In the water environment, ground water and surface water are connected through the flow network. Human activities, especially mining, change the natural reaction environment and release trace elements into ground water and surface water. The sediment in rivers and lakes may then be contaminated and become a source that releases trace elements into water again. Soil, dust and air particles may be influenced by a variety of human activities, especially in urban and industrial areas. The TE composition differs with the type of environmental medium, human activities, land use type, etc. However, some elements of environmental concern, As, Pb, Cd, Hg and Cr, are frequently found in water, sediment, soil and particles, showing high mobility and contaminating potential.
Unsupervised machine learning algorithms, including principal component analysis, factor analysis and positive matrix factorization, are the most used. In water, PCA is used to find the contamination source of trace elements, and sometimes the source of water inrush in coal mines. In air particle studies, PCA and PMF are frequently used to trace the sources of PM2.5 and PM10 and of the TEs they carry. Some supervised algorithms, including discriminant analysis, Bayesian network, artificial neural network and decision tree, are used when the data are labeled.
Generally speaking, the most popular methods for apportioning the sources of trace element contaminants are unsupervised ML techniques, especially principal component analysis. In a wider scope, supervised ML is a large toolbox that is frequently applied in science and society, and it usually gives more accurate and robust results than unsupervised ML. In trace element apportionment, however, the use of supervised ML is constrained because the sources are usually not known in advance. Nevertheless, several directions are promising. First, supervised ML methods could be implemented more frequently. Unsupervised ML methods are used in the first step; with further research, once some sources have been identified, supervised ML methods can be used. For example, water inrush is a threat in some Chinese coal mines. Because the potential sources of inrush can be identified, a supervised ML method, discriminant analysis, is used to determine the water type of the inrush, and the corresponding technologies to deal with the threat or accident can then be implemented. Other supervised ML methods could also be used at this stage, although discriminant analysis has been the most used so far. Second, semi-supervised ML may be implemented more. This is a family of relatively novel techniques that can be used once more data are obtained in an investigation, and in a sense it combines unsupervised and supervised techniques in one implementation. Third, machine learning methods could be combined with geochemical methods. Each of the two technique systems has its advantages and disadvantages, and their combination may achieve the greatest effect and efficiency.