Machine Learning in Volcanology: A Review

A volcano is a complex system, and characterizing its state at any given time is not an easy task. Monitoring data can be used to estimate the probability of unrest and/or of an eruption. These data can be seismic, magnetic, electromagnetic, deformation, infrasonic, thermal or geochemical or, in an ideal situation, a combination of them. Merging data of different origins is a non-trivial task, and often even extracting a few relevant and information-rich parameters from a homogeneous time series is already challenging. The key to the characterization of volcanic regimes is in fact a process of data reduction that should produce a relatively small vector of features. The next step is the interpretation of the resulting features, through the recognition of similar vectors and, for example, their association with a given state of the volcano. This can in turn highlight possible precursors of unrest and eruptions. This final step can benefit from the application of machine learning techniques, which are able to process large volumes of data efficiently. Other applications of machine learning in volcanology include the analysis and classification of geological, geochemical and petrological “static” data to infer, for example, the possible source and mechanism of observed deposits, and the analysis of satellite imagery to quickly classify vast regions that are difficult to investigate on the ground or, again, to detect changes that could indicate unrest. Moreover, the use of machine learning is gaining importance in other areas of volcanology, not only for monitoring purposes but also for differentiating particular geochemical patterns, addressing stratigraphic issues, differentiating morphological patterns of volcanic edifices, or assessing the spatial distribution of volcanoes.
Machine learning is helpful in discriminating magmatic complexes, in distinguishing the tectonic settings of volcanic rocks and in evaluating correlations of volcanic units, and it is particularly helpful in tephrochronology. In this chapter we review the relevant methods and results published in the last decades using machine learning in volcanology, both with respect to the choice of the optimal feature vectors and to their subsequent classification, taking into account both unsupervised and supervised approaches.


Introduction
Pyroclastic density currents, debris flow avalanches, lahars and ash falls can dramatically affect the lives of people living close to volcanoes, and other volcanic products such as lava flows can severely damage property and infrastructure. Several volcanoes lie close to highly populated areas, and the economic impact of their eruptions could be very high. Stochastic forecasts of volcanic eruptions are difficult [1,2], but deterministic forecasts (i.e., specifying when, where and how an eruption will occur) are even harder. Many volcanoes are monitored by observatories that try to estimate at least the probability of the different hazardous volcanic events [3]. Different time series can be monitored and hopefully used for forecasting, including seismic data [4], geomagnetic and electromagnetic data [5], geochemical data [6], deformation data [7], infrasonic data [8], gas data [9], and thermal data from satellite [10] and from the ground [11]. A multiparametric approach is advisable whenever possible. For instance, at Merapi volcano, seismic, satellite radar, ground geodetic and geochemical data were efficiently integrated to study the major 2010 eruption [12]; a multiparametric approach is also essential to understand shallow processes such as the ones seen at geothermal systems, e.g., Dallol in Ethiopia [13]. Although many time series may be available, seismic data always remain at the heart of any monitoring system, and should always include the analysis of continuous volcanic tremor [14]; tremor has in fact great potential [15] due to its persistence and memory [1,2] and its sensitivity to external triggering such as regional tectonic events [16] or Earth tides [17]. Moreover, its time evolution can be indicative of variations in other parameters, such as gas flux [18]. Other information-rich time series can be built by looking at the time evolution of the number of the different discrete volcano-seismic events that can be recorded on a volcano.
These include volcano-tectonic (VT) earthquakes, rockfall events, long-period (LP) and very-long-period (VLP) events, explosions, etc. Counting the overall number of events is not enough: one has to detect them and classify them, because they are linked to different processes, as detailed below. For this reason it is important to generate automatically different time series for each type of volcano-seismic event.
VT events can be described as "normal" earthquakes which take place in a volcanic environment and can indicate magma movement [19,20]. LP events have great potential for forecasting [21]. Their debated interpretation involves the repeated expansion and compression of sub-horizontal cracks filled with steam or other ash-laden gas [22], stick-slip magma motion [23], fluid-driven flow [24], eddy shedding, turbulent slug flow, soda bottle analogues [25], deformation acceleration of solidified domes [26] and slow ruptures [27]. Explosion quakes are generated by sudden magma, ash, and gas extrusion in an explosive event, often associated with VLP events [28]. Many papers also describe and count "tremor episodes" (TRE events), usually associated with magma degassing [20]. However, any volcano showing activity produces a continuous "tremor" whose detectability depends only on the sensitivity of the seismic instrumentation [29,30]. The class "TRE" would therefore be better defined as "tremor episode that exceeds the detection limits". Of course, at volcanoes we can also record natural but non-volcanic seismic signals such as distant tectonic earthquakes, distant explosions, etc., and also anthropogenic signals, e.g., due to industries, ground vehicles, helicopters used for monitoring, etc.
Most volcano observatories rely on manual classification and counting of such seismic events, which suffers from human subjectivity and can become unfeasible during unrest or a seismic crisis [31,32]. For this reason, manual classification should be replaced by automated processing, and this is where machine learning (ML) comes into play. The same reasoning applies of course also to the automated processing of other monitoring time series, such as deformation, gas and water geochemistry, etc. Moreover, ML in volcanology is not restricted to monitoring active volcanoes but has also proved useful when dealing with other large datasets. Examples include correlating volcanic units in general (e.g., [33]), tephra (e.g., [34,35]) and ignimbrites (e.g., [36]), a task which may become very difficult especially when many deposits of similar ages and geochemical and [...] Factorization (NMF) [43], Singular Value Decomposition [44], Principal Component Analysis (PCA) [45] and Auto-encoders [46]. Linear Discriminant Analysis (LDA) [47] uses the training samples to estimate the between-class and within-class scatter matrices, and then employs the Fisher criterion to obtain the projection matrix for feature extraction (or feature reduction).
In supervised learning, the dataset is a collection of example pairs of the type (data, label), $\{(x_i, y_i)\}_{i=1}^{N}$. Each element $x_i$ is called a feature vector and has a companion label $y_i$. In the supervised learning approach the dataset is used to derive a model that takes a feature vector as input and outputs a label that should describe it. For example, the feature vector of volcano-seismic data could contain several amplitude-based, spectral-based, shape-based or dynamical parameters, and the label to be assigned could be one of those described above, i.e., VT, LP, VLP. In a volcanic geochemical example, feature vectors could contain major element weight percentages, and labels the corresponding rock type. The reliability of the labels is often the most critical issue in setting up a supervised ML classification scheme. Labels should therefore be assigned carefully by experts. In general, it is much better to have relatively few training events with reliable labels than to have many more labeled examples that are less reliable.
In unsupervised learning, the dataset is a collection of examples without any labeling, i.e., containing only the data $\{x_i\}_{i=1}^{N}$. As in the previous case, each $x_i$ is a feature vector, and the goal is to create a model that maps a feature vector $x$ into a value (or another vector) that can help solve a problem. Typical examples are all the clustering procedures, where the output is the cluster number to which each datum belongs. The choice of the best features to use is a difficult one, and several techniques of Unsupervised Feature Selection have been proposed, with the capability of identifying and selecting relevant features in unlabeled data [48]. Unsupervised outlier detection methods [49] can also be used, where the output indicates whether a given feature vector is likely to describe a "normal" or an "anomalous" member of the dataset.
The semi-supervised learning approach stands somewhere in the middle: the dataset contains both labeled (usually a few) and unlabeled (usually many more) feature vectors. The basic idea is similar to supervised learning, but with the possibility of also exploiting the (many more) unlabeled examples in the training phase.
In reinforcement learning, the machine is "embedded" in an environment, whose state is again described by a feature vector. In each state the machine can execute actions, which produce different rewards and can cause an environmental state transition. The goal in this case is to learn a policy, i.e., a function or model that takes the feature vector as input and outputs an optimal action to execute in that state. The action is optimal if it maximizes the expected average reward. Reinforcement learning can also be seen as a behavioral learning model: the algorithm receives feedback from the data analysis, guiding the user towards the best outcome. The main point is that the system is not trained with a sample dataset but learns through trial and error. A sequence of successful decisions therefore results in that process being reinforced, because it best solves the problem at hand. Problems that can be tackled with this approach are the ones where decision making is sequential and the goal is long-term, such as game playing, robotics, resource management, or logistics. Time is therefore explicitly used here, contrary to other approaches, in which data items are in most cases analyzed one by one without taking into account the order in which they arrive.
In some domains (and volcanology is a good example) training data are scarce. In this case we can profit from knowledge acquired in another domain using techniques known as Transfer Learning (TL) [50]. The basic idea is to train a model in one domain with abundant data (the original domain) and then use it as a pretrained model in a different domain with less data (the target domain). A subsequent fine-tuning phase then uses the domain-specific data available in the target domain. This approach was applied for instance at Volcán de Fuego de Colima (Mexico) [51], Mount St. Helens (USA) and Bezymianny (Russia) [52].
DOI: http://dx.doi.org/10.5772/intechopen.94217
Among the programming languages most used for implementing ML techniques we can cite Python [53], R [54], Java [55], JavaScript [56], Julia [57] and Scala [58]. Many dedicated, open source libraries are available for each of them, and many computer codes, some of them specialized for volcanic and geophysical data, can be found in open access repositories such as GitHub [59].
CA (Figure 2a) is an unsupervised learning approach aimed at grouping similar data while separating different ones, where similarity is measured quantitatively using a distance function in the space of feature vectors. Clustering algorithms can be divided into hierarchical and non-hierarchical. In the former a tree-like structure is built to represent the relations between clusters, while in the latter new clusters are formed by merging or splitting existing ones without following a tree-like structure, simply grouping the data in order to maximize or minimize some evaluation criterion. CA comprises a vast class of algorithms, including e.g., K-means, K-medians, Mean-shift, DBSCAN, Expectation-Maximization (EM), Clustering using Gaussian Mixture Models (GMM), Agglomerative Hierarchical, Affinity Propagation, Spectral Clustering, Ward, Birch, etc. Most of these methods are described and implemented in the open-source Python package scikit-learn [73]. The use of six different unsupervised, clustering-based methods to classify volcano seismic events was explored at Cotopaxi Volcano [32]. One of the most difficult issues is the choice of the number of clusters into which the data should be divided; in most cases this number has to be fixed a priori, before running the code. Several techniques exist to help with this choice, such as elbow, silhouette, gap statistics, heuristics, etc. Many of them are described and included in the R package NbClust [74]. Problems arise when the estimates that they provide contradict each other.
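As an illustration of how the number of clusters can be chosen, the following minimal sketch runs a plain K-means for several candidate values of k and keeps the one with the highest mean silhouette. It is not the procedure of any of the cited works: the synthetic two-dimensional feature vectors, the function names and the deterministic farthest-point initialization (used here only to make the example reproducible) are all assumptions made for this example.

```python
import numpy as np

def farthest_point_init(X, k):
    """Deterministic initialization (an assumption of this sketch): start from X[0],
    then repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2)
        centroids.append(X[d.min(axis=1).argmax()])
    return np.array(centroids)

def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        # Assign every feature vector to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def mean_silhouette(X, labels):
    """Average silhouette width; values close to 1 indicate compact, well separated clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = D[i, same].sum() / (same.sum() - 1)  # mean distance to own cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def choose_k(X, k_values):
    """Candidate k with the highest mean silhouette."""
    return max(k_values, key=lambda k: mean_silhouette(X, kmeans(X, k)[0]))

# Synthetic feature vectors: three well separated groups of 40 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((40, 2)) for c in centers])
best_k = choose_k(X, range(2, 7))
labels, _ = kmeans(X, best_k)
print(best_k, round(mean_silhouette(X, labels), 2))
```

With real data the silhouette, elbow and gap criteria may disagree, as noted above, so a single index should not be trusted blindly.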
Another approach to unsupervised classification is SOM (Figure 2b) or Kohonen maps [75,76], a type of ANN trained to produce a low dimensional, usually 2D, discretized representation of the feature vector space. The training is based on competitive and collaborative learning, using a neighborhood function to preserve the topological properties of the input.
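The competitive and collaborative steps can be sketched as follows. This is a minimal illustration with NumPy, not the implementation used in the cited studies: the grid size, the linear decay of learning rate and neighborhood width, the Gaussian neighborhood function and the synthetic three-component feature vectors are all assumptions of this example.

```python
import numpy as np

def best_matching_unit(W, x):
    """Competitive step: grid index (row, col) of the node whose weights are closest to x."""
    d = np.linalg.norm(W - x, axis=-1)
    return np.unravel_index(d.argmin(), d.shape)

def train_som(X, rows=5, cols=5, n_iter=3000, lr0=0.5, lr1=0.05,
              sigma0=2.0, sigma1=0.5, seed=0):
    """Train a rectangular SOM; returns the weights with shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    # Random initial weights inside the bounding box of the data.
    W = rng.uniform(X.min(axis=0), X.max(axis=0), size=(rows, cols, X.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / max(n_iter - 1, 1)
        lr = lr0 + frac * (lr1 - lr0)              # learning rate decays linearly
        sigma = sigma0 + frac * (sigma1 - sigma0)  # so does the neighborhood width
        x = X[rng.integers(len(X))]
        bmu = best_matching_unit(W, x)
        # Collaborative step: nodes close to the winner on the grid are updated too,
        # weighted by a Gaussian neighborhood function.
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        W += lr * h[..., None] * (x - W)
    return W

def quantization_error(W, X):
    """Mean distance between each feature vector and its best matching unit."""
    flat = W.reshape(-1, W.shape[-1])
    d = np.linalg.norm(X[:, None, :] - flat[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

# Synthetic feature vectors: two groups in a three-dimensional feature space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 3)),
               rng.normal(5.0, 0.3, size=(100, 3))])
W_init = train_som(X, n_iter=0)   # untrained map (same seed, same initial weights)
W = train_som(X)
print(quantization_error(W_init, X), quantization_error(W, X))
```

After training, each node can be treated as a prototype, and the prototypes can themselves be grouped by a second clustering step.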
A very common type of ANN, often used for supervised classification, is MLP, which consists of at least three layers of nodes (Figure 2c): an input layer, (at least) one hidden layer and an output layer. Nodes use nonlinear activation functions and are trained through the backpropagation mechanism. If the number of hidden layers of an ANN becomes very high, we speak of Deep Neural Networks (DNN), which are also used mainly in a supervised fashion. Among DNN, CNN (Figure 2d) contain at least some convolutional layers, which convolve their inputs using multiplication or another dot product. The activation function in the case of CNN is commonly a rectified linear unit (ReLU), and there are also pooling layers, fully connected layers and normalization layers.
An RNN is a type of ANN with a feedback loop (Figure 3a), in which neuron outputs can also be used as neuron inputs in the same layer, allowing some information to be maintained during the training process. Long Short Term Memory networks (LSTM) are a subset of RNN, capable of learning long-term dependencies [77] and of better retaining information over long periods of time. RNN can be used for both supervised and unsupervised learning. Logistic regression (LR) (Figure 3b) is a supervised generalized linear model, i.e., the dependence of the classification (probability) on the features is linear [78]. In order to avoid the problems linked to the high dimensionality of the data, techniques such as the Least Absolute Shrinkage and Selection Operator (LASSO) can be applied to reduce the number of dimensions of the feature vectors which are input to LR [79]. SVM (Figure 3c) constitute a supervised statistical learning framework [80], most commonly used as a non-probabilistic binary classifier. Examples are seen as points in space, and the aim is to separate categories by a gap that is as wide as possible. Unknown samples are then assigned to a category based on the side of the gap on which they fall. In order to perform a non-linear classification, data are mapped into high-dimensional feature spaces using suitable kernel functions.

Figure 3. Schematic illustration of some of the ML techniques described in the text. (a) Recurrent neural network; (b) logistic regression; (c) support vector machine; (d) random forest; (e) hidden Markov model.
Sparse Multinomial Logistic Regression (SMLR) is a class of supervised methods for learning sparse classifiers that incorporate weighted sums of basis functions with sparsity-promoting priors, which encourage the weight estimates to be either significantly large or exactly zero [81]. The sparsity concept is similar to the one underlying Non-negative Matrix Factorization (NMF) [82]. The sparsity-promoting priors result in automatic feature selection, which helps to mitigate the so-called "curse of dimensionality". Sparsity in the kernel basis functions and automatic feature selection can thus be achieved at the same time [83]. SMLR methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. There are fast algorithms for SMLR that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in high-dimensional feature spaces.
A Decision Tree (DT) is an acyclic graph. At each branching node, a specific feature $x_i$ is examined. The left or right branch is followed depending on the value of $x_i$ in relation to a given threshold. A class is assigned to each datum when a leaf node is reached. As usual, a DT can be learned from labeled data, using different strategies. In the DT class we can mention Best First Decision Tree (BFT), Functional Tree (FT), J48 Decision Tree (J48DT), Naïve Bayes Tree (NBT) and Reduced Error Pruning Trees (REPT). Ensemble learning techniques such as Random SubSpace (RSS) can be used to combine the results of the different trees [84].
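The traversal described above can be sketched in a few lines. The tree below is built by hand and is purely illustrative: the two features (a dominant frequency in Hz and a duration in seconds), the thresholds and the assignment of labels to leaves are hypothetical and are not taken from any of the cited works.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    """Branching node (feature, threshold, left, right) or leaf (label only)."""
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None

def predict(node: Node, x: Sequence[float]) -> str:
    """Follow the left branch when x[feature] <= threshold, otherwise the right one,
    until a leaf is reached."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

# Hypothetical tree: feature 0 = dominant frequency (Hz), feature 1 = duration (s).
tree = Node(
    feature=0, threshold=5.0,
    left=Node(feature=1, threshold=60.0,
              left=Node(label="LP"),
              right=Node(label="TRE")),
    right=Node(label="VT"),
)

print(predict(tree, [2.0, 20.0]))   # low frequency, short duration
print(predict(tree, [2.0, 300.0]))  # low frequency, long duration
print(predict(tree, [9.0, 15.0]))   # high frequency
```

In practice the features, thresholds and structure are learned from labeled data rather than written by hand.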
The Boosting concept, a kind of ensemble meta-algorithm mostly (but not only) associated with supervised learning, uses the original training data to iteratively create multiple models with a weak learner. Each model differs from the previous one, as the weak learners try to "fix" the errors made by previous models. An ensemble model then combines the results of the different weak models. Bootstrap aggregating, also known by the contracted name Bagging, consists instead of creating many "almost-copies" of the training data (each copy is slightly different from the others), applying a weak learner to each copy and finally combining the results. A popular and effective algorithm based on bagging is Random Forest (RF). Random Forest (Figure 3d) differs from standard bagging in just one way: at each learning step, a random subset of the features is chosen; this helps to minimize the correlation between trees, as correlated predictors are not efficient in improving classification accuracy. Particular care has to be taken in choosing the number of trees and the size of the random feature subsets.
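The two ingredients of Random Forest, bootstrap copies of the training data and random feature subsets, can be illustrated with the following sketch. To keep it short, each "tree" is a single-split stump, so that choosing a feature subset per tree coincides with choosing it per split; real implementations grow deeper trees. The toy data (two informative features and one noise feature, in arbitrary units), the class names and all function names are assumptions of this example.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Best single-threshold split among the given features, judged by training error.
    Returns (feature, threshold, left_label, right_label); feature is None if no split helps."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (sum(l != majority for l in labels), None, None, majority, majority)
    for f in features:
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [l for r, l in zip(rows, labels) if r[f] <= thr]
            right = [l for r, l in zip(rows, labels) if r[f] > thr]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            err = sum(l != l_lab for l in left) + sum(l != r_lab for l in right)
            if err < best[0]:
                best = (err, f, thr, l_lab, r_lab)
    return best[1:]

def stump_predict(stump, x):
    f, thr, l_lab, r_lab = stump
    if f is None:
        return l_lab
    return l_lab if x[f] <= thr else r_lab

def fit_forest(rows, labels, n_trees=25, n_sub=2, seed=0):
    rng = random.Random(seed)
    n, n_feat = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap "almost-copy" of the data
        feats = rng.sample(range(n_feat), n_sub)     # random subset of the features
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def forest_predict(forest, x):
    """Majority vote over all trees."""
    return Counter(stump_predict(s, x) for s in forest).most_common(1)[0][0]

# Toy data: features 0 and 1 separate the classes, feature 2 is pure noise.
noise = random.Random(1)
rows = [[i / 20, (19 - i) / 20, noise.random()] for i in range(20)] + \
       [[9 + i / 20, 9 + (19 - i) / 20, noise.random()] for i in range(20)]
labels = ["LP"] * 20 + ["VT"] * 20
forest = fit_forest(rows, labels)
print(forest_predict(forest, [0.3, 0.2, 0.9]), forest_predict(forest, [9.5, 9.1, 0.1]))
```

The number of trees (`n_trees`) and the subset size (`n_sub`) are the two parameters mentioned above that need tuning.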
A Hidden Markov Model (HMM) (Figure 3e) is a statistical model in which the system being modeled is assumed to be a Markov process. It describes a sequence of possible events for which the probability of each event depends only on the state occupied in the previous event. The states are unobservable ("hidden"), but at each state the model emits a "message" which depends probabilistically on the current state. Applications are wide in scope, from reinforcement learning to temporal pattern recognition, and the approach works well when time is important; typical fields of application are therefore speech [85], handwriting and gesture recognition, but also volcano seismology [69,86].
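Recovering the most likely sequence of hidden states from the emitted "messages" is commonly done with the Viterbi algorithm, sketched below for a discrete toy model. The two states, the two observation symbols and all probabilities are hypothetical values chosen for illustration; they do not come from any trained volcano-seismic model.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete HMM (log domain)."""
    # Best log-probability of any path ending in each state after the first observation.
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][obs]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical two-state model: background tremor versus a discrete event,
# observed through a coarse "low"/"high" amplitude symbol.
states = ["tremor", "event"]
start_p = {"tremor": 0.9, "event": 0.1}
trans_p = {"tremor": {"tremor": 0.9, "event": 0.1},
           "event": {"tremor": 0.2, "event": 0.8}}
emit_p = {"tremor": {"low": 0.9, "high": 0.1},
          "event": {"low": 0.2, "high": 0.8}}

observed = ["low", "low", "high", "high", "high", "low", "low"]
decoded = viterbi(observed, states, start_p, trans_p, emit_p)
print(decoded)
# ['tremor', 'tremor', 'event', 'event', 'event', 'tremor', 'tremor']
```

The decoded path both detects the event (where the state changes) and classifies it (which state is entered), which is why HMM can segment and label continuous data in one step.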

Applications to seismo-volcanic data
Eruptions are usually preceded by some kind of change in seismicity, making seismic data one of the key datasets in any attempt to forecast volcanic activity [4]. As we mentioned before, manual detection and classification of discrete events can be very time consuming, to the point of becoming unfeasible during a volcanic crisis. An automatic classification procedure therefore becomes highly valuable, also as a first step towards forecasting techniques such as the material Failure Forecast Method (FFM) [87,88]. Feature vectors should be built in order to provide as much information as possible about the source, minimizing e.g., path and site effects. In many cases features can be independent of any specific physical model describing a phenomenon. This allows ML to work well even when there is no scientific agreement on the generation of a given seismic signal. A good example in volcano seismology is given by the LP events. Standardizing data, making them independent of unwanted variables, is also in general a convenient approach [31]. Time-domain and spectral-based amplitudes, spectral phases, auto- and cross-correlations, and statistical and dynamical parameters have been considered as the output of data reduction procedures that can be included in feature vectors [14]. In the literature, these have included linear predictor coding for spectrograms [66], wavelet transforms [89], spectral autocorrelation functions [90], and statistical and cepstral coefficients [91]. Extracted feature vectors then become the input to one or another ML method.
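A minimal sketch of such a data reduction step is given below: one event waveform is reduced to a time-domain amplitude, two spectral parameters and one statistical shape parameter. The choice of these four features, their names and the synthetic test signal are assumptions made for illustration, not the feature set of any cited work.

```python
import numpy as np

def extract_features(trace, fs):
    """Reduce one event waveform (sampled at fs Hz) to a small feature vector.
    Assumes the trace is not constant (non-zero standard deviation)."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()                              # remove any offset
    rms = np.sqrt(np.mean(x ** 2))                # time-domain amplitude
    power = np.abs(np.fft.rfft(x)) ** 2           # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    dominant = freqs[power.argmax()]              # frequency of the spectral peak
    centroid = np.sum(freqs * power) / np.sum(power)
    # Kurtosis of the standardized trace: a shape statistic independent of event size.
    z = x / x.std()
    kurtosis = np.mean(z ** 4)
    return {
        "rms": float(rms),
        "dominant_freq": float(dominant),
        "spectral_centroid": float(centroid),
        "kurtosis": float(kurtosis),
    }

# Synthetic test signal: 10 s of a 2 Hz sine wave with amplitude 3, sampled at 100 Hz.
fs = 100.0
t = np.arange(1000) / fs
signal = 3.0 * np.sin(2.0 * np.pi * 2.0 * t)
features = extract_features(signal, fs)
scaled = extract_features(10.0 * signal, fs)      # same shape, ten times larger
print(features)
```

Only `rms` changes when the signal is rescaled; the other three features depend on shape alone, which is the kind of standardization that makes a classifier less sensitive to event size.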
CA is probably the most used class of unsupervised techniques and the applications to volcano seismology follow this general rule. Spectral clustering was applied e.g., to seismic data of Piton de la Fournaise [60]. The fact that e.g., LP seismic signals can be clustered into families indicates that the family members are very similar to each other. The existence of similar events implies similar location and similar source process, i.e., it means the presence of a source that repeats over time in an almost identical way. Clustering data after some kind of normalization forces CA algorithms to look for similar shapes, independently of size. If significant variations in amplitude are then seen within families, this can indicate that the source processes of these events are not only repeatable but also scalable in size, as observed e.g., at Soufrière Hills Volcano, Montserrat [92] or at Irazú, Costa Rica [93]. The similarity of events in the different classes can then be used to detect other events, e.g., for the purpose of stacking them and obtain more accurate phase arrivals; this was done e.g., at Kanlaon, Philippines [94]. For this purpose, an efficient open-source package is available, called Repeating Earthquake Detector in Python (REDPy) [95].
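The effect of normalization on the search for families can be sketched as follows: waveforms are demeaned and scaled to unit norm, compared through the maximum of their cross-correlation, and grouped with a simple greedy rule. This is only an illustration of the principle and is not how REDPy or the cited studies work; the synthetic waveforms, the 0.8 threshold and the greedy grouping against the first member of each family are assumptions of this example.

```python
import numpy as np

def max_norm_xcorr(a, b):
    """Maximum over all lags of the normalized cross-correlation of two waveforms.
    Demeaning and scaling to unit norm make the result independent of event size."""
    a = a - a.mean()
    b = b - b.mean()
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.correlate(a, b, mode="full").max())

def group_into_families(events, threshold=0.8):
    """Greedy grouping: an event joins the first family whose representative
    (its first member) it matches above the threshold, else it starts a new family."""
    representatives, family_ids = [], []
    for ev in events:
        for fam, rep in enumerate(representatives):
            if max_norm_xcorr(ev, rep) >= threshold:
                family_ids.append(fam)
                break
        else:
            representatives.append(ev)
            family_ids.append(len(representatives) - 1)
    return family_ids

# Two synthetic waveform shapes (decaying 1 Hz and 4 Hz oscillations), each
# occurring twice with different amplitudes.
t = np.arange(200) / 100.0
shape_a = np.sin(2.0 * np.pi * 1.0 * t) * np.exp(-t)
shape_b = np.sin(2.0 * np.pi * 4.0 * t) * np.exp(-t)
events = [shape_a, 5.0 * shape_a, shape_b, 0.5 * shape_b]

families = group_into_families(events)
amplitudes = [float(np.abs(ev).max()) for ev in events]
print(families, amplitudes)
```

Events 0 and 1 fall into the same family although their amplitudes differ by a factor of five, which is the situation interpreted above as a repeatable and scalable source.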
In volcano-seismology SOM were applied e.g., to Raoul Island, New Zealand [61]. A hierarchical clustering was applied to results of SOM tremor analysis at Ruapehu [62] and Tongariro [96] in New Zealand, using the Scilab environment. A similar combined approach was applied in Matlab to Etna volcanic tremor [97]. Several geometries of SOM were used, with rectangular or hexagonal nearest neighbors cells, planar, toroidal or spherical maps, etc. [61]. The classic ANN/MLP approach was applied e.g., to seismic data recorded at Vesuvius [66], Stromboli [98], Etna [99], while DNN architectures were applied e.g., to Volcán de Fuego, Colima [100]. The use of genetic algorithms for the optimization of the MLP configuration was proposed for the analysis of seismic data of Villarrica, Chile [101]. CNN were applied e.g., to Llaima Volcano (Chile) seismic data, comparing the results to other methods of classification [102]. RNNs were applied, together with other methods, to classify signals of Deception Island Volcano, Antarctica [68]. The architectures were trained with data recorded in 1995-2002 and models were tested on data recorded in 2016-2017, showing good generalization accuracy.
Supervised LR models have been applied in the estimation of landslide susceptibility [103] and to volcano seismic data to estimate the ending date of an eruption at Telica (Nicaragua) and Nevado del Ruiz (Colombia) [104]. SVM were applied many times to volcano seismology e.g., to classify volcanic signals recorded at Llaima, Chile [105] and Ubinas, Peru [106]. Multinomial Logistic Regression was used, together with other methods, to evaluate the feasibility of earthquake prediction using 30 years of historical data in Indonesia, also at volcanoes [107].
RF was applied to the discrimination of rockfalls and VT events recorded at Piton de la Fournaise in 2009-2011 and 2014-2015. Sixty features were used, and excellent results were obtained. However, an RF trained with 2009-2011 data did not perform well on data recorded in 2014-2015, demonstrating how difficult it is to generalize models even at the same volcano [108]. RF, together with other methods, was recently used on volcano seismic data with the specific purpose of determining when an eruption has ended [104], a problem which is far from trivial. RF was also used to derive ensemble mean decision tree predictions of sudden steam-driven eruptions at Whakaari (New Zealand) [109].
Most of the methods described so far try to classify discrete seismic events that were already extracted from the continuous stream, i.e., already characterized by a given start and end. There are therefore in general two separate phases: detection and classification [106]. Continuous HMM, on the other hand, are able to process continuous data and can therefore extract and classify in a single, potentially real-time, step. HMM are finite-state machines and model sequential patterns where the direction of time is essential information. This is typical of (volcano) seismic data: for instance, P waves always arrive before S waves. HMM-based volcanic seismic data classifiers have therefore been used by many authors [87,110-113]. HMM are also used routinely in some volcano observatories, e.g., at Colima and Popocatepetl in Mexico [71]. Etna seismic data were processed by HMM applied to characters generated by the Symbolic Aggregate approXimation (SAX), which maps seismic data into symbols of a given alphabet [114]. HMM can also be combined with standardization procedures such as Empirical Mode Decomposition (EMD) when classifying volcano seismic data [31].
Another characteristic common to many of the applications published in the literature is the fact that feature vectors are extracted from data recorded at a single station. There are relatively few attempts to build multi-station classification schemes. At Piton de la Fournaise a system based on RF was implemented [115]. At the same volcano, a multi-station approach was used to classify tremor measurements and identify fundamental frequencies of the tremor associated to different eruptive behavior [60]. A scalable multi-station, multi-channel classifier, using also the empirical mode decomposition (EMD) first proposed by [31] was applied to Ubinas volcano (Peru). The principal component analysis is used to reduce the dimensionality of the feature vector and a supervised classification is carried out using various methods, with SVM obtaining the best performance [116]. Of course, with a multi-station approach particular care has to be taken in order to build a system which is robust with respect to the loss of one or more seismic stations due to volcanic activity or technical failures.
Open source software and open access papers are fortunately becoming more and more common. If we consider the processing and classification of volcano seismic data, several tools are now available for free download and use, especially within the Python environment. Among the most popular, we can cite ObsPy [117] and MSNoise [118], with which researchers and observatories can easily process large quantities of continuous seismic data. Once these tools have produced suitable feature vectors, we can look for open source software to implement the different ML approaches described in this contribution. Many generic ML libraries are available e.g., on GitHub [59], but very few are dedicated specifically to the classification of volcano seismic data. Among these, we can cite the recent package Python Interface for the Classification of Seismic Signals (PICOSS) [119]. It is a graphical, modular open source software for detection, segmentation and classification of seismic data. Modules are independent and adaptable. The classification is currently based on two modules that use Frequency Index analysis [120] or a multi-volcano pre-trained neural network, in a transfer learning fashion [52]. The concept of a multi-volcano recognizer is also at the core of the EU-funded VULCAN.ears project [31,121]. The aim is to build an automatic Volcano Seismic Recognition (VSR) system, conceptually supervised (as it is based on HMM) but practically unsupervised, because once it is trained on a number of volcanoes with labeled sample data, it can be used on volcanoes without any previous data in an unsupervised fashion. The idea is in fact to build robust models trained on many datasets recorded by different teams on different volcanoes, and to integrate these models into the routinely used monitoring system of any volcano observatory.
Also in this case, the open source software is made freely available; this includes a command interface called PyVERSO [122] based on HTK, a speech recognition HMM toolkit [123], a graphical interface called geoStudio and a script called liveVSR, able to process real-time data downloaded from any online seismic data server [124], together with some pre-trained ML models [125].
As we mentioned before, in order to train supervised models for classifying seismic events, few events with reliable labels are better than many unreliably labeled examples. Just to give a rough idea, 20 labeled events per class is a good starting point, but a minimum of 50 labeled events per class is recommended. Labelling discrete events is enough for many methods, but for approaches like HMM, where the concept is to run the classification on continuous data, it is essential to have a sufficient number of continuously labeled time periods, in order to "show" the classifier enough examples of transition from tremor to a discrete event, and then back to tremor. It is important to have many examples also of "garbage" events, i.e., events we are not interested in, so that the classifier can recognize and discard them. Finally, it is advisable to have a wide variability of events within each given class rather than having many very similar events. There is not yet an agreement on a single file format to store these labels. As speech recognition is much older and more developed than seismic recognition, it is suggested to adopt standard labelling formats of that domain, i.e., the transcription MLF files, which are normal text files that include for each event the start time, the end time and of course the label. These files can be created manually with a simple text editor, or by using a program with a GUI, such as geoStudio [124] or Seismo_volcanalysis [126]. Other graphical software packages like SWARM [127] use other formats to store the labels, such as CSV, but it is always possible to build scripts that convert the resulting label files into MLF format, which remains the recommended one.
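A conversion script of the kind mentioned above can be sketched as follows. The sketch assumes the Master Label File conventions of the HTK toolkit (a `#!MLF!#` header, a quoted label-file pattern, one `start end label` line per segment with times in units of 100 ns, and a closing period); the CSV column names `start_s`, `end_s` and `label` and the function name are hypothetical, and the exact layout expected by a given tool should be checked against its documentation.

```python
import csv
import io

HTK_UNITS_PER_SECOND = 10_000_000  # assumed HTK convention: label times in units of 100 ns

def csv_to_mlf(csv_text, lab_name):
    """Convert a CSV of labeled time periods (hypothetical columns start_s, end_s, label,
    with times in seconds) into the text of a single-entry MLF transcription."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = ["#!MLF!#", f'"*/{lab_name}.lab"']
    for row in reader:
        start = round(float(row["start_s"]) * HTK_UNITS_PER_SECOND)
        end = round(float(row["end_s"]) * HTK_UNITS_PER_SECOND)
        lines.append(f"{start} {end} {row['label']}")
    lines.append(".")  # a single period closes the entry
    return "\n".join(lines) + "\n"

# A continuously labeled period: tremor, then an LP event, then tremor again.
csv_text = (
    "start_s,end_s,label\n"
    "0.0,12.5,TRE\n"
    "12.5,18.0,LP\n"
    "18.0,40.0,TRE\n"
)
mlf_text = csv_to_mlf(csv_text, "station_day")
print(mlf_text)
```

Labeling the tremor before and after the event, as in this example, is what provides the classifier with the transitions discussed above.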

Applications of machine learning to geochemical data
ML applications to geochemical data of volcanoes have been increasing in recent years, although most of them are limited to the use of cluster analysis. CA has been used, for example, to identify and quantify mixing processes using the chemistry of minerals [128], to study volcanic aquifers [129,130] or to differentiate magmatic systems (e.g., [131]). Platforms used to carry out these analyses include the Statistical Toolbox in Matlab [132] and the R platform [54]; some geochemical software written for the latter, such as GCDkit [33], includes CA. In most ML analyses of geochemical samples it is common to use whole rock major elements and selected trace elements; some applications also include isotopic ratios. Many ML applications to geochemical data use more than one technique, frequently combining both unsupervised and supervised approaches.
A combination of SVM, RF and SMLR approaches was used by [37] to account for variations in the geochemical composition of rocks from eight different tectonic settings. The authors note that SVM, as used by [34] to discriminate tectonic settings, is a powerful tool. The RF approach is shown to have the advantage, with respect to SVM, of providing the importance of each feature during discrimination. The weakness of RF for tectonic setting discrimination is that an evaluation based only on a majority vote of multiple decision trees often makes a quantitative geochemical interpretation of these elements and isotopic ratios difficult. The authors suggest that the best quantitative discriminant is SMLR, as it assigns to each sample a probability of belonging to a given group (a tectonic setting in this case) while still making it possible to identify the importance of each feature. This tool is a notable step forward in discriminating the geochemical signature of the different tectonic settings, which is commonly assessed with binary or ternary diagrams e.g., [133,134]; these are useful with many samples but cannot differentiate a tectonic setting where a complex evolution of magmas has occurred. In the last decade, multielement variation diagrams e.g., [135], Decision Trees e.g., [136] and LDA e.g., [137] were also proposed to accurately assign a tectonic setting based on rock geochemistry. Based on rock sample geochemistry, [37] show that a set of 17 elements and isotopic ratios is needed to clearly identify the tectonic setting. Two new discriminant functions were recently proposed to discriminate the tectonic settings of mid-ocean ridge (MOR) and oceanic plateau (OP). Ten datasets (original concentrations as well as isometric log-ratio transformed variables; all 10 major elements as well as all 10 major and 6 trace elements) were used to evaluate the quality of discrimination from LDA and canonical analysis [138].
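To make the isometric log-ratio (ilr) transformation of compositional data more concrete, the sketch below computes one common choice of ilr coordinates, the so-called pivot coordinates, in which each part is compared with the geometric mean of the parts that follow it. The choice of basis and the four-part example composition are assumptions made for illustration only; the basis adopted in [138] may differ.

```python
import math


def closure(parts):
    """Rescale a composition so that its parts sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def ilr(parts):
    """Pivot-coordinate ilr transform of a strictly positive composition.

    A composition with D parts is mapped to D - 1 real coordinates.
    """
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    logs = [math.log(p) for p in closure(parts)]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]
        k = len(rest)  # number of parts that follow part i
        # log of (part i / geometric mean of the following parts)
        balance = logs[i] - sum(rest) / k
        coords.append(math.sqrt(k / (k + 1)) * balance)
    return coords


# Hypothetical four-part composition in wt% (illustrative values only).
sample = [50.0, 15.0, 10.0, 25.0]
coords = ilr(sample)
print(coords)
```

Since only ratios between parts enter the computation, the coordinates do not change if the same composition is expressed in different units (e.g., wt% or fractions), which is one reason why such transformed variables are used as input to LDA alongside the original concentrations.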
The software package Compositional Data Package (CoDaPack) [139] and a combination of unsupervised (CA) and supervised (LDA) learning approaches were used by [36] to identify compositional variations of ignimbrite magmas in the Central Andes, with the aim of using these methods as a tool for ignimbrite correlation. They used the Statistica software [140] for both CA and LDA.
Correlating tephra and identifying their volcanic sources is a very difficult task, especially in areas where several volcanoes had explosive eruptions within a relatively short period of time. This is particularly challenging when the volcanoes have similar geochemical and petrographic compositions. Electron microprobe analyses of glass compositions and whole-rock geochemical analyses are frequently used to make these correlations. However, correlations may not be very accurate when using only geochemical tools, which may mask diagnostic variability; sometimes one of the most important advantages of ML in this regard is the speed at which correlations can be made, rather than the accuracy [35]. Other contributions, however, demonstrate that ML techniques can also make these correlations accurate. Highly accurate results of ML techniques applied to tephra correlation include those of LDA [141,142] and SVM e.g., [143]; however, SVM may fail in specific cases, and for the case study of tephra from Alaska volcanoes, the combination of ANN and RF is the best choice of ML techniques [35]. The authors use the R software [54] to apply these methods, and they underline the advantage of producing probabilistic outputs.
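One simple way in which an ensemble such as RF can yield a probabilistic output, rather than a single hard assignment, is to report the fraction of trees voting for each candidate source. The sketch below illustrates only this idea; the vote list and the source names are hypothetical, and it is not meant to reproduce the procedure of [35].

```python
from collections import Counter


def vote_probabilities(tree_votes):
    """Turn per-tree class votes into class probabilities (vote fractions)."""
    if not tree_votes:
        raise ValueError("at least one vote is required")
    counts = Counter(tree_votes)
    n_trees = len(tree_votes)
    return {label: count / n_trees for label, count in counts.items()}


def most_likely(probabilities):
    """Return the class with the highest probability (the majority vote)."""
    return max(probabilities, key=probabilities.get)


# Hypothetical votes of a 10-tree ensemble for one tephra sample.
votes = ["source_A"] * 6 + ["source_B"] * 3 + ["source_C"] * 1

probs = vote_probabilities(votes)
best = most_likely(probs)
print(best, probs)
```

Keeping the full set of fractions, instead of only the winning class, makes it possible to see when a correlation is ambiguous, for instance when two candidate sources receive similar shares of the votes.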
SOM was used as an unsupervised neural network approach to analyze geochemical data of Ischia, Vesuvius and Campi Flegrei [144]. The advantage of this method is that no prior knowledge of geochemical or petrological characteristics is needed and that it allows the use of large databases with a large number of variables. The SOM toolbox for Matlab [132] was used by [144] to perform two tests: the first was based on major elements and selected trace elements to find similar evolution processes, while the second investigated the magmatic source, so a vector containing a selection of ratios between major and trace elements was adopted. One benefit of this method is that the resulting clusters made it possible to differentiate rock samples that were otherwise distinguished comparably well only by 2D diagrams of isotopic ratios; in other words, similar results were obtained with less expensive geochemical data.
One application of ML techniques that may be extremely useful in geochemistry is the apparent possibility of predicting the concentration of unknown elements when a large number of data on other elements is known. A combination of ML techniques was used by [38] to predict Rare Earth Element (REE) concentrations in Ocean Island Basalts (OIB) using RF. They used 1283 analyses, of which 80% were used for training and the remaining 20% to validate the results. They obtained good estimates only for the Light Rare Earth Elements (LREE), suggesting that the results may be improved by using a larger set of input data for training. One possible solution may be to train not only on major elements but also on other trace elements obtained through the same analytical method as the major elements.
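As a small worked example of the 80%/20% partition of 1283 analyses, the sketch below shuffles the sample indices with a fixed seed and splits them. The rounding rule, the seed and the function name are assumptions for illustration; the exact partition used in [38] is not specified here.

```python
import random


def train_validation_split(n_samples, train_fraction=0.8, seed=0):
    """Randomly split sample indices into training and validation subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = train_validation_split(1283)
print(len(train_idx), len(val_idx))  # prints: 1026 257
```

With this rounding, 1026 analyses are available for training and 257 for validation, which gives a sense of how limited the training set is compared with the thousands of examples that are common in other fields.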
The origin of the volcanoes in Northeast China, analyzed by RF and DNN using the full chemical compositional data, was associated with the Pacific slab, which subducts at Japan, reaches ~600 km depth under eastern China, and extends horizontally as far as Mongolia. The boundary between volcanoes triggered by fluids and melts from the slab and those not related to it was located at the westernmost edge of the deeply buried Pacific slab [145].
As highlighted by [143], ML methods need to be integrated with other techniques such as fieldwork, petrographic observations and classic geochemical studies to obtain a clearer picture of the investigated problem. While in other fields it is relatively easy (and cheap) to acquire large amounts of data (hundreds or more), this is not the case for geochemistry. However, we underline that the application of ML techniques to the geochemistry of volcanic rocks does need a minimum dataset size. In the literature, a set of 250 analyses is described as a sufficiently large amount of data; as usual, one can try using whatever data are available (often even fewer than 50), but thousands of examples would definitely improve the results.

Applications of machine learning to other volcanological data
ML appears more and more often in the volcanology literature, and its fields of application now span other sub-disciplines as well.
Mount Erebus in Antarctica has a persistent lava lake showing Strombolian activity, but its location is very remote. Automatic methods to detect these explosions are therefore highly needed. A CNN was trained using infrared images captured from the crater rim and "labeled" with the help of accompanying seismic data, which were no longer used during the subsequent automatic detection [146].
Clast morphology is also a fundamental tool for studies concerning volcanic textures. In particular, texture analysis of clasts provides information about genesis, transport and depositional processes. Here, ML has yet to be fully developed, but the application of preprocessing techniques such as the Radon transform can be a first step towards an efficient definition of feature vectors to be used for classification, as shown e.g., at Colima volcano [147].
The Museum of Mineralogy, Petrography and Volcanology of the University of Catania implemented a communication system based on the visitor's personal experience to learn by playing. There is a web application called I-PETER: Interactive Platform to Experience Tours and Education on the Rocks. This platform includes a labeled dataset of images of rocks and minerals to be used also for petrological investigations based on ML [148].
Satellite remote sensing technology is increasingly used for monitoring the surface of the Earth in general, and volcanoes in particular, especially in areas where ground monitoring is scarce or completely missing. For instance, in Latin America 202 out of 319 Holocene volcanoes did not have seismic, deformation or gas monitoring in 2013 [7]. A complex-valued CNN was proposed to extract areas with land shapes similar to given samples in interferometric synthetic aperture radar (InSAR), a technique widely applied in volcano monitoring. An application was presented grouping similar small volcanoes in Japan [149]. InSAR measurements have great potential for volcano monitoring, especially where images are freely available. ML methods can be used for the initial processing of single satellite data. Processing of potential unrest areas can then fully exploit integrated multi-disciplinary, multisatellite datasets [7]. The Copernicus Programme of the European Space Agency (ESA) and the European Union (EU) has recently contributed by producing the Sentinel-2 multispectral satellites, able to provide high resolution satellite data for disaster monitoring, as well as complementing previous satellite images like Landsat. The free access policy also promotes an increasing use of Sentinel-2 data, which is often processed by ML techniques such as SVM and RF [150]. A transfer learning strategy was applied to ground deformation in Sentinel-1 data [151] and a range of pretrained networks was tested, finding that AlexNet [152] is best suited to this task. The positive results were checked by a researcher and fed back for model updating.
Debris flow events are one of the most widespread and dangerous natural processes, not only on volcanoes but more generally in mountainous environments. A methodology was recently proposed [154] that combines the results of deterministic and heuristic/probabilistic models for susceptibility assessment. RF models are extensively used to represent the heuristic/probabilistic component of the modeling. The case study presented is the Changbai Shan volcano, China [154].
Mapping lava flows from satellite is another important remote sensing application. RF was applied to 20 individual flows and 8 groups of flows of similar age using a Landsat 8 image and a DEM of Nyamuragira (Congo) with 30 m resolution. Despite spectral similarity, lava flows of contrasting age can be well discriminated and mapped by means of image classification [155].
The hazard related to landslides at volcanoes is also significant. DNN models were proposed for landslide susceptibility assessment in Viet Nam, showing considerably better performance than other ML methods such as MLP, SVM, DT and RF [156]. The DNN approach could therefore be of interest for the landslide susceptibility mapping of active volcanoes.
Muon imaging has been successfully used by geophysicists to investigate the internal structure of volcanoes, for example at Etna (Italy) [157]. Muon imaging is essentially an inverse problem and it can profit from the application of ML techniques, such as ANN and CA [158].
Combinations of supervised and unsupervised ML techniques have also been used to map volcanoes on other planets. An ML paradigm was designed for the identification of volcanoes on Venus [159]. Other studies have used topographic data, such as DEM and associated derivatives obtained from orbital images, to detect and classify manually labeled Martian landforms including volcanoes [160]. DOI: http://dx.doi.org/10.5772/intechopen.94217

Conclusions
ML techniques will have an increasing impact on how we study and model volcanoes in all their aspects, how we monitor them and how we evaluate their hazards, both in the short and in the long term. The increasing amount of monitoring equipment installed on volcanoes provides more and more data on the one hand, but on the other often makes real-time processing unfeasible, especially when it is most needed, i.e., during unrest and eruptions. Here ML will be most useful, as it can provide the tools to sift through big data and identify subtle patterns that could indicate unrest, hopefully well before eruptions. One important issue is generalization. We must move towards the construction of ML models that can be applied to different volcanoes, for instance when previous data are not available for training specific models. The concepts of transfer learning can be important here.
The routine use of ML tools at the different volcano observatories should be promoted by providing easy installation procedures and easy integration into existing monitoring systems. Open source software should be chosen whenever possible. On the other hand, observatories should provide good open training data to ML developers, researchers and data scientists in order to improve the models in a virtuous circle. The easy availability of open access data, both from the ground and from satellites, should be exploited to build reliable training sets in the different fields of volcanology. This will allow "scientific competition" between research groups using different ML approaches and make a direct comparison of results easier, as is common in other disciplines where "standard" training datasets are available for download to everybody.