Prediction of Relative Humidity in a High Elevated Basin of Western Karakoram by Using Different Machine Learning Models

Accurate and reliable prediction of relative humidity is of great importance in all fields concerning global climate change. The current study employed Multivariate Adaptive Regression Spline (MARS) and M5 Tree (M5T) models to predict the relative humidity in the Hunza River basin, Pakistan. Both models provided the best prediction for the input scenario S6 (RHt-1, RHt-2, RHt-3, Tt-1, Tt-2, Tt-3). The statistical analysis showed that the MARS model predicted relative humidity better than M5T at all meteorological stations, especially at Ziarat, followed by Khunjerab and Naltar. During testing of the MARS model, the values of root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) were 5.98%, 5.43%, and 0.808 for Khunjerab; 6.58%, 5.08%, and 0.806 for Naltar; and 5.86%, 4.97%, and 0.815 for Ziarat. During testing of the M5T model, the corresponding values were 6.14%, 5.56%, and 0.772 for Khunjerab; 6.19%, 5.58%, and 0.762 for Naltar; and 6.08%, 5.46%, and 0.783 for Ziarat. Both models performed slightly better in training than in testing. The current study encourages future research in high-altitude basins on the prediction of other meteorological variables using machine learning tools.


Introduction
The relative humidity is defined as the amount of water vapor in the air in comparison with the full saturation [1,2]. Being the important indicator of precipitation forecasting, its prediction plays a significant part in improving the accuracy of weather forecasting [3]. The relative humidity changes with respect to change in saturated vapor pressure which further depends on wind speed, solar radiation, pressure, temperature, and moisture content in the air [1]. The relative humidity is a function of temperature and is regarded as a sensitive parameter in the field of science [4]. Relative humidity plays a vital role in plant growth, agricultural and industrial production and in the prevention and control of air pollution [5]; economic stability of a region, water systems and also in managing renewable and solar energy systems [1,6], weather and climate [7,8]. Moreover, it has also an impact on ozone concentration and adaptive thermal comfort [9]. Keeping in view the importance of relative humidity, the research on its prediction is increasingly important [7].
The relative humidity is an important aspect of the hydrological phase [8] and has a role in alpine hydrology, especially, in a cold and dry climate; any change in temperature and humidity causes larger variations in the ablation of glaciers [10]. The warm environment glaciers are subjected to be influenced more by the change in relative humidity. Few other studies e.g. [11][12][13] also observed that tropical glaciers are sensitive to subtle changes in relative humidity, precipitation, and cloudiness. Relative humidity and clouds play an important role in the energy balance of glaciers by controlling the number of outgoing longwave radiation. Moreover, relative humidity and wind speed influence the turbulent latent heat flux which supplies all energy for sublimation and thus they indirectly control the equilibrium line altitude (ELA) [14]. Another study conducted by [15] observed that relative humidity has an effect on evaporation and there is an inverse relation between them. Evaporation further controls the water balance of closed lakes in hilly areas and evapotranspiration, especially, in irrigated agricultural areas.
Although relative humidity is an important component of hydrology, meteorology, and climate, only a few studies are available on its prediction. A study conducted by [1] used artificial neural networks (ANNs) and genetic expression programming (GEP) models to predict relative humidity as a function of three meteorological variables (wind speed, temperature, and pressure) at two Californian gauging stations. They observed that both models can successfully predict relative humidity one year into the future. Another study by [5] predicted relative humidity with time series models, namely Extreme Gradient Boosting (XGBoost), Seasonal Auto-Regressive Integrated Moving Average (SARIMA), and Holt-Winters (HW). XGBoost was found to be the most accurate because of its robust capability to resist overfitting. The study conducted by [3] found that an autoregressive integrated moving average (ARIMA) model performs better than the Long Short-Term Memory (LSTM) network for the prediction of relative humidity. On the contrary, [8] observed that the LSTM network is capable of predicting complex univariate relative humidity time series with strong non-stationarity. In addition, Least Square Support Vector Machine (LSSVM) and Adaptive Network-Based Fuzzy Inference System (ANFIS) models were used by [2] to predict relative humidity in terms of dry bulb temperature and wet bulb depression, and were found satisfactory.
Another study conducted by [16] proposed four ANN models to predict the relative humidity and temperature in a swine livestock warehouse located in Puerto Gaitan-Meta. They observed that the models are suitable for the prediction of humidity in barns not equipped with humidity sensors. [17] used an improved backpropagation (BP) neural network to predict indoor relative humidity and temperature every 10 min and 6-72 hours in advance, based on a cloud database in Chongqing, China. Both the temperature and humidity predictions correlated strongly with the observed data. Similarly, [18] used a BP neural network to predict the one-day-ahead mean air temperature and relative humidity of a greenhouse located in the subhumid subtropical regions of India. The results showed that the BP neural network model provided the best prediction of inside temperature and relative humidity. However, [19] used daily minimum air temperature (Tn) downscaled from the INMCM4 general circulation model (GCM) to predict relative humidity for climate change studies, but the predictions were poor in a few months, especially March, July, August, and October. Moreover, [20] proposed a Functional Link Neural Network (FLNN), which comprises a single layer of tunable weights trained with the Modified Cuckoo Search (MCS) algorithm, for the prediction of daily temperature and relative humidity. It was observed that the FLNN produced less prediction error when trained with MCS. Further, [21] attempted to predict relative humidity and temperature at different locations inside a tobacco dryer by using a fitting ANN model. Another study performed by [22] also used different ANN models to successfully forecast indoor relative humidity and temperature in an education building in Izmir, Turkey.
To date, no attempt has been made to predict relative humidity in an alpine catchment where data are scarce. The current study is unique because it uses two machine learning models, MARS and M5T, to predict the relative humidity in the Hunza basin (a glaciated basin), Pakistan. The MARS model was selected because it requires a short training process and can model complex nonlinear processes without strong model assumptions, as compared to ANN models [23,24], whereas the M5T model was selected because of its small computation cost and ease in handling large data sets as compared to support vector machine (SVM) and ANN [25,26]. In previous studies, these models were mostly used for the prediction of runoff in poorly gauged basins. A study conducted by [27] suggested that the MARS method is capable of short-term runoff forecasting in mountainous watersheds, whereas MARS was successfully used by [28] for the prediction of streamflows with inadequate data input in a mountainous catchment. Similarly, the M5T model was found useful in the prediction of streamflows of several tributaries by [29], and it was observed that predictions are good in rainless periods. Another study conducted by [30] found the M5T algorithm reliable in the prediction of streamflows. Several other studies also encouraged researchers to use MARS and M5T models for the prediction of runoff, e.g. [31][32][33][34][35][36][37]. Apart from runoff prediction, MARS and M5T models were also used for the prediction of evapotranspiration (ET) and pan evaporation (Ep). A study conducted by [38] compared the performance of M5T and MARS with calibrated Hargreaves-Samani (CHS), MLP, and Stephens-Stewart (SS) models and observed that MARS performed better in the prediction of Ep. Another study conducted by [39] found that the M5T model outperformed the Ritchie equation for the prediction of ET. Similarly, [40] successfully predicted reference evapotranspiration by using M5T and ANN models.

Study area
Hunza is a glaciated sub-catchment of the Upper Indus Basin (UIB) and is located in the western Karakoram Himalayan region of Pakistan (Figure 1). The basin lies within the extent of 74°02′-75°48′ E and 35°54′-37°05′ N and encompasses a catchment area of 13,671 km².
The elevation of the basin ranged from 1391 to 7850 m. About 20% catchment area of the basin is covered by glaciers [41] and there are 110 glacial lakes in the basin [42]. It is the main tributary of the Indus Basin Irrigation System (IBIS) and it contributes about 12% of UIB streamflows upstream of Tarbela dam [43]. The climate of the Hunza basin is arid to semi-arid and is normally categorized by two seasons, October to March as winter and April to September as summer. The weather conditions vary within the basin. At low altitudes, weather is hot whereas at high altitudes winters are cold and there are extensive variations in temperature extremes [44]. The mean total annual precipitation varies with respect to altitude; low altitude station such as Naltar (2858 m) receives more precipitation i.e. 660 mm as compared to high altitude station Khunjerab (4730 m) which receives 165 mm of precipitation. The meteorological station installed in between Naltar and Khunjerab (i.e. Ziarat, 3669 m) receives 292 mm of precipitation [45,46].
The temporal variations in meteorological variables of Khunjerab station (using data of 1995-2009) are displayed in Table 1. Table 1 shows that the maximum temperature varies from -11.1°C (January) to 11.6°C (July) whereas the minimum temperature varies from -21.3°C (January) to 1.3°C (July). The maximum relative humidity in the basin varies from 59% (March) to 91% (August) while the minimum relative humidity varies from 23% (March) to 52% (December). The daily solar radiation in the Hunza basin varies from 2563 (December) to 5148 (May) watt/m².

Topography
The catchment boundary of the Hunza basin was delineated in ArcGIS using the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM) v3. The data were acquired from the website https://lpdaac.usgs.gov/tools/data-pool/. The downloaded tiles were in GeoTIFF format, with a grid resolution of 30 m and a tile structure of 1° x 1°.

Meteorological data
There are four meteorological stations in the Hunza River basin: Hunza, Naltar, Khunjerab, and Ziarat (Table 2). The Hunza meteorological station was installed by the Pakistan Meteorological Department (PMD) and its record is available from 2007 onward, whereas the other three stations were installed and are managed by the Water and Power Development Authority (WAPDA) and their records are available from 1995 onward. The current study employed daily data of temperature, precipitation, solar radiation, and relative humidity from the Ziarat, Naltar, and Khunjerab meteorological stations. The required data for these stations, covering 1995 to 2009, were acquired from the Surface Water Hydrology Project of the Water and Power Development Authority (SWHP-WAPDA), Pakistan (Table 2).

Machine learning models
The current study has employed two machine learning models such as M5 Tree and MARS for the prediction of relative humidity at three meteorological stations of the Hunza basin. Their detailed description is given below:

M5 tree model
The M5T model was first introduced by [47]. Model trees generalize the concept of regression trees, which have constant values at their leaves [48]. The M5T model is based on a binary decision tree in which linear regression functions are placed at the terminal nodes (leaves), and through these a relationship is developed between the dependent and independent variables [49]. Model development involves two stages: in the first stage a decision tree is created by using a split criterion, whereas in the second stage the overgrown tree is pruned to design the model tree [25]. The splitting stage in the M5T model produces regression functions at the leaves instead of class labels, so continuous numerical attributes can be estimated [36]. The splitting criterion is based on the standard deviation reduction (SDR) achieved at every node. The standard deviation of the values reaching a node is treated as a measure of the error at that node, and the model calculates the expected reduction in this error from testing each attribute at that node [50,51]. The SDR in the M5T model can be calculated by the following equation [47]:
$$\mathrm{SDR} = sd(M) - \sum_{i} \frac{|M_i|}{|M|}\, sd(M_i) \qquad (1)$$
where SDR is the standard deviation reduction and sd denotes the standard deviation; M is the set of examples that reaches the node; and $M_i$ is the subset of examples that have the i-th outcome of the potential set.
Because of the splitting or branching process, data in child nodes (smaller nodes) have a smaller standard deviation than data in parent nodes (greater nodes). The division process often produces a large tree-like structure that overfits, and this issue can be resolved by pruning back the tree [52], for instance by substituting a subtree with a leaf. Pruning the overgrown tree and substituting subtrees with linear regression functions are performed in the second stage of model design. This method of producing the model tree separates the parameter space into subspaces and builds a linear regression model in each of them.
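The split criterion described above can be illustrated with a minimal sketch. It assumes a binary split on a single numeric attribute and the population standard deviation; the function names `sdr` and `best_split` and the sample values are hypothetical and are not taken from the study.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction: sd(M) minus the size-weighted sd of each subset M_i."""
    n = len(parent)
    weighted = sum(len(s) / n * pstdev(s) for s in subsets if s)
    return pstdev(parent) - weighted


def best_split(x, y):
    """Scan midpoints between sorted attribute values; return (threshold, SDR) with the largest SDR."""
    pairs = sorted(zip(x, y))
    best = (None, float("-inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical attribute values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        gain = sdr(y, [left, right])
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Illustrative values only: an attribute with two clearly separated regimes.
attribute = [1, 2, 3, 10, 11, 12]
target = [40, 42, 41, 80, 79, 81]
threshold, gain = best_split(attribute, target)
```

A full model tree would apply this search recursively over all attributes, then fit and prune linear models at the leaves; the sketch covers only the choice of a single split.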

MARS algorithm
MARS model was first developed by [53]. Its working procedure involved establishing a relationship among a set of input variables and the target-dependent that involve connections with less number of variables [54]. MARS produces flexible models to facilitate the solution space to be divided into several intervals of independent parameters whereas individual splines are fit to each interval [53]. This method is non-parametric and non-linear and it involves a forward-backward procedure to predict a continuous dependent parameter in high-dimensional data [55]. No assumptions have been made about the fundamental functional relationships between independent and dependent variables by the MARS model. In MARS, the splines are connected smoothly together to form piecewise curves which are also known as basis functions (BFs), and these form a flexible model which is capable of handling both linear and non-linear behavior [54]. Two stages are involved in setting up the MARS model which includes forward (constructing the model) and backward (a pruning procedure) stages. In the first stage (forward), to define a pair of BFs candidates, knots are placed within the range of each predictor variable. To produce a maximum reduction in sum-of-squares residual error, the model adjusts the knot and its corresponding pair of BFs in each step. This process of adding BFs lasts and generally a very complex and overfitted model is produced. However, the overfitted model is pruned by deleting the less important redundant BFs in the backward stage [54,55].
The MARS model f(X) is generally expressed by the following equation:
$$f(X) = \delta_0 + \sum_{m=1}^{M} \delta_m h_m(X) \qquad (2)$$
where $\delta_0$ and $\delta_m$ denote the coefficients, which are estimated by minimizing the sum of squared errors; $h_m(X)$ represents the spline functions; and M denotes the number of functions. The pruning stage improves the forecasting accuracy of the model, and M is determined during this phase [55].
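As a rough illustration of how a fitted MARS model is evaluated, the sketch below computes an intercept plus a weighted sum of hinge basis functions of the form max(0, x - knot) or max(0, knot - x). It assumes each basis function depends on one predictor only; the function names, coefficients, and knots are hypothetical and are not fitted values from the study.

```python
def hinge(x, knot, direction):
    """Hinge basis function: direction=+1 gives max(0, x - knot), -1 gives max(0, knot - x)."""
    return max(0.0, direction * (x - knot))


def mars_predict(row, intercept, terms):
    """Evaluate intercept + sum of coef * hinge(row[var_index], knot, direction)."""
    return intercept + sum(
        coef * hinge(row[var_index], knot, direction)
        for coef, var_index, knot, direction in terms
    )


# Hypothetical model with inputs row = (RH at t-1 in %, T at t-1 in deg C).
terms = [
    (0.6, 0, 50.0, +1),   # active when yesterday's RH is above 50%
    (-0.7, 0, 50.0, -1),  # active when yesterday's RH is below 50%
    (-0.5, 1, 5.0, +1),   # active when yesterday's temperature is above 5 deg C
]
rh_today = mars_predict((70.0, 10.0), intercept=50.0, terms=terms)
```

In an actual fit, the forward stage would choose the knots and the backward stage would drop redundant terms; only the prediction step is sketched here.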

Model setup
The current study compares the accuracy of two machine learning methods, MARS and M5T, for the prediction of daily relative humidity using different input combinations of precipitation, temperature, and relative humidity. The models were applied to three meteorological stations (Khunjerab, Naltar, and Ziarat) one by one. The flowchart of the current study is displayed in Figure 2. Ten input data combinations were developed for each meteorological station and each model to decide the best input combination for the prediction of relative humidity (RH). Initially, three combinations of preceding RH values, (i) RHt-1, (ii) RHt-1 and RHt-2, and (iii) RHt-1, RHt-2, and RHt-3, were tried with both models to predict the current RH (RHt). After that, three precipitation combinations ((i) Pt-1, (ii) Pt-1, Pt-2, (iii) Pt-1, Pt-2, Pt-3) and three temperature combinations ((i) Tt-1, (ii) Tt-1, Tt-2, (iii) Tt-1, Tt-2, Tt-3) were separately added to the best RH combination. In the last (10th) combination, the best temperature and precipitation inputs were added together to the best RH combination to see the combined effect of both parameters on the models' accuracy in predicting relative humidity.
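The lagged input combinations can be assembled from the daily series along the following lines. This is a minimal sketch: the function name `lagged_rows`, the dictionary layout, and the sample numbers are assumptions made for illustration. Only the composition of S3 (three RH lags) and S6 (three RH lags plus three temperature lags) is taken from the text.

```python
def lagged_rows(series, lags, target="RH"):
    """Build (X, y) where each row of X holds the requested lags and y is the target at day t.

    series: dict mapping a variable name to its list of daily values (all the same length)
    lags:   dict mapping a variable name to the number of preceding days to include
    """
    max_lag = max(lags.values())
    n = len(series[target])
    X, y = [], []
    for t in range(max_lag, n):
        row = []
        for name, k in lags.items():
            # lag 1 is yesterday, lag 2 the day before, and so on
            row.extend(series[name][t - lag] for lag in range(1, k + 1))
        X.append(row)
        y.append(series[target][t])
    return X, y


# Illustrative daily values only.
daily = {"RH": [10, 20, 30, 40, 50], "T": [1, 2, 3, 4, 5]}
X_s3, y_s3 = lagged_rows(daily, {"RH": 3})          # S3: RHt-1, RHt-2, RHt-3
X_s6, y_s6 = lagged_rows(daily, {"RH": 3, "T": 3})  # S6: S3 plus Tt-1, Tt-2, Tt-3
```

The first `max_lag` days are dropped because their lagged values are not available.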
The current analysis involves daily data of precipitation, temperature, and relative humidity from 1995 to 2009. About 75% of input data i.e. from 1995 to 2006 was used for training whereas 25% of input data i.e. from 2007 to 2009 was used for testing in both machine learning models for prediction of relative humidity. However, [8] used only two-year data i.e. 2008 to 2009 for training the LSTM model which might not be enough for reliable predictions.
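A chronological split of this kind can be sketched as below, assuming a gap-free daily calendar and a split on the calendar year; the helper names are hypothetical.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Every calendar day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def split_by_year(dates, rows, last_train_year):
    """Chronological split: rows dated up to last_train_year train the model, later rows test it."""
    train = [r for d, r in zip(dates, rows) if d.year <= last_train_year]
    test = [r for d, r in zip(dates, rows) if d.year > last_train_year]
    return train, test


dates = daily_dates(date(1995, 1, 1), date(2009, 12, 31))
rows = list(range(len(dates)))  # placeholder for the real feature rows
train, test = split_by_year(dates, rows, last_train_year=2006)
train_share = len(train) / len(dates)
```

On a gap-free calendar this gives 4383 training days and 1096 testing days, that is, close to an 80/20 split by day count. The shares of about 75% and 25% quoted above are the study's own figures, and the text does not state how they were computed.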

Models evaluation criteria
The models' accuracy in relative humidity prediction against observed data was evaluated using the following statistics, which are commonly used in the related literature. The statistics include R², RMSE, and MAE, as shown in Eqs. (3)-(5):
$$R^2 = \left[ \frac{\sum_{i=1}^{N} (RH_{iO} - \overline{RH}_O)(RH_{iM} - \overline{RH}_M)}{\sqrt{\sum_{i=1}^{N} (RH_{iO} - \overline{RH}_O)^2 \, \sum_{i=1}^{N} (RH_{iM} - \overline{RH}_M)^2}} \right]^2 \qquad (3)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (RH_{iM} - RH_{iO})^2} \qquad (4)$$
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| RH_{iM} - RH_{iO} \right| \qquad (5)$$
where $\overline{RH}_O$ is the mean of the observed relative humidity; $\overline{RH}_M$ is the mean of the predicted relative humidity; N is the number of data points; $RH_{iO}$ is the observed relative humidity; and $RH_{iM}$ is the modeled relative humidity. Previous studies such as [56][57][58][59][60][61] suggested that a single statistical indicator cannot adequately examine the prediction accuracy of soft computing models. Therefore, the current study used three statistical indicators to judge the model prediction accuracy with confidence. When the model errors are normally and uniformly distributed, error statistics such as RMSE and MAE are more suitable. For an ideal model, RMSE and MAE should equal 0, whereas R² should equal 1. The model with smaller values of MAE and RMSE than the other models is considered the best model.
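The three indicators can be computed as in the sketch below. It assumes that R² is the squared Pearson correlation between observed and modeled values, which is consistent with the use of both an observed mean and a predicted mean in the definitions above; the function names and sample values are illustrative.

```python
from math import sqrt


def rmse(obs, mod):
    """Root mean square error between observed and modeled values."""
    return sqrt(sum((o - m) ** 2 for o, m in zip(obs, mod)) / len(obs))


def mae(obs, mod):
    """Mean absolute error between observed and modeled values."""
    return sum(abs(o - m) for o, m in zip(obs, mod)) / len(obs)


def r2(obs, mod):
    """Squared Pearson correlation between observed and modeled values."""
    n = len(obs)
    mean_o, mean_m = sum(obs) / n, sum(mod) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(obs, mod))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_m = sum((m - mean_m) ** 2 for m in mod)
    return cov ** 2 / (var_o * var_m)


# Illustrative relative humidity values in percent.
observed = [50.0, 60.0, 70.0, 80.0]
modeled = [52.0, 58.0, 73.0, 79.0]
scores = (rmse(observed, modeled), mae(observed, modeled), r2(observed, modeled))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same data, which gives a quick sanity check on reported pairs of values.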

Performance evaluation of MARS model in predicting relative humidity
The performance evaluation statistics of the MARS model for the prediction of relative humidity at Khunjerab, Naltar, and Ziarat are presented in Tables 3-5, respectively. The MARS model performed very well at all meteorological stations during both training and testing; in particular, it provided the best predictions for the 6th scenario (S6) of input data combination, which is highlighted in bold. The RMSE, MAE, and R² values during the training (5.58%, 4.51%, 0.852) and testing (5.98%, 5.43%, 0.808) stages for Khunjerab meteorological station are displayed in Table 3. The MARS model performed better during training than during testing at Khunjerab. However, the MARS model did not perform well for the S1, S2, and S3 scenarios. Our results were better than those of the study conducted by [1], which reported that GEP and ANN models can predict relative humidity reliably at two Californian stations. For the GEP model they obtained RMSE = 10.7%, MAE = 7.6%, and R² = 0.73 during training, and RMSE = 10.1%, MAE = 7.5%, and R² = 0.714 during testing. Their ANN model produced better results than GEP: RMSE = 7.8%, MAE = 3.6%, and R² = 0.826 during training, and RMSE = 8.2%, MAE = 4.1%, and R² = 0.751 during testing.
Similarly, the MARS model provided the best prediction of relative humidity for the S6 input data scenario at Naltar during both the training and testing stages, as shown in Table 4. The RMSE, MAE, and R² values for the best input combination were 5.63%, 4.53%, and 0.826, respectively, during training and 6.58%, 5.08%, and 0.806 during testing (Table 4). The MARS model did not perform well for the S1, S2, and S3 input combinations. However, a study conducted by [5] observed that the XGBoost model provided the best prediction of relative humidity (MAE = 2.29%) as compared to SARIMA (MAE = 2.97%) and HW additive (MAE = 2.74%). The MARS model performed best (RMSE = 5.86%, MAE = 4.97%, R² = 0.815) for the prediction of relative humidity at Ziarat for the S6 input combination during the testing stage, as shown in Table 5. The MARS model also performed fairly well during the training stage (RMSE = 5.26%, MAE = 4.59%, R² = 0.833) for the S6 input combination. The MARS model provided a poor prediction of relative humidity for the S1, S2, and S3 input scenarios (Table 5). Overall, the MARS model performed fairly well at Khunjerab (R² = 0.852) and showed slightly lower performance at Naltar (R² = 0.826) for the S6 input combination during the training stage (Tables 3-5).
The MARS model performance was also evaluated by drawing scatter plots. The scatter plots were drawn between observed and predicted daily relative humidity from 2007 to 2009, as displayed in Figure 3. The scatter plots also showed that the MARS model predicted relative humidity well at all meteorological stations, especially at Ziarat, with R² = 0.815 for the S6 input combination during the testing stage (Figure 3).

Performance evaluation of M5T model in predicting relative humidity
The performance evaluation of the M5T model for the prediction of relative humidity at Khunjerab, Naltar, and Ziarat is displayed in Tables 6-8. The M5T model performed reasonably well at all meteorological stations during both the training and testing stages; it provided the best predictions of relative humidity for the 6th input data combination (S6) at all stations, which are highlighted in bold. Overall, the M5T model performance was slightly lower than that of MARS. The M5T model also performed better during training than during testing at all meteorological stations. The M5T model provided the best prediction of relative humidity at Ziarat as compared to Naltar and Khunjerab (Table 8). However, the M5T model did not perform well for the S1, S2, and S3 scenarios, with R² < 0.50 at all meteorological stations (Tables 6-8). A previous study conducted by [8] observed that the LSTM model is capable of forecasting complex univariate relative humidity time series. On the contrary, [3] suggested that ARIMA can provide a better prediction of relative humidity than LSTM. At Khunjerab station, the M5T model performed well (RMSE = 5.94%, MAE = 5.08%, R² = 0.796) for the S6 input combination during the training stage, whereas it displayed lower prediction performance (RMSE = 6.14%, MAE = 5.56%, R² = 0.772) during the testing stage, as shown in Table 6. The M5T model did not perform well for the S1, S2, and S3 scenarios (R² < 0.50). Similarly, at Naltar station, the M5T model performed reasonably well (RMSE = 5.82%, MAE = 5.12%, R² = 0.791) for the S6 input combination during the training stage, whereas it exhibited slightly lower performance (RMSE = 6.19%, MAE = 5.58%, R² = 0.762) during the testing stage, as presented in Table 7.
The M5T model performance was also evaluated by drawing scatter plots. The scatter plots were drawn between observed and predicted daily relative humidity from 2007 to 2009, as displayed in Figure 4. The scatter plots showed that the M5T model can also predict relative humidity fairly well at all meteorological stations, especially at Ziarat (R² = 0.782) for the S6 input combination during the testing stage (Figure 4).

Time variations of the observed and predicted relative humidity by MARS and M5T models
Time variations of the observed and predicted relative humidity by MARS and M5T model at Khunjerab, Naltar, and Ziarat meteorological stations are displayed in Figures 5-7. Time variations plots have been drawn by using the best-predicted data of relative humidity (i.e. S6 scenario). The daily data has been drawn from 2007 to 2009. Figure 5 showed that both the models captured time-series variations of predicted relative humidity very well with reference to observed data at Khunjerab station but slightly underestimated the values from 900 to 1100 days. Moreover, these models slightly underestimated the low and high values of predicted relative humidity with reference to observed data at few points throughout the time series. Overall, the MARS model performed better as compared to M5T for the prediction of daily relative humidity data at Khunjerab.
The MARS and M5T models also captured time-series variation of relative humidity superbly with respect to observed data at Naltar station for the S6 input  combination as displayed in Figure 6. Both the models slightly underestimated the predicted relative humidity from 850 days to 1100. Moreover, these models slightly underestimated the predictions of low and high values of relative humidity at some points throughout the study period. Overall, the MARS model provided better predictions of relative humidity as compared to M5T at Naltar (Figure 6).
However, both the machine learning models provided the best prediction of relative humidity at Ziarat which is a mid-altitude meteorological station as shown in Figure 7. Both the models captured the temporal variations of relative humidity very well throughout the period with reference to observed data for the S6 input combination. Furthermore, the models underestimated the low and high values of predicted relative humidity with reference to observed data. The MARS model predicted low and high values of relative humidity fairly well but it slightly underestimated the values at few points throughout the study period. Overall, the MARS model provided better predictions of relative humidity as compared to M5T at Ziarat (Figure 7).

Conclusions
Relative humidity has an important impact on plant growth, human health, industry, weather, and climate. Any change in temperature and relative humidity  may result in droughts, heatwaves, floods, and hurricanes. Thus the relative humidity is one of the important factors to measure environmental changes. Keeping in view the importance of relative humidity, the current study has attempted to predict the relative humidity in a high elevated alpine basin (Hunza) of western Karakoram by using the MARS and M5T machine learning models. The current study is novel in that respect that previously nobody tried to predict the relative humidity in a high elevation alpine basin.
Statistical analysis of the model outputs suggested that both models produced reliable predictions of relative humidity at the Khunjerab, Naltar, and Ziarat meteorological stations of the Hunza basin during both the training and testing stages. Out of 10 input data combinations of temperature, precipitation, and relative humidity, the 6th combination (i.e. RHt-1, RHt-2, RHt-3, Tt-1, Tt-2, Tt-3) produced the best results for each station and each model. The statistical indicators confirmed the excellent performance of both models at all stations. For the MARS model, the RMSE, MAE, and R² values ranged from 5.26-5.63%, 4.51-4.59%, and 0.826-0.856, respectively, during the training stage, while they ranged from 5.86-6.58%, 4.97-5.43%, and 0.806-0.815, respectively, during the testing stage. For the M5T model, the RMSE, MAE, and R² values ranged from 5.74-5.94%, 5.04-5.12%, and 0.791-0.796, respectively, during the training stage, whereas they ranged from 6.08-6.19%, 5.46-5.58%, and 0.762-0.783, respectively, during the testing stage. Both models showed poor performance (R² < 0.50) for the S1, S2, and S3 input combinations at all stations. Moreover, both models performed better in training than in testing. Both models performed best at Ziarat as compared to the other stations. Overall, the MARS model performed better than M5T at all stations. The current study provides a baseline for future studies to predict other meteorological variables such as temperature, wind speed, solar radiation, and evapotranspiration by using machine learning tools in high-altitude and remote basins that face the issue of data scarcity.