The Role of Statistical Methods and Tools for Weather Forecasting and Modeling

The role of statistical methods in forecasting climatological parameters cannot be trivialized. This study gives an in-depth review of the variations of the Mann-Kendall (M-K) trend test and how they can be applied, regression techniques (simple and multiple), and the Angstrom-Prescott model for solar radiation. The study then applies some of these methods to data obtained from the Nigerian Meteorological Agency (NiMet), using tools such as the Python programming language and Wolfram Mathematica. Results show that the maximum ambient temperature for Calabar is increasing significantly (Z = 2.52), with a calculated p-value below the 0.05 significance level. The seasonal M-K test was also applied to the dry and wet seasons, and both were found to be increasing significantly (Z = 3.23 and Z = 4.04 respectively), with calculated p-values < 0.05. The relationship between refractivity and the meteorological parameters related to it was examined using partial derivatives, which give the gradient of refractivity with respect to each parameter; this was compared with results from the correlation matrix to show that the water vapor content of the atmosphere contributes significantly to the variation of refractivity. Multiple linear regression was also adopted to give a model for the prediction of refractivity in the region; the residual error between the calculated refractivity and the predicted refractivity was minimal.


Introduction
The importance of statistical modeling and forecasting of time series data cannot be overemphasized. The benefits range from easy interpretability arising from visualization of results to the removal of the mysticism factor for the layman. The word 'forecasting' has to do with predicting the future based on data from the past and present. This is regularly done by the analysis of trends.
A routine example might be the estimation of temperature trends for some specified future date. Compared to forecasting, prediction can be seen as a term which is more general.
Forecasting methods have been applied in areas such as climatology, finance, and foreign exchange, and in different regions of the world, for better prediction and simulation. The key contribution of Information and Communication Technology (ICT) is that it allows us to make predictions and simulations from previously obtained data. This holds for every area of application, provided attention is paid to the rules that govern it.
In this study we will be applying some statistical methods which can be adopted for the forecasting of climatic (weather) parameters in different regions of the world.
It is important to note that the predictability of the atmosphere is not perfect. Although statistical methods are necessary, the results obtained are not totally accurate, which is why room is given for errors (uncertainties); nevertheless, a trend can be observed [1]. Statistical methods have been applied in the study of different regions; for example, Daniel S. Wilks [1] emphasized the use of these methods in analyses of regions that do not necessarily have the same climatic conditions. This suggests that the underlying laws hold irrespective of the region; i.e., neglecting all other factors that contribute little to weather, the same methods can be applied in different regions to yield accurate results.
Analysis of trends can be useful in depicting and predicting the changing patterns and erraticism of some climatic parameters. This analysis gives a proper knowledge about the changing conditions of the climate and its effects, by the evaluation of meteorological parameters.
A data scientist using any tool or software for modeling and forecasting is particularly interested in the progression of these (meteorological) parameters as a function of time, f(t). The designers of navigation or monitoring systems cannot trivialize the importance of forecasting, as this is a very important part of their system. The spatial and temporal changes of atmospheric parameters call for the adoption of this analysis to discern the effects of some meteorological parameters on some variables; for example, see [2].
A very popular tool for any data scientist who wants to understand the nitty-gritty of weather forecasting is the Python programming language. This paper will explain in detail the setup process for it, to help the layman get started. A dataset of temperature trends in Calabar, Nigeria will be used at the end of this chapter to test the processes explained, for better visualization.
The applicability of results from forecasting cannot be underestimated, because this is valuable information for people who depend on weather conditions, such as farmers, surfers, and event planners. The accurate prediction of atmospheric parameters can go a long way in positively affecting the finances of the informed, as money can be saved by avoiding unnecessary cost during trying times [3]. Natural disasters such as tsunamis can be predicted with the correlation of meteorological parameters, harnessing information as explained previously and then incorporating this information through machine learning into the design of forecasting systems.
We delve deeper into a review of statistical methods like the M-K test and its different variations, the Angstrom-Prescott model for the estimation of solar radiation, linear regression techniques, with a deep look into multiple linear regression which will be applied in predicting refractivity after obtaining the coefficients of the variables. Results will be obtained and explained.

Review of statistical tests/methodology
With the shift going on in the world of technology, the implementation of some time series forecasting methods will be explained as well as their python implementation techniques. We often use forecasting models on time series data for the estimation of future trends of meteorological parameters.

Statistical test for trend (Mann-Kendall trend test)
One of the most important and widely applied tests for trends in time series is the Mann-Kendall trend test. It is mostly used for environmental and hydrological data. The test is non-parametric, so the data need not conform to a particular distribution; in addition, the test has low sensitivity to abrupt breaks caused by an inhomogeneous series [4]. The null hypothesis $H_0$, which says that there is no monotonic trend in the series, is tested against the alternative hypothesis $H_1$, which says that there is a trend in the series. The test is applied to cases where the data values $x_i$ agree with the equation below:
$$x_i = f(t_i) + \varepsilon_i \tag{1}$$
where $f(t_i)$ is a function of time and $\varepsilon_i$ are the residuals, which have zero mean. The Mann-Kendall test statistic S is calculated using the formula
$$S = \sum_{k=1}^{n-1} \sum_{j=k+1}^{n} \operatorname{sgn}(x_j - x_k) \tag{2}$$
$$\operatorname{sgn}(x_j - x_k) = \begin{cases} 1 & \text{if } x_j - x_k > 0 \\ 0 & \text{if } x_j - x_k = 0 \\ -1 & \text{if } x_j - x_k < 0 \end{cases} \tag{3}$$
where n in Eq. (2) is the number of data values in the studied series. The advantage of this test is that it can handle the situation where data values are incomplete with respect to the number of years or months, etc. [4]. In the case where n is greater than or equal to 10, we adopt the normal approximation (Z).
To find the variance of S, VAR(S), we compute Eq. (4) below.
$$\mathrm{VAR}(S) = \frac{1}{18}\left[ n(n-1)(2n+5) - \sum_{p=1}^{g} t_p (t_p - 1)(2 t_p + 5) \right] \tag{4}$$
In this equation, the number of data values is represented by n, the number of tied groups (groups of equal values) is represented by g, and the number of data values in the p-th group is represented by $t_p$.
We now use the results from VAR(S) to find the test statistic Z:
$$Z = \begin{cases} \dfrac{S - 1}{\sqrt{\mathrm{VAR}(S)}} & \text{if } S > 0 \\ 0 & \text{if } S = 0 \\ \dfrac{S + 1}{\sqrt{\mathrm{VAR}(S)}} & \text{if } S < 0 \end{cases} \tag{5}$$
A decreasing trend is indicated by Eq. (5) when the value of Z is negative, and an increasing trend when Z is positive (Table 1).
An increasing or decreasing trend is significant when the p-value of the series is lower than the significance level (α); in this case, we can say that there is a trend in the series [5]. The adoption of different significance levels with respect to the number of given data values n is given in Table 1.
The classification of this probability/significance level is important because results can otherwise be mistaken for being entirely true. A significance level of, say, 0.05 means that there is a 5% probability of rejecting the null hypothesis $H_0$ when it is in fact true. Similarly, a significance level of 0.01 means that there is a 1% probability of wrongly rejecting $H_0$.
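The steps described for Eqs. (2), (4) and (5) can be sketched in plain Python. This is a minimal illustration rather than the implementation used in the study; the function name `mann_kendall` and the two-sided p-value computed from the standard normal distribution are assumptions made here for the example.

```python
import math
from collections import Counter

def mann_kendall(x):
    """Return (S, VAR(S), Z, two-sided p-value) for a sequence x."""
    n = len(x)
    # Eq. (2): sum the signs of all later-minus-earlier differences
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            diff = x[j] - x[k]
            s += (diff > 0) - (diff < 0)
    # Eq. (4): variance of S, corrected for groups of tied values
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in Counter(x).values())
    var_s = (n * (n - 1) * (2 * n + 5) - tie_term) / 18
    # Eq. (5): normal approximation with continuity correction
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value under the standard normal distribution (assumed here)
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, var_s, z, p

# Hypothetical series, for illustration only (not NiMet data)
s, var_s, z, p = mann_kendall([30.1, 30.4, 30.2, 30.8, 31.0, 30.9, 31.3, 31.2, 31.6, 31.9])
print(s, var_s, z, p, "trend" if p < 0.05 else "no trend")
```

A positive Z with p below the chosen significance level would be read as a significant increasing trend, in line with the interpretation given above.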

Regression analysis
The two simplest ways to forecast time series data by observation are simple regression and the moving average; both depend on historical data. The former demands mere observation of the previous trend and extrapolation from it, which can be somewhat less accurate. The moving average has been used for forecasting meteorological data such as rainfall (see reference [6]). Regression analysis concerns the relationship that one dependent variable has with one or more independent variables. We use it to build models showing the strength of the relationship between the variables and any possible future relationships [1].

Simple linear regression
This form of regression is based on the assumption that the two variables (dependent and independent) have a linear relationship described by a slope and an intercept; it further assumes that the residual error has zero mean and that its spread is constant across all observations.
$$Y = mX + c + e \tag{6}$$
Y is the dependent variable. X is the independent variable. m is the value of the slope. c is the intercept. e is the residual error. The regression is depicted by a straight line described by Eq. (6) (Figure 1).
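As a minimal sketch of how the slope m and intercept c of a straight-line fit can be estimated by ordinary least squares, consider the following; the function name `fit_line` and the sample numbers are hypothetical and only serve the illustration.

```python
def fit_line(x, y):
    """Least-squares slope m, intercept c and residuals e for Y = mX + c + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = mean_y - m * mean_x
    # Residual error at each observation
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    return m, c, residuals

# Hypothetical data, for illustration only
years = [0, 1, 2, 3, 4]
temps = [30.0, 30.3, 30.5, 30.9, 31.1]
m, c, residuals = fit_line(years, temps)
print(m, c, residuals)
```

Inspecting the residuals is a quick way to judge whether their spread looks roughly constant across observations.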

Multiple linear regression
This model is similar to that of simple linear regression, except that it has multiple independent variables, whereas simple linear regression has just one. This can be represented by Eq. (7):
$$Y = m_1 X_1 + m_2 X_2 + m_3 X_3 + c + e \tag{7}$$
Y is the dependent variable. $X_1$, $X_2$, $X_3$ are the independent variables. $m_1$, $m_2$, $m_3$ are the values of the slopes. c is the intercept. e is the residual error.
One thing to note about multiple linear regression is that the independent variables must not be collinear, i.e., they must not be highly correlated with each other; otherwise it becomes difficult to assess the relationship between the dependent variable and each independent variable.
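A simple way to screen for collinearity is to compute the pairwise Pearson correlation between the independent variables and flag pairs whose coefficient is large in magnitude. The sketch below assumes a cut-off of 0.8, which is a common rule of thumb and not a value taken from this study; the function names and the sample columns are likewise hypothetical.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical predictor columns, for illustration only
predictors = {
    "temperature": [26.1, 27.0, 27.8, 26.5, 25.9],
    "pressure": [1010.2, 1009.1, 1011.0, 1008.7, 1010.5],
    "humidity": [84.0, 80.5, 78.0, 86.0, 82.5],
}
print(collinear_pairs(predictors))
```

An empty result suggests that no pair exceeds the chosen cut-off; a flagged pair would be a reason to drop or combine one of the variables before fitting.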
We also need to note that before multiple linear regression is performed on a range of data values, a linear relationship must exist between each independent variable and the dependent variable, and the spread of the residual error must be almost constant at each point in the model. Multiple linear regression will be applied to study and predict the refractivity trend in Calabar, Nigeria. This was done with the 'statsmodels' package in Python, and results are displayed in section 2.5.
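To make the fitting step concrete, the sketch below estimates the intercept and slopes of a multiple linear regression by solving the least-squares normal equations in plain Python. It is a dependency-free stand-in for illustration and not the statsmodels workflow used in the study (a package such as statsmodels would additionally report p-values); the function name `fit_mlr` and the sample data are hypothetical.

```python
def fit_mlr(X, y):
    """Return [c, m1, m2, ...] minimizing the squared residuals of y on X.

    X is a list of rows, each row holding the independent variables
    for one observation; y is the list of dependent values.
    """
    rows = [[1.0] + list(r) for r in X]  # leading 1 carries the intercept
    k = len(rows[0])
    # Augmented matrix of the normal equations (A^T A) beta = A^T y
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]

# Hypothetical observations: two predictors per row, for illustration only
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3.1, 6.9, 6.2, 10.8, 9.1]
coefficients = fit_mlr(X, y)
predicted = [coefficients[0] + sum(m * v for m, v in zip(coefficients[1:], row)) for row in X]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
print(coefficients, residuals)
```

The residuals printed at the end can be checked for roughly constant spread, which is one of the conditions stated above.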
A meteorological equation to which this regression technique is well suited is the refractivity equation recommended by the International Telecommunication Union (ITU), shown in Eq. (8); T is the absolute temperature (K).
Eq. (8) shows the relationship between refractivity (dependent variable) and meteorological parameters (ambient temperature, atmospheric pressure, and vapor pressure) which are all independent variables.
This has been applied in [7] to model the meteorological parameters for the accurate determination of refractivity. These meteorological parameters (ambient temperature, atmospheric pressure and relative humidity) have been obtained from the Nigerian Meteorological Agency (NiMet), Calabar.
Results have been presented in section 2.5. From Eq. (8), we obtain the atmospheric vapor pressure e from the relation;
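As a rough illustration of how refractivity can be computed from temperature, pressure and relative humidity, the sketch below assumes the widely quoted two-term form N = (77.6/T)(P + 4810 e/T), with the vapor pressure taken as e = H e_s/100 and a Magnus-type expression for the saturated vapor pressure e_s. These forms and their constants are assumptions made for this example and may differ from the exact expressions used in the study.

```python
import math

def saturation_vapor_pressure(t_celsius):
    """Saturated vapor pressure e_s in hPa (assumed Magnus-type constants, over water)."""
    return 6.1121 * math.exp(17.502 * t_celsius / (t_celsius + 240.97))

def vapor_pressure(rh_percent, t_celsius):
    """Vapor pressure e in hPa from relative humidity H in percent (assumed e = H * e_s / 100)."""
    return rh_percent / 100.0 * saturation_vapor_pressure(t_celsius)

def refractivity(p_hpa, t_kelvin, e_hpa):
    """Refractivity N in N-units, assuming N = (77.6 / T) * (P + 4810 * e / T)."""
    return 77.6 / t_kelvin * (p_hpa + 4810.0 * e_hpa / t_kelvin)

# Hypothetical monthly means, for illustration only
t_c, p, rh = 27.0, 1010.0, 85.0
e = vapor_pressure(rh, t_c)
print(e, refractivity(p, t_c + 273.15, e))
```

Note that the temperature enters the vapor pressure relation in degrees Celsius but the refractivity relation in kelvin.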

Review of the application of simple linear regression analysis in climatology (the Angstrom-Prescott model)
The linear regression technique can be applied to find the relationships between an independent variable and the dependent variable. We can see the explanation of this from Eq. (6).
One major example of the benefits of linear regression is the estimation of the Angstrom-Prescott coefficients of the Angstrom-Prescott model for a particular region, as this relates to solar radiation. The Angstrom-Prescott model is given by [8]:
$$\frac{H}{H_0} = a + b\left(\frac{n}{N}\right) \tag{9}$$
where $H_0$ is the monthly average daily extraterrestrial radiation and H is the monthly average daily global radiation in Wh/m²/day. n is the actual sunshine duration in a day for a particular region (hours), and N is the monthly mean length of the day in hours. The Angstrom-Prescott empirical coefficients are given by a and b. The linear regression technique has been adopted by Srivastava and Pandey [8] to find a and b. Comparing Eq. (6) to Eq. (9), we have that $Y = H/H_0$, $X = n/N$, $c = a$ and $m = b$. This shows that if we have the variables $H/H_0$ and $n/N$, we can get the values of a and b from our Y intercept and slope respectively. Getting these constant values for specific regions will help us forecast future trends.
For better understanding, the extraterrestrial radiation $H_0$ is given by the equation [9]; Here, $I_{SC}$ is the solar constant with a value of 1367 W/m², and d represents the day of the year (from January 1st to December 31st), taking January 1st as 1 and December 31st as 365 (or 366 in the case of a leap year). The latitude of the study location, the declination angle and the sunset hour angle are represented by $\phi$, $\delta$, and $\omega$ respectively, where $\omega = \cos^{-1}(-\tan\phi \tan\delta)$. The declination angle can be obtained from [9].
The monthly mean length of the day (in hours) can be obtained from [9].
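The quantities named above can be sketched numerically as follows. The sunset hour angle and the solar constant follow the text; the declination formula, the day-length relation N = 2ω/15 and the form of the extraterrestrial radiation integral are the standard textbook expressions and are assumed here, since they may differ in detail from those in [9]. The latitude used in the example is hypothetical.

```python
import math

I_SC = 1367.0  # solar constant in W/m^2, as stated in the text

def declination_deg(d):
    """Declination angle in degrees for day of year d (assumed textbook form)."""
    return 23.45 * math.sin(math.radians(360.0 * (284 + d) / 365.0))

def sunset_hour_angle_deg(lat_deg, decl_deg):
    """Sunset hour angle: omega = arccos(-tan(phi) * tan(delta)), in degrees."""
    cos_omega = -math.tan(math.radians(lat_deg)) * math.tan(math.radians(decl_deg))
    return math.degrees(math.acos(cos_omega))

def day_length_hours(omega_deg):
    """Length of the day in hours (assumed relation N = 2 * omega / 15)."""
    return 2.0 * omega_deg / 15.0

def extraterrestrial_radiation(d, lat_deg):
    """Daily extraterrestrial radiation H0 in Wh/m^2/day (assumed standard form)."""
    delta = declination_deg(d)
    omega = sunset_hour_angle_deg(lat_deg, delta)
    phi_r, delta_r, omega_r = (math.radians(v) for v in (lat_deg, delta, omega))
    # Correction for the varying Earth-Sun distance over the year
    eccentricity = 1.0 + 0.033 * math.cos(math.radians(360.0 * d / 365.0))
    geometry = (math.cos(phi_r) * math.cos(delta_r) * math.sin(omega_r)
                + omega_r * math.sin(phi_r) * math.sin(delta_r))
    return (24.0 / math.pi) * I_SC * eccentricity * geometry

# Hypothetical latitude of 5 degrees north, day 172 (around the June solstice)
lat, day = 5.0, 172
omega = sunset_hour_angle_deg(lat, declination_deg(day))
print(declination_deg(day), day_length_hours(omega), extraterrestrial_radiation(day, lat))
```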
The above equations can be applied to estimate the coefficients using linear regression. By this we can use these coefficients to predict solar radiation for a given region.
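A minimal sketch of estimating the coefficients a and b by regressing H/H0 on n/N, and then using them to predict global radiation, is given below. The function names and the sample values are hypothetical; real coefficients would have to be fitted to measured sunshine and radiation data for the region of interest.

```python
def fit_angstrom_prescott(sunshine_fraction, clearness_index):
    """Least-squares a (intercept) and b (slope) in H/H0 = a + b * (n/N)."""
    n_obs = len(sunshine_fraction)
    mean_x = sum(sunshine_fraction) / n_obs
    mean_y = sum(clearness_index) / n_obs
    sxx = sum((x - mean_x) ** 2 for x in sunshine_fraction)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sunshine_fraction, clearness_index))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict_global_radiation(a, b, sunshine_hours, day_length, h0):
    """Predicted global radiation H = H0 * (a + b * n / N)."""
    return h0 * (a + b * sunshine_hours / day_length)

# Hypothetical monthly values of n/N and H/H0, for illustration only
n_over_N = [0.30, 0.42, 0.55, 0.61, 0.48, 0.35]
H_over_H0 = [0.38, 0.44, 0.51, 0.54, 0.47, 0.41]
a, b = fit_angstrom_prescott(n_over_N, H_over_H0)
print(a, b, predict_global_radiation(a, b, 6.0, 12.0, 10000.0))
```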
We know that the declination angle ranges over −23.5° ≤ δ ≤ +23.5°. From Figure 2, we can see that the declination angle is 0° at the vernal and autumnal equinoxes, while it is +23.5° and −23.5° at the (northern hemisphere) summer and winter solstices respectively. It is easy to see why this has a huge effect on the variation of global solar radiation.
Klein in 1977 [10] recommended average days of the various months and corresponding angle of declination as in Table 2.

Calculus in climatology
Applying calculus in environmental science is important in predicting many things. It can be applied to understand the impacts of parameters on the variations of other parameters to which they relate. Calculus is the 'mathematical study of continuous change', so it can be applied in climatology to discern the impacts of some parameters on the "continuous change" of others [11][12][13].
Writing the refractivity equation in terms of relative humidity H, by substituting (10) into (9), and then into (8);
Similarly, obtaining refractivity in terms of the saturated vapor pressure $e_s$ using Eqs. (8) and (9) gives;
Now applying partial differentiation to the equations for refractivity, Eqs. (8), (16), and (17), we obtain partial derivatives relating each parameter to refractivity;
From monthly temperature, humidity and atmospheric pressure data obtained for 2005-2018 from the archives of the Nigerian Meteorological Agency (NiMet), Calabar, the atmospheric vapor pressure and the saturated vapor pressure can be obtained by applying these parameters in Eqs. (9) and (10) (Figure 3).
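To illustrate what such gradients look like, the sketch below differentiates an assumed two-term form of refractivity, N = 77.6 P/T + 77.6 × 4810 e/T², with respect to pressure, vapor pressure and temperature, and checks one analytic result against a numerical central difference. The assumed form, the function names and the sample values are for illustration only and do not reproduce the study's own equations, which also cover relative humidity and saturated vapor pressure.

```python
K1 = 77.6           # coefficient of the pressure term (assumed form)
K2 = 77.6 * 4810.0  # coefficient of the vapor pressure term (assumed form)

def refractivity(p, t, e):
    """N = K1 * P / T + K2 * e / T^2, with P and e in hPa and T in kelvin."""
    return K1 * p / t + K2 * e / t ** 2

def gradients(p, t, e):
    """Analytic partial derivatives of N with respect to P, e and T."""
    return {
        "dN/dP": K1 / t,
        "dN/de": K2 / t ** 2,
        "dN/dT": -K1 * p / t ** 2 - 2.0 * K2 * e / t ** 3,
    }

def central_difference_in_t(p, t, e, h=1e-3):
    """Numerical check of dN/dT using a central difference."""
    return (refractivity(p, t + h, e) - refractivity(p, t - h, e)) / (2.0 * h)

# Hypothetical atmospheric state, for illustration only
p, t, e = 1010.0, 300.0, 30.0
print(gradients(p, t, e), central_difference_in_t(p, t, e))
```

Under this assumed form, the gradient with respect to vapor pressure is much larger per hPa than the gradient with respect to total pressure, which is consistent with the emphasis placed on water vapor in the text.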

Python implementation for Mann-Kendall trend test
With Python installed, the next step is to install an IDE (integrated development environment). One of the easiest to use is the Jupyter Notebook, which displays results as you code.
We will walk you through the process of analyzing data, using data for Calabar in the south of Nigeria collected from the archives of the Nigerian Meteorological Agency (NiMet). Research has been done in this area in climatology [14][15][16][17][18], but applying Python and the Mann-Kendall test can give more meaning to time series data. We need to install the Python package for the Mann-Kendall test, called 'pymannkendall'. To install this package, the following Python packages are required:
• NumPy
• SciPy
For handling and cleaning data we need the 'pandas' package, and for data visualization we need the 'matplotlib' package.
We want to analyze maximum ambient temperature data for 20 years in Calabar.
In the Jupyter notebook, the first step will be to import the respective packages. We must also note that for our examples in the Appendices, we stored the excel file containing the data used for the analysis in the same folder as the python file for easy reference.
Appendix A shows the process of importing the installed packages required for the analysis into the workspace.
Before we perform the Mann-Kendall test, we need to import the excel file titled 'Temperature' in which the table is stored, in a sheet name called 'MAX'. See Appendix B.
Appendix C shows how the original Mann-Kendall test is performed after importing the packages and data. We assigned the imported data the name 'Max' and set the significance level (α) to the default 5% (0.05); this can be adjusted by the user as preferred. Results were obtained and are displayed in Appendix C.
We now perform the seasonal M-K test for the dry season variation. We import the Excel file titled 'Temperature'; the date column will be an index column. The sheet of the Excel file in which the data is stored is named 'dry'. This implementation can be seen in Appendix D.
Appendix E shows the Python implementation of the seasonal M-K test for the dry season variation. By setting the significance level (α) to the default 5% (0.05), and the period to 4, which stands for the 4 months of the dry season in the study area (November to February), we have satisfied the criteria for the seasonal M-K test.
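To show what the period setting does conceptually, the sketch below implements a seasonal Mann-Kendall test in plain Python, assuming the usual formulation in which S and VAR(S) are computed separately for each position within the period and then summed. This is an illustrative reimplementation under that assumption, not the code of the 'pymannkendall' package; the function names and the sample series are hypothetical.

```python
import math
from collections import Counter

def _s_and_variance(x):
    """Mann-Kendall S and VAR(S) for one sub-series, with tie correction."""
    n = len(x)
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            diff = x[j] - x[k]
            s += (diff > 0) - (diff < 0)
    tie_term = sum(t * (t - 1) * (2 * t + 5) for t in Counter(x).values())
    return s, (n * (n - 1) * (2 * n + 5) - tie_term) / 18

def seasonal_mann_kendall(series, period):
    """Return (S, VAR(S), Z, two-sided p-value) summed over the seasons."""
    s_total, var_total = 0, 0.0
    for season in range(period):
        # Every period-th value belongs to the same month across the years
        s, var_s = _s_and_variance(series[season::period])
        s_total += s
        var_total += var_s
    if s_total > 0:
        z = (s_total - 1) / math.sqrt(var_total)
    elif s_total < 0:
        z = (s_total + 1) / math.sqrt(var_total)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))
    return s_total, var_total, z, p

# Hypothetical dry-season series: 4 months per year over 6 years
dry = [31.0, 32.1, 33.0, 32.4,
       31.2, 32.0, 33.3, 32.6,
       31.1, 32.4, 33.2, 32.9,
       31.5, 32.6, 33.6, 32.8,
       31.7, 32.5, 33.9, 33.1,
       31.9, 32.9, 34.0, 33.4]
print(seasonal_mann_kendall(dry, period=4))
```

With period set to 4, each month is only compared with the same month in other years, so the regular month-to-month cycle within a season does not mask or mimic a trend.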

Map of study area showing Calabar as a coastal area (left) and the exact location of the Nigerian meteorological agency (NiMet) where the data was obtained (right).
For the wet season variation, the Excel file titled 'Temperature' will be imported and the date will be an index column. The sheet is named 'wet'. Appendix F shows the implementation code for this import.
We can now perform the seasonal Mann-Kendall test on the wet season data, as shown in Appendix G. The seasonal Mann-Kendall test is run on the imported data, to which we assigned the name 'wet', with the significance level (α) set to the default 5% (0.05); this can be adjusted by the user as preferred. We also set the period to 8, which stands for the 8 months of the wet season in the study area (March to October).
There are other variations of the Mann-Kendall test along with their python implementation [19]. These can be used depending on the data obtained and the aim of the test.

Discussion
For the annual variation in Figure 4, results show that there is a trend in the series, as the p-value is less than the significance level (0.05). The positive Z value (observed from Appendix C) shows that the series is increasing. We can conclude that the maximum ambient temperature is increasing, and that it is doing so significantly; the slope of the trend can be observed from the results in Appendix C.
For the dry season variation observed in Figure 5, results show that there is a trend in the series. The positive Z value of the dry season trend observed from Appendix E shows that the series is increasing. We can conclude that the maximum temperature in the dry season is increasing significantly, as the calculated p-value is less than the significance level (0.05); the slope of the trend can be observed from the results in Appendix E.
For the wet season variation, also observed in Figure 5, results show that there is a trend in the series. The positive Z value from Appendix G shows that the series is increasing. We can conclude that the maximum temperature in the wet season is increasing significantly, as the calculated p-value is less than the significance level (0.05); the slope of the trend can be observed from the results in Appendix G.
These results are in agreement with Agbo et al. [2] for the same region.

Relationship between refractivity and meteorological parameters
To understand the relationship between refractivity and all parameters relating to it, we adopt Eq. (18) by substituting obtained and calculated data.
From the data obtained at the Nigerian Meteorological Agency (NiMet), Calabar, and adopting Eqs. (9) and (10), results from the gradients in Eq. (19) show that the vapor pressure and saturated vapor pressure contribute more to the variation of refractivity. The relative humidity similarly has a high gradient; this can be physically explained by relating the water vapor content of the atmosphere to the variation of refractivity.
The correlation plot of refractivity and all other meteorological parameters is shown in Figure 6. Results agree with those of the differential equations in Eq. (19): the correlation plot shows that the atmospheric vapor pressure and relative humidity have strong positive relationships with refractivity. The saturated vapor pressure, however, has a low correlation coefficient compared to its high gradient in Eq. (19). This can be interpreted as follows: the variation of the saturated vapor pressure makes a relatively high contribution to the variation of refractivity, but the saturated vapor pressure does not follow a trend similar to that of refractivity.

Application of multiple linear regression in climatology
Multiple linear regression has been applied to relate refractivity to the obtained meteorological parameters. The goal is to obtain an equation that relates refractivity to meteorological parameters through multiple linear regression (MLR). Using Eq. (8) to calculate refractivity, we show results in Table 3. As one of the conditions for carrying out multiple linear regression, we have to test for collinearity between the independent variables. We see from the correlation matrix in Figure 6 that the independent variables are not collinear; hence the criterion for carrying out MLR is satisfied. From our analysis we obtain the coefficients (slopes) of the variables (meteorological parameters) and the intercept, which give Eq. (20). This equation can be used to accurately predict the variation of refractivity, given the values of the meteorological parameters. Table 4 shows the results obtained from the multiple linear regression. The values for the predicted refractivity (Predicted N) were obtained from Eq. (20) by substituting the values of the meteorological parameters. This equation is more straightforward than the equation recommended by the ITU, as refractivity is linear in all the variables. Figure 7 shows the trend of refractivity calculated from Eq. (8) together with that of the predicted refractivity, calculated from Eq. (20). The residual error seen in Table 5 shows relatively constant values (in agreement with our MLR conditions) and a small deviation from the original values of refractivity.
From Table 4, the probability values (p-values) of the parameters are all less than the significance level (5% = 0.05; 95% confidence level); this indicates that each independent variable has a statistically significant relationship with the dependent variable.

Figure 7. Comparison plot of annual refractivity and predicted refractivity.

Figure 7 shows the minimal error between the predicted refractivity and the calculated refractivity. Table 5 shows the values for both, as well as the residual error between them. The error is small, and thus Eq. (20) can be adopted for the prediction of refractivity for the study area. This equation can be modified so that refractivity N is obtained in terms of other parameters, such as the saturated vapor pressure and the atmospheric vapor pressure.

Conclusion
There are myriad ways in which weather can be forecast, and this arises from an understanding of basic meteorological parameters and how they behave in the atmosphere, as well as from an understanding of the role of statistics in climate research [21]. Research in this area has been reviewed to give a better understanding of the different techniques for analyzing trends, which include linear regression (multiple and simple), the Mann-Kendall trend test [22,23] (to test for trends in a time series), the Angstrom-Prescott model for estimating solar radiation, and the Python implementation of some of these techniques.
The multiple linear regression technique was applied to model an equation that predicts the trend of refractivity in the study location. The simple linear regression technique has been explained, along with methods for its application in estimating the Angstrom-Prescott coefficients. These coefficients can be obtained for specific regions and applied to predict solar radiation in those regions.
Results from the multiple linear regression gave an accurate model for the prediction of refractivity in the region, as the residual error between the calculated refractivity and the predicted refractivity was minimal.
The original and seasonal Mann-Kendall tests have been applied to analyze the maximum temperature in Calabar, Nigeria for the annual and seasonal (dry and wet season) variations respectively. Results show that the annual, dry season and wet season series are all increasing (with positive Kendall Z-values of 2.52, 3.23 and 4.04 respectively), and that the increase is significant at the 5% (0.05) level of significance, since their p-values are all less than 0.05, in agreement with Agbo and Ekpo [23].
The relationship between refractivity and the meteorological parameters related to it was examined using partial derivatives, which give the gradient of refractivity with respect to each parameter; this was compared with results from the correlation matrix to show that the water vapor content of the atmosphere contributes significantly to the variation of refractivity.

A.4 Appendix D
dry = pd.read_excel("Temperature data.xlsx", sheet_name='Sheet2', index_col='YEAR')
The Excel file titled 'Temperature' will be imported and the date will be an index column. The sheet name is called 'dry'.
We can now perform the Mann-Kendall test