
The Efficiency of Polynomial Regression Algorithms and Pearson Correlation (r) in Visualizing and Forecasting Weather Change Scenarios

Written By

Okba Weslati, Samir Bouaziz and Mohamed Moncef Serbaji

Submitted: 22 September 2021 Reviewed: 17 January 2022 Published: 22 March 2022

DOI: 10.5772/intechopen.102726

From the Edited Volume

Recent Advances in Polynomials

Edited by Kamal Shah


Abstract

In this chapter, we discuss the application of Python using the polynomial regression approach for weather forecasting. We also examine the role of the Pearson correlation in modifying the trend of the climate forecast. The weather data were processed in AquaCrop from daily climate observations; the software outputs are reference evapotranspiration, maximum and minimum temperature, and precipitation. Additionally, we focus on how the input data affect the efficiency of predicting climate change scenarios. For that purpose, we applied this machine learning algorithm to two case studies that differ in the type of input data. We found that the outcome of polynomial regression is very sensitive to those input factors.

Keywords

  • python
  • polynomial regression
  • Pearson correlation
  • AquaCrop
  • weather forecasting

1. Introduction

Weather forecasting plays a vital role in society, the environment, and social development. Its utility has grown beyond the simple mission of informing users about weather behavior during the upcoming period. Indeed, the whole concept has changed: forecasts are now used by policymakers and decision-makers to reduce the socioeconomic losses that climate can generate, since climate plays a substantial role in assuring quality of life and economic prosperity. For example, the United States recorded 96 natural disasters between 1980 and 2006, with total losses exceeding 700 billion dollars. Around 629 fatalities per year are directly caused by weather disasters, and more than 60,000 premature deaths recorded annually are attributed to poor air quality. Additionally, more than 1.5 million road crashes are caused by weather, resulting in 7400 deaths, 700,000 injuries, and around 42 billion dollars in losses, while more than 42 billion dollars is estimated to be lost to weather-related traffic delays [1]. On the other hand, researchers use weather forecasting to set new strategies that benefit the most from weather changes, while environmentalists focus on protecting ecosystems from possible climate change threats.

Weather services are expanding sharply. Profits gained from weather forecasts and disaster warnings have exceeded 31 billion dollars. Good forecasting leads to earlier warnings and thus to better precautions, which eventually contribute to reducing weather fatalities and economic losses. Techniques for observing and studying weather behavior are continuously progressing, but how the climate will react in the future is still often too hard to simulate. In this regard, the diversity of machine learning algorithms has made it possible to predict various climate responses, providing increasingly accurate forecasts. Since then, numerical and spatial resolution have been continuously refined, demanding ever more challenging computing capabilities.

Meanwhile, the climate remains highly sensitive and chaotic. A perfect forecast is therefore often out of reach, but ideally every weather forecast model should come with a certain measure of confidence, which depends on many external parameters (type of data, coordinates, altitude, etc.). Generally, the best forecast model is the one that implements voluminous data and various methods/models, so we must limit the analysis to the parameters that most influence the behavior of the climate. Machine learning algorithms can generate multiple forecast scenarios from slightly different initial inputs and/or changes in some stochastic parametrizations or model formulations. Forecast uncertainty depends mostly on the observed initial state of the atmosphere, plus a certain random factor, and any change in the format of the weather data can modify the observed state of the atmosphere at a given point. Consequently, a generated model needs considerable time between processing and delivery to the users.

Machine learning is a field of computer science based on artificial intelligence. The whole concept is to provide learning capability to machines (computers or other devices) without their being explicitly programmed. It aims to conceive suitable models and algorithms that learn and forecast from input data [2, 3]. In addition, machine learning algorithms are efficiently used to describe the behavior of a dataset, even when the data are noisy and nonstationary. A model takes input features, produces an expected output, and forecasts suitable output features based on its historical records. Given the wide availability of weather data, fast and accurate decision-making is becoming more vital than ever, and machine learning algorithms are therefore among the best alternatives for forecasting weather behavior. Besides, they can easily adapt themselves to changing trends inside datasets and can thus generate models based on the input data instead of applying a conventional generalized model.

Many research studies have addressed weather forecasting using several analysis methods. Early studies were based on persistence and statistical methods such as regression models [4], which remain the most common statistical approaches for weather prediction. One of the best-known regression models is the polynomial approach, which provides an effective way to describe complex and voluminous datasets in nonlinear form. Polynomial regression models rely on the observed relationship between the dependent and independent variables to find the most suitable polynomial order.

Therefore, the present chapter lays out the outcome of using polynomial regression models for weather forecasting and exposes the factors and/or parameters that can affect the efficiency of the prediction. We reveal the risk of dealing with large data volumes and show how the format of the input data affects the accuracy of the model. We review the application of the Pearson correlation in screening and homogenizing the data and how it can affect the predicted climate behavior. Finally, we discuss which polynomial order best fits each type of data and its ability to generate a valid and credible weather forecast model.


2. Material and data

Meteorological data were derived from two sources. In the first case, daily climate data were collected and processed in AquaCrop to calculate a yearly average for each of the following parameters: maximum and minimum temperature (Tmax and Tmin), reference evapotranspiration (ET0), and precipitation (P). These data were assembled for the Mellegue watershed between 2002 and 2019. In the second case, we collected monthly precipitation for the city of Zaghouan (Tunisia) from the CHIRPS website for the period 1981–2019. The aim is to test the sensitivity of the machine learning algorithm to the type of input data. The precipitation data were then fed into the polynomial regression equation to predict rainfall behavior for the next decade (2020–2030).

2.1 Polynomial regression

2.1.1 Method

The main algorithm used in this study is polynomial regression. It has been widely applied, and its statistical tools are well known [5]. Generally speaking, it is a form of linear regression, which is why it is sometimes called the “polynomial linear regression model.” It is a form of regression analysis that links the independent variable (x) to the dependent variable (y) through an nth-degree polynomial [4]. The general equation of polynomial regression is written in the following form:

y = β0 + β1·x + β2·x² + … + βn·xⁿ   (Eq. 1)

β1 = linear effect parameter.

β2 = quadratic effect parameter.

β0 = a constant parameter, equal to the value of the polynomial function when x = 0.

In some polynomial equations, an additional term (θ), called the residual error, is also included.

Put another way, polynomial regression can be defined as linear regression between the dependent and independent variables to which extra terms (capturing a curvilinear relationship) are added. Dropping those terms simply returns the linear regression model, in which case the previous equation becomes:

y = β0 + β1·x   (Eq. 2)
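To make the notation concrete, here is a minimal sketch (purely illustrative, reusing the station 1 coefficients reported later in Table 1) of how a fitted quadratic form of Eq. (1) is evaluated for future years:

```python
# Evaluate a fitted quadratic polynomial y = b0 + b1*x + b2*x^2 (Eq. (1) truncated
# at n = 2). The coefficient values are illustrative, taken from station 1 of Table 1.
import numpy as np

b0, b1, b2 = 71982.188, -67.603, 0.016
years = np.array([2020, 2025, 2030])

y_hat = b0 + b1 * years + b2 * years**2
print(dict(zip(years.tolist(), np.round(y_hat, 1).tolist())))
```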

2.1.2 Why polynomial regression?

Climate data generally behave in a nonlinear way. As a result, a linear regression model struggles to visualize or predict the data fully: it is hard to draw a straight line that best fits weather data, the performance of the model is far from reality, and the resulting weather projection is consequently doubtful. In this case, we opt for polynomial regression to fit the data with a low error (Figure 1).

Figure 1.

Difference between linear and polynomial regression.

2.1.3 How to build polynomial regression in Python?

Python has several methods for finding the curvilinear relationship between data points. Based mainly on mathematical equations, the algorithm draws the polynomial regression line that best fits the original input data. The fitted curve can then be used to predict future values over a specific timescale. Several packages can be imported to operate this algorithm, but the main required libraries are “pandas,” “matplotlib,” “numpy,” and “sklearn.”

The process is quite easy with Python (Figure 2). The first step consists of importing the major libraries, especially “pandas,” “numpy,” and “sklearn.” You then need to import the CSV file, which must have at least two columns: column A contains the date (any conventional date format is acceptable, provided Python can parse it), and the second column contains the observed values (the climate statistics in our cases). Next, you have to assign each column to a variable and sort the values chronologically using “numpy.” From there, all that remains is to apply the polynomial order that best fits the observed data. You can assess the behavior of the polynomial regression graphically (using matplotlib) or statistically (R2 and “model.summary”). Once the right model order is found, you can go further and forecast the results for a specific time period; a minimal sketch of this workflow is given after Figure 2.

Figure 2.

Script of the polynomial regression model in Python.
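Since Figure 2 is reproduced as an image, the following is a minimal sketch of the same workflow rather than the authors' exact script; the file name and the column names "date" and "value" are assumptions:

```python
# Minimal polynomial-regression workflow: load a two-column CSV, fit a quadratic,
# check R2, and plot the fit extended to 2030.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

df = pd.read_csv("climate.csv")               # column "date" and column "value" (observed)
df["date"] = pd.to_datetime(df["date"])       # the date format must be parseable by Python
df = df.sort_values("date")

x = df["date"].dt.year.to_numpy(dtype=float)  # numeric time axis (years)
y = df["value"].to_numpy(dtype=float)

coeffs = np.polyfit(x, y, deg=2)              # quadratic polynomial regression
model = np.poly1d(coeffs)

print("R2 =", r2_score(y, model(x)))          # statistical check of the fit

xx = np.linspace(x.min(), 2030, 200)          # extend the axis to forecast up to 2030
plt.scatter(x, y, s=10, label="observed")
plt.plot(xx, model(xx), color="red", label="degree-2 fit / forecast")
plt.legend()
plt.show()
```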

2.1.4 How to judge the efficiency of polynomial regression?

  • Statistically: based on R2 (coefficient of determination)

By definition, R2 is an indicator that measures the quality of a prediction made by linear regression. It is also known as the Pearson coefficient of determination. It measures how accurately the model fits the observed data, that is, how well the regression equation describes the distribution of the sampling points. The coefficient varies from 0 to 1 and reflects the strength of the prediction model. If R2 is zero, the chosen mathematical model (based on linear regression) fails to describe and/or fit the distribution of the points. Conversely, if R2 is 1 (or close to 1), the selected regression line accounts for the entire (100%) point cloud, showing the ability of the chosen equation to describe the point distribution. In short, the closer R2 is to zero, the more the scatter plot is dispersed around the regression line; as R2 tends to 1, the points congregate around the regression line.
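As a small numerical illustration of this definition (the values below are made up), R2 can be computed directly from its formula and checked against scikit-learn:

```python
# R2 = 1 - SS_res / SS_tot, computed by hand and verified with sklearn's r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_obs = np.array([410.0, 388.0, 502.0, 455.0, 470.0])   # hypothetical observations
y_fit = np.array([400.0, 395.0, 490.0, 460.0, 480.0])   # hypothetical fitted values

ss_res = np.sum((y_obs - y_fit) ** 2)           # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

assert np.isclose(r2_manual, r2_score(y_obs, y_fit))
print(round(r2_manual, 3))                      # ~0.95: points lie close to the fit
```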

  • Graphically: based on back-forecasting

When we say predict, we usually mean anticipating a future response based on past or current values. Here, instead, we want to compare the forecast equation with real data, so we use the concept of back-forecasting. Generally, a time series model looks the same going forward or backward and can therefore predict the past as well as the future: instead of going forward, we use the model to go backward in time and predict the past. The model uses two backward passes. The first is used to estimate the parameters from the early data, while the second establishes the forecasting equation used to calculate the forward (future) values. In Figure 3, we used the back-forecasting method to squeeze the most information out of the 468 monthly records (1981 to 2019). The closer the back-forecasted graph is to the real data plot, the more accurate the forecasting equation will be; a minimal sketch of this idea is given after Figure 3.

Figure 3.

Comparison between the back-forecasting plot (based on ARIMA(1,1,1)) and the real rainfall values of Zaghouan.
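The sketch below illustrates the back-forecasting idea with statsmodels: the series is reversed, an ARIMA(1,1,1) is fitted, and the in-sample predictions are flipped back for comparison with the observed record. The file and column names are assumptions; this is not the exact setup behind Figure 3.

```python
# Back-forecasting sketch: fit the model on the time-reversed series and compare
# the "backward" predictions with the observed start of the record.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rain = pd.read_csv("zaghouan_monthly.csv",
                   parse_dates=["date"], index_col="date")["rain"]

reversed_rain = rain[::-1].reset_index(drop=True)   # read the series backwards
fit = ARIMA(reversed_rain, order=(1, 1, 1)).fit()   # ARIMA(1,1,1), as in Figure 3

backcast = fit.predict(start=0, end=len(reversed_rain) - 1)[::-1]
backcast.index = rain.index                         # re-align with calendar time

comparison = pd.DataFrame({"observed": rain, "back_forecast": backcast})
print(comparison.head(12))                          # the closer, the better the model
```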

2.2 Pearson correlation

Also known as the bivariate correlation, it is used to measure the strength of the linear relationship between two sets of data. Mathematically speaking, it is the ratio between the covariance of the two variables and the product of their standard deviations (Eq. (3)). With such a formula, the resulting value always lies between −1 and 1, where values close to 1 indicate a good correlation and uncorrelated values are close to zero. Generally speaking, the absolute value of the correlation reflects the strength of the association (negative values close to −1 indicate that the two variables are inversely correlated, meaning that a rise in the first variable leads to a decrease in the other).

The general equation of r (ρ(X,Y)) is written in the following form:

ρ(X,Y) = cov(X, Y) / (σX·σY)   (Eq. 3)

where ρ(X,Y): Pearson correlation (r);

cov(X, Y): covariance;

σ: standard deviation.

We tested the relationship between the station data using the Pearson coefficient. In applying this correlation, we discarded all data from stations with a correlation below 0.6, as recommended in some studies [6]. The polynomial regression was then rerun on the validated data, that is, only the strongly correlated stations. The program is quite simple in Python; Figure 4 shows the script that applies the Pearson correlation and exports the results to a CSV file, and a minimal sketch follows it.

Figure 4.

Script of the application of the Pearson correlation in Python.
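A minimal sketch of this screening step is given below; the file name, the one-column-per-station layout, and the choice of a reference station are assumptions rather than the authors' exact script shown in Figure 4.

```python
# Compute pairwise Pearson correlations between stations, export them to CSV,
# and keep only the stations whose correlation with a reference station is >= 0.6.
import pandas as pd

stations = pd.read_csv("mellegue_stations.csv", index_col=0)   # one column per station

corr = stations.corr(method="pearson")       # pairwise Pearson r matrix
corr.to_csv("pearson_correlation.csv")       # export the results

reference = "station_1"                                   # hypothetical reference column
valid = corr[reference][corr[reference] >= 0.6].index     # 0.6 threshold, as in [6]
filtered = stations[valid]                                 # data kept for the regression
print("kept stations:", list(valid))
```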

2.3 Data type

2.3.1 Yearly data: Mellegue catchment

The watershed of Mellegue is situated between 36° 25′ 50.43″ and 35° 12′ 20.74″ north and between 7° 11′ 30.98″ and 8° 55′ 7.99″ east (Figure 5). The area covers 10,500 km2, of which approximately 60% belongs to Algeria, while the remaining surface, as well as the outlet, is situated in Tunisia. The local climate is arid to semiarid, shifting to sub-humid in the north. As a consequence, the area is covered with low vegetation, with a few forests situated mainly in the northeast of the basin. The average yearly temperature is around 17°C, with a variation of 1–2°C. Winter provides 50% of the annual rainfall. The Oued Mellegue is the main river of the catchment; it is 290 km long and cuts the watershed in a northeast-southwest direction. Because of the large area, we assembled weather data from 23 stations scattered all around the catchment. This large amount of data was then combined to compute a yearly average value for each weather variable. Homogenizing all the weather data in this way helps in understanding and visualizing the response of the watershed to climate change and relieves the model of voluminous data processing, whereas dealing with each station separately would weaken the prediction of the area's climate behavior [7].

Figure 5.

Geographic location of the watershed of Mellegue.

2.3.2 Monthly data: the city of Zaghouan

The city of Zaghouan is a Tunisian town located at 36° 24′ north, 10° 09′ east (Figure 6). The city is situated in the northeast of Tunisia, specifically on the slopes of Jbel Zaghouan (1295 m altitude). The region is known for its abundance of high reliefs and a large number of water sources linked to the active seismicity of the zone, mainly manifested in the Zaghouan fault. This tectonic structure extends for approximately 80 km along the northeast-southwest Atlasic trend, with a total vertical displacement exceeding 5 km. The climate of the study area is semiarid, with an average annual temperature of 18°C and total rainfall of approximately 500 mm. The estimated population of the city is around 20,837. Local activities rely largely on agriculture; around 300,000 ha are cultivated, yielding about 1.4 million quintals per year [8, 9].

Figure 6.

Geographic location of the city of Zaghouan.

2.4 Results

2.4.1 Yearly data: the case of Mellegue catchment

2.4.1.1 ET0

The results of the polynomial regression (Figure 7 and Table 1) show a general decrease in ET0 compared with the initial state of 1250 mm/year recorded in 2019. The overall decrease is estimated at an average of 11 mm/year, reaching 1115 mm/year by the end of 2030. The polynomial regression based on the Pearson correlation decreases the results much further, with forecast ET0 estimated at 1060 mm in 2030, an average loss of 16.2 mm/year. Several causes have been linked to the decrease of reference evapotranspiration, but two of the most influential factors are temperature and wind. In fact, many studies have predicted an increase in temperature of up to 5°C (depending on the coordinates, local climate, etc.), and, counterintuitively, the rise of the surface temperature leads to a general decrease in ET0, a phenomenon known as the “evaporation paradox” [10, 11]. On the other hand, the expansion of urbanized areas will also affect ET0, since it will undoubtedly generate more polluted air, which has a severe negative impact on ET0 [12]. Additionally, the general decrease of water bodies could lead to an indirect decrease in the annual value of ET0 [13].

Figure 7.

Quadratic polynomial regression forecast of ET0.

Station | R2 | Alpha | Beta 1 | Beta 2 | Pear Corr
(R2, Alpha, Beta 1, and Beta 2 refer to the quadratic polynomial fit, ° = 2)
1 | 0.442 | 71982.188 | −67.603 | 0.016 | Valid
2 | | −135589.162 | 140.694 | −0.036 | Valid
3 | | −898739.862 | 903.865 | −0.227 | Valid
4 | | −1260271.181 | 1264.865 | −0.317 | Valid
5 | | −1060164.556 | 1063.572 | −0.266 | Valid
6 | | −182580.374 | 185.367 | −0.047 | Valid
7 | | −355144.733 | 358.103 | −0.090 | Valid
8 | | −333614.344 | 336.200 | −0.084 | Valid
9 | | −306578.982 | 308.705 | −0.077 | Valid
10 | | −856433.461 | 858.485 | −0.215 | Valid
11 | | −1394177.144 | 1402.250 | −0.352 | Valid
12 | | −1436810.518 | 1445.436 | −0.363 | Valid
13 | | −1245118.628 | 1252.851 | −0.315 | Valid
14 | | −878430.329 | 882.910 | −0.222 | Valid
15 | | −520073.866 | 522.075 | −0.131 | Valid
16 | | −2341253.349 | 2342.237 | −0.585 | Valid
17 | | −2708961.470 | 2715.734 | −0.680 | Valid
18 | | −856517.732 | 873.275 | −0.222 | Valid
19 | | −36331.810 | 37.729 | −0.010 | Valid
20 | | −1060164.556 | 1063.572 | −0.266 | Valid
21 | | 2746547.110 | −2734.734 | 0.681 | Invalid
22 | | 316197.814 | −327.556 | 0.085 | Invalid

Table 1.

Coefficients of the ET0 prediction for the Mellegue catchment.

2.4.1.2 P

Based on the following results (Figure 8 and Table 2), the rainfall forecast is generally stable with a slightly increasing tendency, predicting precipitation of 575 mm/year in 2030. Conversely, the application of the Pearson-based polynomial regression shows a decrease in P, with total annual rainfall of 352 mm by the end of 2030. Based on those values, we can conclude that the area will remain under semiarid climate conditions. Rainfall is always hard to predict because of the variability of the climate and the complexity of the related parameters (evapotranspiration, diurnal temperature, wind, relative humidity, etc.). The polynomial rise in rainfall could be explained by the general character of the semiarid climate, where precipitation varies seasonally in intensity and quantity. Conversely, the prediction of the Pearson-based polynomial regression agrees with many other studies that suggest a considerable loss of precipitation by the year 2100, which will be particularly noticeable in Mediterranean regions [14].

Figure 8.

Quadratic polynomial regression forecast of P.

Station | R2 | Alpha | Beta 1 | Beta 2 | Pear Corr
(R2, Alpha, Beta 1, and Beta 2 refer to the quadratic polynomial fit, ° = 2)
1 | 0.45 | 1955249.056 | −1944.213 | 0.483 | Invalid
2 | | 613267.850 | −612.659 | 0.153 | Invalid
3 | | 715018.701 | −716.287 | 0.179 | Invalid
4 | | 2811409.072 | −2808.597 | 0.702 | Invalid
5 | | −362350.042 | 383.213 | −0.101 | Valid
6 | | 2694179.641 | −2670.272 | 0.662 | Valid
7 | | 2680735.502 | −2657.038 | 0.658 | Valid
8 | | 2108146.344 | −2084.830 | 0.515 | Valid
9 | | 572407.085 | −558.133 | 0.136 | Valid
10 | | 315977.725 | −312.109 | 0.077 | Valid
11 | | 2445104.816 | −2433.165 | 0.605 | Invalid
12 | | 3188164.611 | −3170.534 | 0.788 | Valid
13 | | 2497308.704 | −2475.376 | 0.614 | Valid
14 | | 1209763.715 | −1185.469 | 0.290 | Valid
15 | | 1464559.714 | −1439.674 | 0.354 | Valid
16 | | −60450.995 | 72.904 | −0.021 | Valid
17 | | −1642843.790 | 1643.988 | −0.411 | Valid
18 | | 6646561.474 | −6631.849 | 1.654 | Invalid
19 | | 8793045.164 | −8760.092 | 2.182 | Invalid
20 | | −1108225.764 | 1126.116 | −0.286 | Valid
21 | | 44922.783 | −33.671 | 0.006 | Valid
22 | | 4539127.186 | −4520.044 | 1.125 | Invalid

Table 2.

Coefficients of the P prediction for the Mellegue catchment.

2.4.1.3 Tmax and Tmin

According to the following results, the polynomial forecasts announce a slight decrease in maximum temperature (Tmax) of approximately 1°C compared with the initial value of 22.5°C recorded in 2019 (Figure 9 and Table 3). The Pearson-based polynomial regression shows the opposite behavior, indicating a potential rise in Tmax of about 4°C (26.3°C in 2030) compared with the 2019 value. As for Tmin (Figure 10 and Table 4), the general trend declines sharply: the minimum temperature is expected to reach 6.7°C in 2030, down from 8.6°C in 2019. Regarding the r-based polynomial regression, the reconstructed values show a small decrease compared with the ordinary polynomial curve, with Tmin reaching 6.2°C by the end of 2030. The difference between the two curves would grow if the time interval were extended. Except for the polynomial regression based on the r correlation, all curves suggest a decline in temperature during the coming period, an outcome that has been supported in some similar cases [15]. On the other hand, the forecast Tmax based on the Pearson polynomial regression matches the broad conclusion of many studies that claim a general increase in temperature [16, 17]. Another crucial phenomenon is the “urban heat island,” which reflects the impact of urbanized areas in destabilizing the surface temperature [17, 18].

Figure 9.

Quadratic polynomial regression forecast of Tmax.

Station | R2 | Alpha | Beta 1 | Beta 2 | Pear Corr
(R2, Alpha, Beta 1, and Beta 2 refer to the quadratic polynomial fit, ° = 2)
1 | 0.467 | −4980.173 | 5.042 | −0.001 | Invalid
2 | | −9726.882 | 9.776 | −0.002 | Invalid
3 | | −14511.587 | 14.568 | −0.004 | Invalid
4 | | −23027.041 | 23.057 | −0.006 | Invalid
5 | | −2610.073 | 2.607 | −0.001 | Valid
6 | | −1249.637 | 1.281 | 0.000 | Invalid
7 | | 31420.577 | −31.295 | 0.008 | Valid
8 | | 5659.772 | −5.656 | 0.001 | Valid
9 | | 4127.832 | −4.116 | 0.001 | Valid
10 | | −29400.171 | 29.353 | −0.007 | Invalid
11 | | −20620.933 | 20.694 | −0.005 | Invalid
12 | | −18063.359 | 18.138 | −0.005 | Invalid
13 | | −12779.390 | 12.826 | −0.003 | Invalid
14 | | 35242.347 | −35.089 | 0.009 | Valid
15 | | −499.410 | 0.506 | 0.000 | Valid
16 | | −15299.754 | 15.350 | −0.004 | Invalid
17 | | −21567.206 | 21.631 | −0.005 | Invalid
18 | | −17724.813 | 17.840 | −0.004 | Invalid
19 | | −17130.249 | 17.181 | −0.004 | Invalid
20 | | −2610.073 | 2.607 | −0.001 | Valid
21 | | 63652.590 | −63.429 | 0.016 | Valid
22 | | −20955.317 | 20.796 | −0.005 | Valid

Table 3.

Coefficients of the Tmax prediction for the Mellegue catchment.

Figure 10.

Quadratic polynomial regression forecast of Tmin.

Station | R2 | Alpha | Beta 1 | Beta 2 | Pear Corr
(R2, Alpha, Beta 1, and Beta 2 refer to the quadratic polynomial fit, ° = 2)
1 | 0.648 | −15061.536 | 15.056 | −0.0038 | Valid
2 | | −15104.129 | 15.116 | −0.0038 | Valid
3 | | −16539.754 | 16.569 | −0.0041 | Valid
4 | | −19188.385 | 19.222 | −0.0048 | Valid
5 | | −17771.726 | 17.806 | −0.0045 | Valid
6 | | −1198.441 | 1.210 | −0.0003 | Invalid
7 | | 1745.406 | −1.719 | 0.0004 | Invalid
8 | | −15167.497 | 15.148 | −0.0038 | Valid
9 | | −21793.646 | 21.777 | −0.0054 | Valid
10 | | −11902.214 | 11.955 | −0.0030 | Valid
11 | | −13972.995 | 13.983 | −0.0035 | Valid
12 | | −11936.938 | 11.972 | −0.0030 | Valid
13 | | −11066.352 | 11.111 | −0.0028 | Valid
14 | | −22346.007 | 22.350 | −0.0056 | Valid
15 | | −20194.330 | 20.203 | −0.0051 | Valid
16 | | −31769.096 | 31.718 | −0.0079 | Valid
17 | | −19493.627 | 19.517 | −0.0049 | Valid
18 | | −17911.162 | 17.991 | −0.0045 | Valid
19 | | −15827.268 | 15.904 | −0.0040 | Valid
20 | | −17771.726 | 17.806 | −0.0045 | Valid
21 | | 55076.342 | −54.831 | 0.0136 | Invalid
22 | | 16150.051 | −15.964 | 0.0039 | Valid

Table 4.

Coefficients of the Tmin prediction for the Mellegue catchment.

2.4.2 Monthly data: the case of Zaghouan

2.4.2.1 P

According to the results in Figure 11, the quartic polynomial regression (order n = 4) predicts a general decrease in precipitation with an average loss of 20 mm/year. In general, forecasting precipitation is always disputable. Some studies suggest, based on scientific results, a potential shortage of precipitation, with droughts becoming longer and more intense due to an expected rise in heat. On the other hand, some places will see a potential growth in total rainfall because of their local climate and geographic coordinates. To summarize, warm places will become warmer, leading to a shortage of precipitation, whereas wet areas will become wetter, intensifying rainfall amounts. Returning to our model, the generated trend matches the general idea that forecast rainfall will decrease near the equator, with an expected total loss of around 20% for Mediterranean areas [14].

Figure 11.

Quartic polynomial regression forecast of P for the case of Zaghouan.


3. Discussion

Considering the application of polynomial regression to the Mellegue watershed, the best polynomial equation appears to be quadratic (n = 2). The average R2 of the polynomial regression is around 0.5, a value that rises as the polynomial degree increases [19]. Nonetheless, the forecast results drift too far from reality.

It seems that polynomial regression cannot deal with big data. The catchment is a very large basin, so studying a specific phenomenon requires collecting a large amount of data from different stations in order to cover the entire area. Homogenizing the data into a single average response helps the machine learning algorithm reduce error and uncertainty from voluminous data, so that polynomial regression can fit the resulting weather curvature [20]. Unfortunately, such averaged data are of little use to decision-makers and environmentalists because they can lead to misinterpretation when addressing the potential causes of a given phenomenon.

As seen in the previous example, choosing the order that best fits the data type is often challenging, and this is considered one of the major handicaps of working with polynomial regression [19]. To deal with this drawback, the choice of order must not be arbitrary; it requires visualization and experience with similar data (plots, R2, etc.). Increasing the order generally raises the R2 coefficient, but it tends to overestimate and amplify the forecast results. A frequently used approach consists of progressively increasing the order until the model fits the data (a minimal sketch of this strategy is given below); other well-known strategies include the forward selection procedure and backward elimination [21]. Despite these strategies, polynomial regression remains very sensitive to outliers and sometimes takes unanticipated turns in inappropriate directions [22]. Such an example was very clear in predicting the minimum and maximum temperatures, where the forecast models (except the Pearson-based polynomial regression) showed a decrease in temperature, contrary to most studies, which indicate a general increase in heat because of global warming. This may be due to incorrect interpolation and/or extrapolation of the original data caused by the natural instability of climate data. The sensitivity may also stem from the lack of techniques for validating the detection of outliers. Additionally, the presence of even a few outliers (one or two) in the data can badly affect the results of the nonlinear analysis [20].
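As an illustration of the "increase the order until it fits" strategy (a sketch, not the authors' procedure), the helper below steps through polynomial degrees and stops when the in-sample R2 no longer improves noticeably; the 1% improvement threshold is an arbitrary choice:

```python
# Increase the polynomial order until the in-sample R2 stops improving noticeably.
import numpy as np
from sklearn.metrics import r2_score

def best_polynomial_order(x, y, max_order=5, min_gain=0.01):
    """Return (order, R2) of the simplest order after which adding one more
    degree no longer improves R2 by at least `min_gain`."""
    previous_r2 = -np.inf
    for order in range(1, max_order + 1):
        model = np.poly1d(np.polyfit(x, y, deg=order))
        r2 = r2_score(y, model(x))
        if r2 - previous_r2 < min_gain:     # negligible gain: keep the simpler model
            return order - 1, previous_r2
        previous_r2 = r2
    return max_order, previous_r2

# Example use: order, r2 = best_polynomial_order(years, rainfall)
```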

Another important drawback that should be reported is the incapacity of polynomial regression to deal with very sensitive data such as weather. Even high orders cannot keep up with the original data curves. Therefore, many sample points are not captured by the polynomial model, which affects the projection of the forecast weather results. Additionally, as seen in most of the previous examples, the weather prediction shows a near-linear trend that does not conform to the general character of weather, which behaves seasonally and should therefore have a certain sinusoidal tendency. In the two study areas, most of the selected polynomial orders are quadratic (n = 2), which coincides with the general assumption that the first and second orders of polynomial regression are the most commonly used [21].

Referring to the case of Zaghouan, the statistics of the polynomial regression revealed a pessimistic result. Regardless of the order (n), R2 remained too low (close to 0.1). This points to the incapacity of polynomial regression to fit most of the 468 input records (from 1981 to 2019), and for that reason it seemed irrelevant to keep changing the order of the equation. Accordingly, we converted the data to an annual form (a sketch of this conversion is given after Figure 12), and the model responded differently. As shown in Figure 12, the behavior of the fourth-order polynomial equation differs between monthly and yearly data. The model clearly performs better with annual data, as R2 rose to 0.4, compared with 0.01 for the monthly data (Figure 12).

Figure 12.

Difference in the behavior of polynomial regression for monthly and yearly data (case of Zaghouan).
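A minimal sketch of this monthly-to-yearly conversion with pandas (the file and column names are assumptions) might look like the following:

```python
# Aggregate 468 monthly rainfall values (1981-2019) into 39 annual totals before refitting.
import pandas as pd

monthly = pd.read_csv("zaghouan_monthly.csv", parse_dates=["date"], index_col="date")

yearly = monthly["rain"].resample("YS").sum()   # sum each calendar year
yearly.index = yearly.index.year                # keep the year as the regression variable
print(yearly.head())
```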

There are many advantages to using polynomial regression. Besides its simplicity of processing, it works well at capturing the relationship between independent and dependent variables. Another good aspect of polynomial functions is their aptitude for fitting numerous curvatures, depending mainly on the type and trend of the data. To conclude, a polynomial function may not be highly accurate, but it can generate an acceptable weather forecast: it can be helpful for forecasting the general behavior of the climate over a specific range of time without the need to be too precise.

The Pearson correlation is revealed to be effective in evaluating the statistical relationship, or dependency, between two variables. Even though the method is based on the covariance and a mathematical equation, it is still not scientifically reliable to determine the strength and direction of an association based on the Pearson correlation alone. In the example of the Mellegue catchment, we saw how the r coefficient indicated the strength of correlation between the stations, and every climate parameter responded differently from the others. Tables 1-4 show that the correlation results for one climate parameter are totally different from those of another (compare, for instance, Tables 3 and 4 for the Tmax and Tmin correlations). As a result, non-correlated values (below 0.6) were eliminated. The behavior of the prediction model based on the r correlation then changed compared with the ordinary polynomial model. Sometimes the change is barely noticeable, as in the Tmin case (Figure 10), where the differences were slight, but it can also invert the curve, as shown in Figure 8 (rainfall) and Figure 9 (Tmax). In fact, the use of the r correlation is doubtful in this case. The correlation between stations is based on real observed values that describe the local climate of each area (temperature, wind, precipitation, etc.). Given how large the study area is, it is quite normal for every station to have its own climate behavior; in other words, the station records are bound to be heterogeneous, with each station reflecting the local climate of its area. So, by eliminating some stations based on the r coefficient, we alter the general response of the catchment to a given phenomenon (land use change, erosion, etc.). Still, the r correlation can be very useful in weather studies for evaluating the connection between two weather parameters (at the same station or area). It can be used for a restricted study area (the same station, for example) with a specific local climate and multiple weather parameters. In general, judging the suitability of the Pearson correlation depends on the case study, but it is recommended to avoid its use for highly heterogeneous and/or sensitive data.


4. Summary and further works

The polynomial regression algorithm was implemented in Python for weather forecasting. We wanted to examine the performance of the model according to the type of input data, so it was applied to two concrete case studies. We also tested the relationship between stations using the Pearson correlation as a way to screen and homogenize the data. The results opened a wide debate. The most convenient polynomial order turned out to be quadratic, which agrees with the general view that the most applied polynomial orders are the first and second degree [21]. Additionally, the model showed a good capacity to fit various complex data [22]. On the other hand, the polynomial regression based on the Pearson correlation significantly altered the accuracy of the weather prognosis. The behavior of the model changed drastically when going from monthly to yearly data. Judging from the plots and the coefficients, polynomial regression fits yearly data better, for two simple reasons: this regression operates more efficiently with moderate to low amounts of input data, and it is too sensitive to variable data such as the seasonal climate. This was seen in the case of Zaghouan (monthly data), where the model did not succeed in finding a good polynomial order (R2 was too low). All these pieces of evidence may point to the insufficiency of polynomial algorithms for forecasting weather from both monthly and yearly data, or perhaps to a flawed conception of the model. Regardless of the reason, it is necessary to look for alternative algorithms such as seasonal autoregressive integrated moving average models (SARIMA), which can deal more efficiently with voluminous and unstable data; a brief sketch is given below. We should also improve the confidence of the input data by applying other machine learning algorithms to homogenize the data and eliminate uncorrelated samples. However, this would require much more processing time and more complex algorithms, which is deferred to future work.
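As an indication of that direction, here is a hedged sketch of a SARIMA fit with statsmodels; the (1,1,1)(1,1,1,12) order is illustrative and not tuned, and the file and column names are assumptions:

```python
# SARIMA sketch for the monthly Zaghouan series: a seasonal ARIMA with a 12-month cycle.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rain = pd.read_csv("zaghouan_monthly.csv",
                   parse_dates=["date"], index_col="date")["rain"]

model = SARIMAX(rain, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

forecast = result.forecast(steps=12 * 11)    # monthly forecast through 2030
print(forecast.tail())
```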

References

  1. National Research Council. When Weather Matters. Washington, D.C., United States: National Academy of Sciences; 2010
  2. Mavromatidis LE, Bykalyuk A, Lequay H. Development of polynomial regression models for composite dynamic envelopes' thermal performance forecasting. Applied Energy. 2013;104:379-391
  3. Zjavka L. Wind speed forecast correction models using polynomial neural networks. Renewable Energy. 2015;83:998-1006
  4. Bradley RA, Srivastava SS. Correlation in polynomial regression. The American Statistician. 1979;33(1):11-14
  5. Ostertagová E. Modelling using polynomial regression. Procedia Engineering. 2012;48:500-506
  6. McQuistan A. Using Machine Learning to Predict the Weather: Part 2 [Internet]. stackabuse.com. 2017. Available from: https://stackabuse.com/using-machine-learning-to-predict-the-weather-part-2/?fbclid=IwAR3LAOyadJ48EMsG5NuTS0dKJIuLCjoGc8YmqJmP0SNIZ0hdtTIkyIpLr40
  7. Rodier JA, Colombani J, Claude J, Kallel R. Le Bassin de la Mejerdah [Internet]. 1981. 451 p. Available from: https://www.worldcat.org/title/bassin-de-la-mejerdah/oclc/469086740
  8. Ameur M, Hamzaoui-Azaza F, Gueddari M. Nitrate contamination of Sminja aquifer groundwater in Zaghouan, Northeast Tunisia: WQI and GIS assessments. Desalination and Water Treatment. 2016;57(50):23698-23708
  9. Souissi F, Jemmali N, Souissi R, Dandurand JL. REE and isotope (Sr, S, and Pb) geochemistry to constrain the genesis and timing of the F-(Ba-Pb-Zn) ores of the Zaghouan District (NE Tunisia). Ore Geology Reviews. 2013;55(C):1-12. DOI: 10.1016/j.oregeorev.2013.04.001
  10. Roderick ML, Farquhar GD. The cause of decreased pan evaporation over the past 50 years. Science. 2002;298(5597):1410-1411
  11. Lin P. Impacts of climate change on reference evapotranspiration in the Qilian Mountains of China: Historical trends and projected changes. International Journal of Climatology. 2018;38(7):1-14
  12. Yao L. Causative impact of air pollution on evapotranspiration in the North China Plain. Environmental Research. 2017;158:436-442. DOI: 10.1016/j.envres.2017.07.007
  13. Ramarohetra J, Sultan B. Impact of ET0 method on the simulation of historical and future crop yields: A case study of millet growth in Senegal. International Journal of Climatology. 2017;38(2):729-741
  14. Hausfather Z. Explainer: What climate models tell us about future rainfall. Carbon Brief [Internet]. 2018. Available from: https://www.carbonbrief.org/explainer-what-climate-models-tell-us-about-future-rainfall
  15. Bathiany S, Dakos V, Scheffer M, Lenton TM. Climate models predict increasing temperature variability in poor countries. Science Advances. 2018;4(5):1-11
  16. Lapenis A. A 50-year-old global warming forecast that still holds up. Eos [Internet]. 2020. DOI: 10.1029/2020EO151822
  17. Ackerman B. Temporal march of the Chicago heat island. Journal of Applied Meteorology and Climatology. 1985;24(6):547-554
  18. Zhang J, Dong W, Wu L, Wei J, Chen P, Lee DK. Impact of land use changes on surface warming in China. Advances in Atmospheric Sciences. 2005;22(3):343-348
  19. Edwards JR, Parry ME. On the use of polynomial regression equations as an alternative to difference scores in organizational research. Academy of Management Journal. 1993;36(6):1577-1613
  20. Zaw WT, Naing TT. Modeling of rainfall prediction over Myanmar using polynomial regression. In: Proceedings of the 2009 International Conference on Computer Engineering and Technology (ICCET 2009). 2009;1:316-320
  21. Shalabh IK. Polynomial regression models. In: Regression Analysis. Kanpur: Indian Institute of Technology Kanpur; 2012. pp. 1-12
  22. Qiu S, Li S, Wang F, Wen Y, Li Z, Li Z, et al. An energy exchange efficiency prediction approach based on multivariate polynomial regression for membrane-based air-to-air energy recovery ventilator core. Building and Environment. 2019;149:490-500. DOI: 10.1016/j.buildenv.2018.12.052
