Open access peer-reviewed chapter

Neural Networks in Nonlinear Time Series: A Subsampling Model Selection Procedure

Written By

Michele La Rocca and Cira Perna

Submitted: 23 June 2023 Reviewed: 02 August 2023 Published: 20 September 2023

DOI: 10.5772/intechopen.1002540

From the Edited Volume

Time Series Analysis - Recent Advances, New Perspectives and Applications

Jorge Rocha, Cláudia M. Viana and Sandra Oliveira


Abstract

In this chapter, the problem of model selection in neural networks for nonlinear time series data is addressed. A systematic review and an appraisal of previously published research on the topic are presented and discussed, with emphasis on a complete strategy for selecting the topology of the model. The procedure attempts to explain the black-box structure of a neural network by providing information on the complex structure of the relationship between a set of inputs and the output. It combines a set of graphical and inferential statistical tools: the number and the type of inputs, considered as explanatory variables, are chosen by using a formal test procedure based on relevance measures, while the hidden layer size is identified by looking at the predictive performance of the neural network model. To approximate the sampling distributions of the statistics involved, the approach relies heavily on subsampling, a computer-intensive statistical methodology. Results on simulated data show the good performance of the overall procedure.

Keywords

  • nonlinear time series
  • feed-forward neural networks
  • model selection
  • subsampling
  • statistical inference

1. Introduction

In recent decades, artificial neural networks have received considerable attention in the literature on time series analysis and forecasting, due to their flexibility in modeling complex nonlinear mappings between input variables and an output variable. They are particularly useful when the underlying relationships are complex and cannot be modeled through a known functional form.

In a time series context, single hidden layer feedforward neural networks are the most popular and widely used network paradigm in many applications [1], due to their clear advantages with respect to alternative non-parametric techniques. Firstly, this simple neural network structure allows handling short time series (typical of many application fields, such as economics, banking, and insurance), where the lack of information hardly justifies the use of more complex deep learning structures, such as long short-term memory or convolutional neural networks. Moreover, single hidden layer feedforward neural networks are able to approximate an unknown function of interest with any desired accuracy [2, 3]. Finally, they deliver good predictive accuracy without suffering the problems that arise when working with high-dimensional data, the so-called "curse of dimensionality", which affects other alternative non-parametric techniques [4].

Along with the great success of neural networks, there is also growing concern about their black-box nature, due to the lack of procedures, usually available in statistical and econometric frameworks, able to provide explanations of their inner workings and input-output mappings. The problem can be traced back to the identification of a proper network topology, which could improve the network's interpretability. When dealing with feedforward neural networks, the identification of an "optimal" topology amounts to determining the number and the type of inputs, the hidden layer size, and the activation function type.

As regards the activation function, many studies agree on the choice of a well-behaved function, that is, one that is bounded, monotonically increasing, and differentiable (such as a sigmoidal function). The problems of input selection and hidden layer size identification, on the other hand, were and still are much debated, and different solutions have been proposed.

The most widely used approaches are based on pruning (for a review see [5] and the references therein), regularization (see [6] and the recent survey in [7]), and stopped training rules (see [8]). Although these techniques can lead to good approximation results, they are based on criteria for reducing model complexity and do not frame the search for an appropriate model in a statistical perspective capable of providing information on the plausibility of the model. From this standpoint, methods for eliminating insignificant weights based on their asymptotic distributions have been proposed [9, 10]. However, focusing on single weights, which have no clear interpretation, gives no information either on the most "significant" variables or on the hidden neurons, information which is useful in any model-building strategy. Alternatively, statistical model selection methods based on information criteria have been used [11], with particular emphasis on the Akaike information criterion and the Bayesian (or Schwarz) information criterion. These criteria add a complexity penalty to the usual sample log-likelihood, and the model that optimizes this penalized log-likelihood is selected. However, these measures should be used carefully in a neural network context, since they could select over-parameterized models, with heavy consequences in terms of overfitting and poor ex-post forecast accuracy [12]. Moreover, they do not take model uncertainty into account, and they do not explicitly give any information about the significant variables of the model.

In our opinion, an effective model selection strategy should be placed in a broad statistical framework, relating it to the classic model-building approach. In a regression framework, a method to select a proper network topology has been proposed in [13]; it is based on a comparison of the out-of-sample predictive ability of several different models through appropriate testing procedures. The same strategy has been implemented in [14] to identify the hidden layer size.

Alternatively, the problem can be addressed by pointing out the different roles in the model of the input and hidden layer neurons. The former are the explanatory variables of the model; the latter are related to the complexity of the model and affect its forecast performance. In this perspective, a model selection strategy should highlight the role of the input nodes, useful for model identification and interpretation, and should treat the hidden layer size as a smoothing parameter capable of capturing complex nonlinear patterns in the data.

In this context, this chapter provides a systematic review and critical evaluation of an approach for selecting a neural network model for nonlinear time series proposed in [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. The procedure is a comprehensive strategy that combines a set of graphical, exploratory, and inferential statistical tools: the number and the type of inputs are selected by using a formal test procedure based on relevance measures, while the hidden layer size is identified by looking at appropriate measures of predictive risk.

The whole procedure makes extensive use of subsampling, a computer-intensive statistical methodology for estimating parameters of the sampling distribution of a statistic computed from a sample. In this context, it is used to approximate the sampling distributions of the test statistics involved in the selection of the input nodes and to estimate the predictive risk used to select the hidden layer size. The choice of this resampling technique is due to its validity for dependent data under quite general and weak assumptions [23]. Moreover, unlike the residual bootstrap, used in the context of neural networks in [24], it is robust against misspecified models, as neural network models typically are.

The chapter is organized as follows. In Section 2, we briefly illustrate the subsampling scheme for hypothesis testing. In Section 3, we describe the data-generating process and the employed neural network model. In Section 4, we present and discuss the procedure for neural network model selection, focusing on the inclusion of irrelevant variables, the omission of relevant ones, and the identification of the hidden layer size. In Section 5, we illustrate how the whole procedure can be implemented. In Section 6, we report an illustrative example on a simulated data set, in order to evaluate the performance of the proposed procedure. Some concluding remarks in Section 7 close the chapter.


2. The subsampling technique for time series data

In this approach, blocks of consecutive observations are taken from the original observed time series, in order to preserve its dependence structure. Each subseries of consecutive observations is considered as a valid "time sub-series" in its own right: each block is generated by the true underlying data-generating process, and so information on the sampling distribution of a given statistic can be gained by evaluating the statistic on all subseries.

Following [23], let $\{Y_t\}_{t \in \mathbb{Z}}$ be a stationary and mixing process governed by a probability law $P$, assumed to belong to a certain class of laws $\mathbf{P}$. The goal is to construct an asymptotically valid test for the null hypothesis $H_0: P \in \mathbf{P}_0$ versus the alternative $H_1: P \in \mathbf{P}_1$, with $\mathbf{P}_0 \cup \mathbf{P}_1 = \mathbf{P}$. Given the observed time series $Y_1, Y_2, \ldots, Y_T$, the test can be based on a test statistic such as

$$W_T = \delta_T\, w_T = \delta_T\, w_T(Y_1, Y_2, \ldots, Y_T) \tag{E1}$$

where $\delta_T$ is a fixed nonrandom normalizing sequence. Assume that there exists a constant $w(P)$ which satisfies $w(P) = 0$ under the null and $w(P) > 0$ under the alternative, and that $w_T \to w(P)$ in probability.

Let $G_T(x, P) = \Pr_P\{\delta_T\, w_T \le x\}$ be the cumulative distribution function of the sampling distribution of the test statistic, and assume that it converges in distribution to a given limit law, at least for $P \in \mathbf{P}_0$. Naturally, as long as $\delta_T \to \infty$, this implies that $w_T \to 0$ in probability for $P \in \mathbf{P}_0$.

The subsampling scheme works as in Algorithm 1.

Algorithm 1. Subsampling for hypothesis testing.

Require: Fix the subseries length $b$, with $b < T$.

 1: for $j = 1, \ldots, T - b + 1$ do

 2:  Determine $Y_{b,j} = (Y_j, Y_{j+1}, \ldots, Y_{j+b-1})$, the subseries of $b$ consecutive observations.

 3:  Evaluate the test statistic at the block of data $Y_{b,j}$, obtaining $w_{T,b,j}$.

 4: end for

 5: Approximate the sampling distribution of $W_T$ by

$$\hat{G}_{T,b}(x) = \frac{1}{T - b + 1} \sum_{j=1}^{T-b+1} I\big\{\delta_b\, w_{T,b,j} \le x\big\}$$

  where $I\{\cdot\}$ is the indicator function.

 6: Obtain the critical value for the test as the $(1-\alpha)$ quantile of $\hat{G}_{T,b}$, that is

$$g_{T,b}(1-\alpha) = \inf\big\{x : \hat{G}_{T,b}(x) \ge 1 - \alpha\big\}.$$

 7: Reject the null $H_0$ at the nominal level $\alpha$ if and only if

$$W_T > g_{T,b}(1-\alpha).$$

A subsampling p-value can be computed as

$$PV_{T,b} = \frac{1}{T - b + 1} \sum_{t=1}^{T-b+1} I\big\{\delta_b\, w_{T,b,t} \ge \delta_T\, w_T\big\}$$

In this case, the nominal level $\alpha$ test rejects the null if and only if $PV_{T,b} < \alpha$.
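To fix ideas, here is a minimal Python sketch of Algorithm 1 for a generic statistic. The names `stat_fn` (computing $w_T$) and `rate_fn` (the normalizing sequence $\delta_T$, set to $\sqrt{T}$ by default) are illustrative assumptions of this example, not part of the original scheme.

```python
import numpy as np

def subsampling_test(y, stat_fn, b, rate_fn=np.sqrt, alpha=0.05):
    """Algorithm 1: approximate the null distribution of W_T = delta_T * w_T
    by evaluating the statistic on all blocks of b consecutive observations."""
    T = len(y)
    W_T = rate_fn(T) * stat_fn(y)              # statistic on the full series
    # delta_b * w_{T,b,j} on every subseries (Y_j, ..., Y_{j+b-1})
    W_b = np.array([rate_fn(b) * stat_fn(y[j:j + b]) for j in range(T - b + 1)])
    crit = np.quantile(W_b, 1 - alpha)         # g_{T,b}(1 - alpha), line 6
    p_value = np.mean(W_b >= W_T)              # subsampling p-value PV_{T,b}
    return W_T, crit, p_value
```

The null is rejected when `W_T > crit` or, equivalently, when `p_value < alpha`.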

The subsampling method delivers consistent results under very general and minimal assumptions, valid for both linear and nonlinear processes. The scheme requires that $\delta_b/\delta_T \to 0$, $b/T \to 0$ and $b \to \infty$ as $T \to \infty$, together with the existence of a limiting law for the sampling distribution of the test statistic. The normalizing sequence $\delta_T$ could also be unknown, in which case it can be consistently estimated by a preliminary subsampling cycle [25]. Moreover, subsampling does not require any knowledge of the specific structure of the time series other than its stationarity and strong mixing property. As a consequence, it provides a robust alternative to bootstrap techniques, whose consistency may fail unless specific regularity conditions on the model hold. This property seems extremely suitable when using a neural network model which, by its own nature, can be considered a "misspecified" model, being an approximation of the underlying functional relationship [26]. Finally, subsampling can also be extended to heteroskedastic time series [27] and diverging statistics [28].

A major drawback of subsampling is that the block size $b$ must be chosen in applications. This parameter is related to the amount of dependence assumed in the series. If the blocks are too long, few of them are available and a poor estimate of the distribution of the statistic may be obtained, whereas blocks that are too short may not preserve the original dependence structure of the data. Note, however, that while for small and moderate sample sizes the block size can critically affect the subsampling accuracy, it seems not to be particularly critical for large sample sizes, since the asymptotic results remain valid for a broad range of values of $b$ [23].

Various methods for choosing the block size have been proposed. These are calibration, minimum volatility, interpolation, and extrapolation methods (see Chapter 10 in [23]). In the following, we focus on the use of the minimum volatility method, which works under very general conditions.

The procedure is illustrated in Algorithm 2.

Algorithm 2. Minimum volatility method for selecting the block size b in subsampling.

Require: Fix $\alpha$.

Require: Fix the values $b_{\mathrm{small}}$ and $b_{\mathrm{big}}$.

 1: for any integer $b \in [b_{\mathrm{small}}, b_{\mathrm{big}}]$ do

 2:  Compute the subsampling quantile $g_{T,b}(1-\alpha)$ using Algorithm 1.

 3:  Fix a small integer $k$.

 4:  Compute $VI_b$ as a measure of the variability of the values $\{g_{T,b-k}(1-\alpha), \ldots, g_{T,b+k}(1-\alpha)\}$.

 5: end for

 6: Pick the value $b^{*}$ corresponding to the smallest volatility, that is

$$b^{*} = \arg\min_b VI_b$$

The value $g_{T,b^{*}}(1-\alpha)$ can be used in line 7 of Algorithm 1 as the critical value of the test.

As a remark, note that the range of $b$ values, determined by $b_{\mathrm{small}}$ and $b_{\mathrm{big}}$, is not very important, as long as it is not too narrow. Moreover, the integer $k$, introduced in line 3 of Algorithm 2, seems not to affect the determination of $b^{*}$; it can be fixed at $k = 2$ or $k = 3$ [23]. Finally, in line 4 of Algorithm 2, the standard deviation or a robust alternative, such as the mean absolute deviation, can be employed as the measure of variability.
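A compact sketch of Algorithm 2 in the same conventions as the previous snippet; the truncation of the window at the boundaries of $[b_{\mathrm{small}}, b_{\mathrm{big}}]$ and the use of the standard deviation as the variability measure are pragmatic choices of this example.

```python
import numpy as np

def minimum_volatility_block(y, stat_fn, b_small, b_big, alpha=0.05, k=2,
                             rate_fn=np.sqrt):
    """Algorithm 2: choose the block size whose subsampling quantile
    g_{T,b}(1 - alpha) is most stable across neighbouring values of b."""
    T = len(y)
    bs = np.arange(b_small, b_big + 1)
    g = np.array([np.quantile([rate_fn(b) * stat_fn(y[j:j + b])
                               for j in range(T - b + 1)], 1 - alpha)
                  for b in bs])
    # VI_b: standard deviation of the quantiles in a window of half-width k
    vi = np.array([np.std(g[max(i - k, 0):i + k + 1]) for i in range(len(bs))])
    return int(bs[np.argmin(vi)])
```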


3. The DGP and the neural network model

Let $\{Y_t\}_{t \in \mathbb{Z}}$ be a process modeled as:

$$Y_t = g(X_t) + \varepsilon_t \tag{E2}$$

where $(Y_t, X_t)$ is a stationary, $\alpha$-mixing sequence and $X_t = (X_{1t}, \ldots, X_{dt})'$ is a vector of $d$ random variables, possibly including explanatory variables, lagged explanatory variables, and lagged values of $Y_t$. The unknown function $g$ is assumed to be continuously differentiable on a compact subset of $\mathbb{R}^d$. This data-generating process is very general and incorporates, as particular cases, regression models with lagged variables and dependent errors, as well as nonlinear autoregressive models for time series.

When it is not possible to postulate a parametric model for the function $g(\cdot)$, the use of non-parametric methods is required. In this context, neural networks have proved to be powerful tools, due to their great flexibility and their capability of providing a model which fits any data with an arbitrary degree of accuracy.

A feed-forward neural network $NN(d, r)$ used to approximate the function $g(\cdot)$ can be defined as

$$f(x_t, \theta) = \sum_{k=1}^{r} c_k\, \phi\!\left( \sum_{j=1}^{d} a_{kj} x_{jt} + a_{k0} \right) + c_0 \tag{E3}$$

where $x_t = (x_{1t}, \ldots, x_{dt})'$ is the vector of the $d$ input variables; $a_{kj}$ is the weight of the connection between the $j$-th input neuron and the $k$-th neuron in the hidden layer; $c_k$, $k = 1, \ldots, r$, is the weight of the link between the $k$-th neuron in the hidden layer and the output; $a_{k0}$ and $c_0$ are, respectively, the bias terms of the hidden neurons and of the output; $\phi(\cdot)$ is the activation function of the hidden layer. We define $\theta = (c_0, c_1, \ldots, c_r, \mathbf{a}_1', \ldots, \mathbf{a}_r')'$, where $\mathbf{a}_k = (a_{k0}, a_{k1}, \ldots, a_{kd})'$, so that $\theta \in \Theta \subset \mathbb{R}^{r(d+2)+1}$.

As usual in the neural network literature, a sigmoidal function, such as the logistic or hyperbolic tangent, is assumed for the hidden layer. In this case, a single-layer neural network can arbitrarily closely approximate the unknown function as well as its derivatives, up to a given order (provided that they exist), as measured by a proper norm [29].

Moreover, it is possible to use an approximation result [30] according to which, if $g(\cdot)$ is differentiable on a compact set, feedforward networks with one layer of sigmoidal nonlinearities achieve an integrated squared error of order $O(1/r)$, where $r$ is the number of hidden nodes. That is, there exists a parameter vector $\theta$ such that:

$$\big\| g(x) - f(x, \theta) \big\|^2 \le \frac{C_g^2}{r} \tag{E4}$$

where $C_g > 0$ is a properly chosen constant.

Once the neural network topology has been fixed, the network weights have to be estimated (learning). If $\{z_i = (Y_i, x_i')',\ i = 1, 2, \ldots, T\}$ is a random sample of size $T$, the estimate of the parameter vector $\theta$ can be obtained as:

$$\hat{\theta}_T = \arg\min_{\theta \in \Theta} \sum_{i=1}^{T} q\big(Y_i, f(x_i, \theta)\big) \tag{E5}$$

where $\Theta$ is the parameter space and $q$ is a properly chosen loss function.

Under general regularity conditions, if $\pi$ denotes the joint distribution of the random vector $z = (Y, x')'$, the vector $\hat{\theta}_T$ exists and converges almost surely to $\theta_0$, given by:

$$\theta_0 = \arg\min_{\theta \in \Theta} \int q\big(y, f(x, \theta)\big)\, d\pi(z), \tag{E6}$$

provided that the integral exists and the optimization problem has a unique solution vector interior to $\Theta$. The latter condition is required because many distinct weight vectors may otherwise produce identical network outputs, which can be a serious challenge when dealing with this type of model. However, sufficient conditions to guarantee the uniqueness of $\theta_0$ in a suitable parameter space $\Theta$ have been proposed in the specialized literature [31] and are generally assumed.
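As a concrete illustration of Eqs. (E3) and (E5), the following Python sketch implements the network $NN(d, r)$ with $\phi = \tanh$ and estimates $\theta$ by nonlinear least squares, i.e. $q(u, v) = (u - v)^2$. The parameter packing and the use of a general-purpose BFGS optimizer are conveniences of this example; in practice, specialized training algorithms with multiple random restarts are advisable, given the multimodality of the loss surface.

```python
import numpy as np
from scipy.optimize import minimize

def nn_forward(X, theta, d, r):
    """Single hidden layer network of Eq. (E3) with phi = tanh; rows of X are
    observations. theta packs c0, (c_1, ..., c_r), then the r vectors
    a_k = (a_k0, a_k1, ..., a_kd), for r(d + 2) + 1 parameters in total."""
    c0, c = theta[0], theta[1:r + 1]
    A = theta[r + 1:].reshape(r, d + 1)       # row k: (a_k0, a_k1, ..., a_kd)
    h = np.tanh(A[:, 1:] @ X.T + A[:, [0]])   # r x T hidden activations
    return c0 + c @ h                         # network output, length T

def fit_nn(y, X, r, seed=0):
    """Nonlinear least squares estimate of Eq. (E5)."""
    T, d = X.shape
    rng = np.random.default_rng(seed)
    theta0 = rng.normal(scale=0.5, size=r * (d + 2) + 1)
    loss = lambda th: np.mean((y - nn_forward(X, th, d, r)) ** 2)
    return minimize(loss, theta0, method="BFGS").x
```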

However, even if neural networks enjoy numerous properties, including that of being universal approximators, and many efficient estimation algorithms are available in the specialized literature, they face a challenging issue, related to the lack of consolidated techniques capable of interpreting the model.

The over-parameterized nature of these models, with the number of parameters often exceeding the size of the training dataset, together with the lack of information on the relationship between the inputs and the output, which gives no insight into the structure of the function being approximated, turns neural networks into black boxes and greatly limits their usefulness. These problems require making such systems explainable and interpretable, transforming the black box into a white box and allowing for a clear explanation of the underlying relationships captured by the neural network models.

In our opinion, a possible solution is to address the choice of a suitable network topology within the classical model selection framework. In this context, in the following, we review a strategy proposed in [13, 14, 15, 16, 17, 18, 19, 20, 21] for regression models and extend it to the case of time series data.


4. Network selection procedure

The procedure highlights the different roles of the input neurons with respect to the hidden ones. The former correspond to the explanatory variables of the model and, as a consequence, are important for model identification and interpretation; the hidden layer size, instead, can be considered as a smoothing parameter that accounts for the trade-off between estimation bias and variability.

Following this standpoint, the problem of identifying the number and the type of inputs can be framed in a statistical and econometric perspective, using a formal test procedure to evaluate the inclusion of irrelevant variables or the omission of relevant ones. This approach has been proposed in [32, 33] for the iid case and extended in [14] to dependent data. The hidden layer size, on the contrary, can be determined by looking at measures of the predictive performance of the model.

4.1 The selection of the input: inclusion of irrelevant variables

Removing irrelevant variables is a crucial part of the model-building process, impacting the stability and the complexity of the final model. The most obvious reason is that some variables may not be related to the outcome and should be removed following the parsimony principle: this improves the interpretability of the final model and, by reducing the variance, yields a model with better predictive ability.

The approach we focus on is based on relevance measures for the input variables. In order to remove irrelevant variables, we use a stepwise selection rule which involves the definition of a measure of a variable's relevance to the model, the estimation of the sampling distribution of that relevance measure, and a testing procedure to verify the hypothesis that the variable is irrelevant.

Following [15, 16, 34], let $f_i(x, \theta)$ denote the partial derivative of $f(x, \theta)$ with respect to $x_i$; that is:

$$f_i(x, \theta) = \frac{\partial f(x, \theta)}{\partial x_i}. \tag{E7}$$

A measure of the relevance of the variable $X_i$ to the model can be based on the expectation of some function of the derivatives of the neural network. Therefore, it can be defined as:

$$RM_i(\theta) = E\big[h\big(f_i(X_t, \theta)\big)\big] \tag{E8}$$

where $h$ is a properly chosen function and $E$ is the expected value with respect to the probability measure of the vector of the explanatory variables. The measures proposed in [33] can be derived by choosing, for example, the average derivative ($h(x) = x$), the absolute average derivative ($h(x) = |x|$), the square average derivative ($h(x) = x^2$), or the maximum and minimum derivatives ($h(x) = \max(x)$ and $h(x) = \min(x)$).

The logic of this proposal is the same as in linear models, where the relevance of a variable is measured by its coefficient, which is also the magnitude of the partial derivative of the dependent variable with respect to that variable. In the nonlinear case, however, the partial derivative is not a constant but varies over the range of the independent variables, and the relevance measure takes this into account.

To select the set of variables to be tested as irrelevant, simple graphical exploratory tools can be used, based on plots of the derivatives and of the relevance measures for each single lag. Values of the derivatives and of the relevance measures close to zero designate the corresponding lag as a candidate for the set of irrelevant variables.
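For the tanh network of the Section 3 sketch, the partial derivatives in Eq. (E7) are available in closed form, so both the plots just described and the sample counterparts of the relevance measures in Eq. (E8) can be produced directly; the parameter packing below is the one assumed in that earlier snippet.

```python
import numpy as np

def nn_partials(X, theta, d, r):
    """Matrix of partial derivatives f_i(x, theta), Eq. (E7), for the tanh
    network: df/dx_i = sum_k c_k (1 - tanh(z_k)^2) a_ki, one row per observation."""
    c = theta[1:r + 1]
    A = theta[r + 1:].reshape(r, d + 1)
    z = A[:, 1:] @ X.T + A[:, [0]]             # r x T pre-activations
    gprime = 1.0 - np.tanh(z) ** 2             # derivative of tanh
    return (c[:, None] * gprime).T @ A[:, 1:]  # T x d matrix of derivatives

def relevance_measures(D):
    """Sample analogues of RM_i(theta), Eq. (E8), for common choices of h."""
    return {"average":  D.mean(axis=0),           # h(x) = x
            "absolute": np.abs(D).mean(axis=0),   # h(x) = |x|
            "square":   (D ** 2).mean(axis=0),    # h(x) = x^2
            "max":      D.max(axis=0),            # h(x) = max
            "min":      D.min(axis=0)}            # h(x) = min
```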

A formal test can be implemented as follows.

Let $X_0 = \{x_i,\ i \in I_0\}$ be the set of variables to be tested as irrelevant to the model, and let:

$$m_i(x, \theta) = h\big(f_i(x, \theta)\big) \tag{E9}$$

The hypothesis that the variables in $X_0$ are not relevant can be written as:

$$H_0: m = \sum_{i \in I_0} E\big[m_i(X_t, \theta)\big] = 0 \tag{E10}$$

The null $H_0$ can be tested by using the statistic

$$\hat{m}_T = T^{-1} \sum_{i \in I_0} \sum_{t=1}^{T} m_i(X_t, \hat{\theta}_T) = T^{-1} \sum_{t=1}^{T} m(X_t, \hat{\theta}_T) \tag{E11}$$

where $m(X_t, \hat{\theta}_T) = \sum_{i \in I_0} m_i(X_t, \hat{\theta}_T)$ and $\hat{\theta}_T$ is a consistent estimator of the unknown parameter vector $\theta$.

The asymptotic distribution of the test statistic is not one of the familiar tabulated distributions and is very difficult to deal with. It can be approximated by using the subsampling scheme of Algorithm 1, in which, in line 5, the statistic is

$$W_T = \delta_T\, w_T = T\, \hat{m}_T \tag{E12}$$

The effectiveness of this procedure has been studied in [15] where, under quite general assumptions, the consistency of the proposed subsampling test procedure has been formally proved and its good finite sample properties have been assessed by simulation.

4.2 The selection of the input: omission of relevant variables

Leaving out one or more relevant variables is another crucial issue in model building: the resulting omitted-variable bias causes the model to attribute the effect of the missing variables to those that have been included.

In order to verify whether there are any omitted variables, a procedure proposed in [33] for the iid case and extended to regression models in [20] can be used. It is based on the comparison of competing neural network models.

Suppose that $f_1(x, \theta_1)$ and $f_2(x, \theta_2)$ are two competing neural network models which differ only in the inputs and are, therefore, nested. The idea is that, if there are no omitted variables, the network $f_1$ is capable of producing an output identical to that of the network $f_2$.

The omission of relevant variables can be verified by using a discrepancy measure between the outputs of the two competing neural network models $f_1(x, \theta_1)$ and $f_2(x, \theta_2)$, defined as:

$$m = E\Big[\big(f_1(x, \theta_1) - f_2(x, \theta_2)\big)^2\Big] \tag{E13}$$

Therefore, the hypothesis that the two models are equivalent, and so that there are no omitted variables, can be written as:

$$H_0: m = 0 \tag{E14}$$

and it can be tested by using the statistic

$$\hat{m}_T = \frac{1}{T} \sum_{t=1}^{T} \big(f_1(X_t, \hat{\theta}_{T,1}) - f_2(X_t, \hat{\theta}_{T,2})\big)^2 \tag{E15}$$

where $\hat{\theta}_{T,1}$ and $\hat{\theta}_{T,2}$ are consistent estimators of $\theta_1$ and $\theta_2$, respectively. Again, the distribution of the test statistic can be consistently estimated by using the subsampling scheme of Algorithm 1, in which, in line 5, the statistic is

$$W_T = \delta_T\, w_T = T\, \hat{m}_T \tag{E16}$$

The validity of this subsampling test procedure can easily be proved along the same lines as in [15].
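The statistic in Eq. (E15) is straightforward to compute once the two nested networks have been fitted. The sketch below reuses `fit_nn` and `nn_forward` from the Section 3 snippet; the column-index arguments are an assumption of this example, and the null distribution of $T\,\hat{m}_T$ would again be approximated via the subsampling routine of Section 2.

```python
import numpy as np

def omitted_variable_stat(y, X, cols1, cols2, r):
    """Discrepancy statistic m_hat_T of Eq. (E15) for two nested networks
    that differ only in their input sets (cols1 a subset of cols2)."""
    X1, X2 = X[:, cols1], X[:, cols2]
    th1, th2 = fit_nn(y, X1, r), fit_nn(y, X2, r)
    f1 = nn_forward(X1, th1, X1.shape[1], r)   # fitted outputs, model 1
    f2 = nn_forward(X2, th2, X2.shape[1], r)   # fitted outputs, model 2
    return np.mean((f1 - f2) ** 2)
```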

4.3 The selection of the hidden layer size

The selection of the hidden layer size plays an important role in neural network modeling. It is related to the complexity of the model and, in our opinion, it can be considered as a smoothing parameter governing the trade-off between estimation bias and variability. In particular, if too few hidden neurons are chosen, underfitting can occur, with serious consequences for the ability of the network to approximate the unknown target function; in this case, the neural network model will perform poorly even on the training data. On the contrary, over-parameterization of the network produces overfitting, capturing also the noise present in the data: the neural network model will approximate well the data used for parameter estimation, but it will have reduced ex-post forecast accuracy.

However, determining r in such a way that the neural network model balances good fitting properties and simultaneously good predictive performances is a difficult task.

In the past, practitioners generally determined this parameter by trial and error and, in this context, some rules of thumb have been suggested, based on the assumption that the hidden layer size is directly proportional to the number of inputs, for a suitable choice of the proportionality constant (see [1]). Hybrid methods have also been proposed, combining pruning, growing, or empirical formulas with other tools such as genetic algorithms (see [35]) or the singular value decomposition [36]. Being "data dependent" criteria, none of these methods has the theoretical rigor to deliver optimal, or at least near-optimal, solutions in the context of time series analysis.

More recently, since the number of hidden neurons affects forecast performance, methods have been proposed that take the forecast ability of the model into account. In this perspective, statistical measures such as the Akaike and Bayesian information criteria can be implemented and, besides them, it has become common practice to use cross-validation techniques. Such methods, in their classical formulations, are also used in the literature to evaluate autoregressions, but their application is not straightforward because of the inherent serial correlation and potential non-stationarity of the data. Some extensions to dependent data have been proposed (see [37, 38]) but, again, no inferential procedures are available to discriminate among the alternative models. In this context, we use a statistical procedure proposed in [39] in a general setup and implemented in [20] for nonlinear regression functions approximated by neural networks. It is based on a measure of predictive risk estimated by subsampling.

Let $\hat{Y}_{T+1}$ be the one-step-ahead predictor of $Y_{T+1}$ obtained by using a neural network model $NN(d, r)$, and suppose that the input layer size $d$ has been fixed. We propose as a measure of predictive risk the quantity:

$$\Delta_T(r) = E\Big[\big(\hat{Y}_{T+1} - Y_{T+1}\big)^2\Big] \tag{E17}$$

which, by construction, depends only on $r$. Therefore, it is reasonable to determine $r$ as the value $\hat{r}$ such that:

$$\hat{r} = \arg\min_r \Delta_T(r) \tag{E18}$$

Since $\Delta_T(r)$ is unknown, an estimate is needed; it can be obtained by using subsampling.

The procedure is implemented in Algorithm 3.

Algorithm 3. Predictive risk estimation of the hidden layer size by subsampling.

Require: Fix the value $r_{\max}$.

Require: Fix the subseries length $b$.

 1: for $r = 1, \ldots, r_{\max}$ do

 2:  for $j = 1, \ldots, T - b$ do

 3:   Determine $Y_{b,j} = (Y_j, \ldots, Y_{j+b-1})'$ and $X_{b,j} = (X_j, \ldots, X_{j+b-1})'$, the subseries of $b$ consecutive observations.

 4:   Estimate the neural network parameter vector $\hat{\theta}_{T,b,j}$ by using $Y_{b,j}$ and $X_{b,j}$.

 5:   Obtain the one-step-ahead predictor $\hat{Y}_{j+b|j}$ by using the model with parameters $\hat{\theta}_{T,b,j}$.

 6:  end for

 7:  Get the subsampling estimate as

$$\hat{\Delta}_{T,b}(r) = \frac{1}{T - b} \sum_{j=1}^{T-b} \big(\hat{Y}_{j+b|j} - Y_{j+b}\big)^2$$

 8: end for

 9: Pick the value $\hat{r}$ corresponding to the smallest value of $\hat{\Delta}_{T,b}(r)$, that is:

$$\hat{r} = \arg\min_r \hat{\Delta}_{T,b}(r).$$
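A direct transcription of Algorithm 3 in the conventions of the earlier sketches; `fit_nn` and `nn_forward` come from the Section 3 snippet, and the layout of `X` (row $t$ holding the input vector for $Y_t$) is an assumption of the example.

```python
import numpy as np

def select_hidden_size(y, X, b, r_max):
    """Algorithm 3: subsampling estimate of the one-step-ahead predictive
    risk for r = 1, ..., r_max; returns the minimizer and the risk profile."""
    T = len(y)
    risk = np.empty(r_max)
    for r in range(1, r_max + 1):
        errors = []
        for j in range(T - b):                      # blocks (Y_j, ..., Y_{j+b-1})
            theta = fit_nn(y[j:j + b], X[j:j + b], r)
            y_hat = nn_forward(X[[j + b]], theta, X.shape[1], r)[0]
            errors.append((y_hat - y[j + b]) ** 2)  # predict the next value
        risk[r - 1] = np.mean(errors)
    return int(np.argmin(risk)) + 1, risk
```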

5. The three-step procedure for neural network topology selection

In this section, we illustrate how to implement the whole procedure for neural network model selection. In particular, we consider the case of a time series $Y_t$, $t = 1, \ldots, T$, which follows a nonlinear autoregressive model of order $p$, NAR($p$):

$$Y_t = g(Y_{t-1}, \ldots, Y_{t-p}) + \varepsilon_t \tag{E19}$$

where $g$ is a possibly nonlinear function and the innovations $\varepsilon_t$ are distributed as standard normal.

The starting point of the analysis is the implementation of nonlinearity tests on the original time series. Although neural networks are capable of approximating both linear and nonlinear functions, in the linear case it is preferable to use classic AR($p$) models: they are much less complex and come with well-established and consolidated model selection procedures. The rejection of linearity, together with the frequent impossibility of assuming a particular nonlinear functional form for $g(\cdot)$, justifies the use of a neural network model, whose topology has to be suitably selected.

In Algorithm 4, we present the procedure which synthesizes the relevant steps to get the best neural network approximation.

Algorithm 4. Three step procedure for model selection.

1. Select the relevant variables

 1.1 Fix $r = 1$ and $d = d^{*}$, where $d^{*}$ is a properly chosen maximum number of lags.

 1.2 Estimate the model $NN(d^{*}, 1)$.

 1.3 Plot the derivatives and the relevance measures to identify a candidate set $I_0$ of irrelevant variables.

 1.4 Test if the set of variables $I_0$ is irrelevant (by using the procedure in Section 4.1).

 1.5 Determine the 'optimal' value $\hat{d}$.

2. Select the hidden layer size

 2.1 Fix $r = r^{*}$, the maximum number of hidden units.

 2.2 Estimate the models $NN(\hat{d}, 1), \ldots, NN(\hat{d}, r^{*})$.

 2.3 Compute the predictive risk for each model (by using the procedure in Section 4.3).

 2.4 Choose $NN(\hat{d}, \hat{r})$ such that the predictive risk is minimum.

3. Check for omitted variables

 3.1 Identify a set of possibly omitted variables.

 3.2 Estimate the new neural network model including the new set of variables.

 3.3 Test if the two models give the same output (by using the procedure in Section 4.2).

 3.4 If the two models are equivalent, choose the most parsimonious one.

The first step concerns the selection of the relevant variables which, in the case of a NAR($p$) model, becomes the choice of the relevant lags. The procedure illustrated in Section 4.1 is applied, with an appropriate minor modification to take into account the autoregressive structure of the explanatory variables. In this step, the number of hidden neurons is fixed at one. This choice is not particularly influential: as we will show empirically in the illustrative application in Section 6, the choice of an optimal value $\hat{d}$ seems not to be very sensitive to the value fixed for $r$.

The second step deals with the choice of a value for the hidden layer size. As stated in Section 4.3, we use the predictive risk estimated by subsampling.

The third step covers the problem of the omission of relevant variables. In this case, a comparison among the alternative models, presented in Section 4.2, is implemented. By the parsimony criterion, if two models exhibit the same performance, in terms of output, the one with fewer parameters is preferable.

Finally, the tests for neglected nonlinearity on the residuals from the identified optimal model, along with tests for normality, have to be performed to verify that the nonlinear structure of the data has been correctly modeled.
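Putting the pieces together, a skeleton of Algorithm 4 might look as follows. It reuses the helpers sketched in the previous sections, and the hard-coded retained lags in step 1 are a placeholder standing in for the outcome of the formal relevance test of Section 4.1, which must be run in any real application.

```python
import numpy as np

def lag_matrix(y, d):
    """Design matrix for a NAR(d) model: row t holds (Y_{t-1}, ..., Y_{t-d})."""
    T = len(y)
    X = np.column_stack([y[d - 1 - i:T - 1 - i] for i in range(d)])
    return y[d:], X

def three_step_selection(y, d_star, r_star, b):
    """Skeleton of the three-step procedure of Algorithm 4."""
    yt, X = lag_matrix(y, d_star)
    # Step 1: estimate NN(d*, 1) and screen the lags via relevance measures
    theta = fit_nn(yt, X, r=1)
    rm = relevance_measures(nn_partials(X, theta, d_star, 1))
    print("squared relevance by lag:", rm["square"])
    keep = [0, 1]          # placeholder: lags retained after the formal test
    # Step 2: hidden layer size by subsampled predictive risk
    r_hat, _ = select_hidden_size(yt, X[:, keep], b, r_star)
    # Step 3: compare with larger input sets via omitted_variable_stat(...)
    return [i + 1 for i in keep], r_hat
```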


6. An illustrative example on simulated data

To illustrate how the proposed model selection procedure works, the results of an illustrative example on simulated data are reported. The experimental setup is based on a dataset generated by an exponential autoregressive model of order 2, EXPAR(2), defined as:

$$Y_t = \big(0.5 + 0.9\,e^{-Y_{t-1}^2}\big)\, Y_{t-1} - \big(0.8 - 1.8\,e^{-Y_{t-1}^2}\big)\, Y_{t-2} + \varepsilon_t \tag{E20}$$

where the innovations $\varepsilon_t$ are distributed as standard normal.

The choice of this nonlinear model is due to its great flexibility, which allows the generation of very different time series structures. Moreover, since the skeleton of the model is defined by a continuously differentiable function, the universal approximation theorem in [30] applies. In addition, the EXPAR(2) process is geometrically ergodic, and hence stochastically stable, and it is also strongly mixing with geometrically decreasing mixing coefficients [40].
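For readers wishing to reproduce the exercise, the process in Eq. (E20) can be simulated as in the following sketch; the burn-in length and the seed are arbitrary choices of this example.

```python
import numpy as np

def simulate_expar2(T, burn=200, seed=1):
    """Simulate the EXPAR(2) model of Eq. (E20) with standard normal
    innovations, discarding an initial burn-in stretch."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T + burn)
    eps = rng.standard_normal(T + burn)
    for t in range(2, T + burn):
        e = np.exp(-y[t - 1] ** 2)
        y[t] = ((0.5 + 0.9 * e) * y[t - 1]
                - (0.8 - 1.8 * e) * y[t - 2] + eps[t])
    return y[burn:]
```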

To capture the nonlinear dynamical structure of the data, we approximate the data-generating process by a properly chosen neural network model $NN(d, r)$, whose parameters are determined by applying the procedure reported in Algorithm 4.

Note that, although the time series is nonlinear by construction, linearity tests have still been implemented. In particular, here we refer to the Teräsvirta [41] and White [42] tests. Both seem well suited to our context, since they assume linearity in mean under the null and use neural networks to derive an appropriate test statistic. Moreover, the classical Jarque-Bera test for normality has also been implemented, in order to verify whether the observations can be considered a realization of a Gaussian process. The results are shown in the "Original data" columns of Table 1, where it is evident that, as expected, all three tests reject the null hypothesis.

Test           Original data           Residuals
               Statistic   p-value     Statistic   p-value
Teräsvirta     68.0038     0.0000      11.5667     0.1157
White          73.1060     0.0000      1.9058      0.3856
Jarque-Bera    84.4625     0.0000      1.1218      0.5707

Table 1.

Teräsvirta and White neural network tests for neglected nonlinearity and Jarque-Bera test for normality, on the original data and on the residuals from the "optimal" estimated neural network model $NN(2,5)$.

To select the set of variables to be tested as irrelevant, we have specified, in line 1.1 of Algorithm 4, the value $d^{*} = 6$, thus starting with an initial tentative model $NN(6, 1)$. The plots of the derivatives for each single lag, reported in Figure 1, show that lags 1 and 2 can be identified as possible relevant variables.

Figure 1.

Plots of the derivatives for different lags, from the neural network model $NN(6, 1)$.

This result seems to be confirmed by Figure 2, where we report the values of the relevance measure.

Figure 2.

Plots of the values of the relevance measure for different lags and different hidden layer sizes.

It is worth observing that the identification of the relevant lags is not influenced by the number of units in the hidden layer. The six plots in Figure 2, where we have considered $r = 1, \ldots, 6$, again show a clear cut between the first two lags and the others.

To confirm the exploratory identification of the first two lags as relevant, we have implemented the formal statistical test illustrated in Section 4.1, in which the subsampling has been used with an optimal block length identified as in Algorithm 2. By looking at Figure 3, in which the volatility indexes $VI_b$ are reported, a value of $b$ equal to 180 appears reasonable.

Figure 3.

Values of the VI measure defined in Algorithm 2 for different values of the block length.

The results of the identification procedure are reported in Table 2. Analyzing the p-values of the tests, estimated by subsampling, we identify a neural network model with the first two lags as relevant. Again, the decision about the number and type of relevant lags remains stable for different values of the hidden layer size.

r   I_0              {1}       {2}       {3}       {4}       {5}       {6}
1   Test statistic   243.02    299.98    0.2859    1.8345    0.0842    0.0931
    p-value          0.0000    0.0000    0.7698    0.2692    0.8453    0.6760
2   Test statistic   421.28    316.86    1.6937    6.1157    2.6152    0.1861
    p-value          0.0000    0.0000    0.7393    0.2521    0.4604    0.9367
3   Test statistic   698.24    413.57    12.1104   2.6203    1.2834    1.0792
    p-value          0.0000    0.0000    0.0804    0.6236    0.7540    0.7333
4   Test statistic   781.28    446.56    12.9795   3.9536    1.0705    1.3363
    p-value          0.0000    0.0000    0.4994    0.7284    0.9622    0.9050
5   Test statistic   728.02    444.57    16.3791   3.7655    3.9889    2.8014
    p-value          0.0000    0.0000    0.4848    0.8940    0.8758    0.8149
6   Test statistic   790.27    429.05    16.7634   10.2584   7.3156    10.9096
    p-value          0.0000    0.0000    0.6821    0.6833    0.8161    0.7150

Table 2.

Values of the test statistic and corresponding p-values for different variable sets $X_0$ (subseries length $b = 180$).

We can now proceed to select a proper hidden layer size by using the predictive risk, measured by the mean square error (MSE) of prediction. In Figure 4, we report the distributions of the MSE estimated by subsampling, for different values of the subseries length and for hidden layer sizes ranging from 1 to 8.

Figure 4.

Boxplots of the distributions of the predictive accuracy measure for different values of the subseries length. Each distribution refers to a given hidden layer size.

Clearly, there is no improvement in performance when using more than 5 neurons in the hidden layer, so the identified optimal model is $NN(2, 5)$.

To evaluate whether the model selection strategy has correctly identified a neural network model for the data, the Teräsvirta and White neural network tests for neglected nonlinearity on the residuals from the identified optimal model, along with the Jarque-Bera test for normality, are reported in Table 1. Clearly, the tests do not reject the null, and so the nonlinear structure of the data seems to be correctly modeled. Moreover, the residuals can be considered as the realization of a Gaussian process.

To check whether there are omitted variables in the identified model, we also compared neural network models with different lag structures. The results are reported in Table 3. Clearly, neural network models with more than two neurons in the input layer appear equivalent to the chosen one; therefore, there are no omitted lags. Observe that, on the contrary, if we had used a network including just one input neuron (corresponding to the first lag), there would have been better models obtained by including more lags.

Model      NN(1,5)            NN(2,5)
NN(2,5)    5.1924 (0.0000)
NN(3,5)    5.2106 (0.0000)    0.0617 (0.3801)
NN(4,5)    5.0483 (0.0000)    0.0724 (0.8478)
NN(5,5)    5.2799 (0.0000)    0.1349 (0.6529)
NN(6,5)    5.2909 (0.0000)    0.1599 (0.8977)

Table 3.

Values of the test statistics, with p-values (estimated by subsampling) in parentheses, for the hypotheses of equivalence of neural network models.


7. Concluding remarks

In this chapter, a neural network model selection strategy has been reviewed and discussed. It highlights the different roles of the input variables and of the hidden layer size. The former are selected using testing procedures based on the definition of a measure of a variable's relevance to the model, which makes it possible to evaluate both the inclusion of irrelevant variables and the omission of relevant ones. The hidden layer size is determined by using a measure of predictive risk. The whole procedure uses subsampling to obtain the distributions of the statistics involved. The procedure has been applied to an illustrative example on simulated data, and it appears able to correctly detect the set of input variables and a proper hidden layer size. Clearly, the joint use of neural network models and subsampling is computationally demanding; in any case, once the subseries length is fixed, the proposed procedure takes just a few seconds on a desktop PC.

References

  1. Zhang G, Patuwo BE, Hu MY. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting. 1998;14(1):35-62
  2. Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems. 1989;2(4):303-314
  3. Hornik K. Some new results on neural network approximation. Neural Networks. 1993;6(8):1069-1072
  4. Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing. 2017;14(5):503-519
  5. Blalock D, Gonzalez Ortiz J, Frankle J, Guttag J. What is the state of neural network pruning? Proceedings of Machine Learning and Systems. 2020;2:129-146
  6. Reed R. Pruning algorithms-a survey. IEEE Transactions on Neural Networks. 1993;4(5):740-747
  7. Tessier H. Neural network pruning 101: All you need to know not to get lost. Towards Data Science. 2021. Available from: https://towardsdatascience.com/neural-network-pruning-101-af816aaea61
  8. Prechelt L. Early stopping-but when? In: Orr GB, Müller K-R, editors. Neural Networks: Tricks of the Trade. Berlin, Heidelberg: Springer; 2002. pp. 55-69
  9. Cottrell M, Girard B, Girard Y, Mangeas M, Muller C. Neural modeling for time series: A statistical stepwise method for weight elimination. IEEE Transactions on Neural Networks. 1995;6(6):1355-1364
  10. Anders U, Korn O. Model selection in neural networks. Neural Networks. 1999;12:309-323
  11. Panchal G, Ganatra A, Kosta YP, Panchal D. Searching most efficient neural network architecture using Akaike's information criterion (AIC). International Journal of Computer Applications. 2010;1(5):41-44
  12. Qi M, Zhang G. An investigation of model selection criteria for neural network time series forecasting. European Journal of Operational Research. 2001;132(3):666-680
  13. La Rocca M, Perna C. Resampling techniques and neural networks: Some recent developments for model selection. In: Atti della XLIII Riunione Scientifica SIS; 14–16 Giugno 2006. Torino (IT); 2006. pp. 14-16
  14. La Rocca M, Perna C. Modelling complex structures by artificial neural networks. In: Proceedings of Knowledge Extraction and Modelling (KNEMO06), IASC INTERFACE IFCS Workshop; 4–6 September 2003. Capri (IT); 2003. pp. 4-6
  15. La Rocca M, Perna C. Variable selection in neural network regression models with dependent data: A subsampling approach. Computational Statistics & Data Analysis. 2005;48(2):415-429
  16. La Rocca M, Perna C. Neural network modeling by subsampling. In: Cabestany J, Prieto A, Sandoval F, editors. Computational Intelligence and Bioinspired Systems. Berlin, Heidelberg: Springer; 2005. pp. 200-207
  17. La Rocca M, Perna C. A multiple testing procedure for input variable selection in neural networks. In: Proceedings of the 13th European Symposium on Artificial Neural Networks (ESANN 2005); 27–29 April 2005. Bruges, Belgium; 2005. pp. 173-178
  18. La Rocca M, Perna C. Neural network modelling with applications to euro exchange rates. In: Kontoghiorghes E, Rustem B, Winker P, editors. Computational Methods in Financial Engineering: Essays in Honour of Manfred Gilli. Berlin, Heidelberg: Springer; 2008. pp. 163-189
  19. La Rocca M, Perna C. A two-step procedure for neural network modeling. Quaderni di Statistica. 2012;14:145-148
  20. La Rocca M, Perna C. Designing neural networks for modeling biological data: A statistical perspective. Mathematical Biosciences & Engineering. 2013;11(2):331-342
  21. La Rocca M, Perna C. Model selection for neural network models: A statistical perspective. In: Dehmer M, Emmert-Streib F, Pickl S, editors. Computational Network Theory: Theoretical Foundations and Applications. New Jersey: Wiley-Blackwell; 2015. pp. 1-27
  22. La Rocca M, Perna C. Opening the black box: Bootstrapping sensitivity measures in neural networks for interpretable machine learning. Stat. 2022;5(2):1-18
  23. Politis DN, Romano JP, Wolf M. Subsampling. New York: Springer Science & Business Media; 1999. p. 347
  24. Fildes R. A new bootstrapped hybrid artificial neural network approach for time series forecasting. Computational Economics. 2020;2020:1355-1383
  25. Bertail P, Politis DN, Romano JP. On subsampling estimators with unknown rate of convergence. Journal of the American Statistical Association. 1999;94(446):569-579
  26. White H. Estimation, Inference and Specification Analysis. Cambridge: Cambridge University Press; 1996. p. 380
  27. Politis DN, Romano JP, Wolf M. Subsampling for heteroskedastic time series. Journal of Econometrics. 1997;81(2):281-317
  28. Bertail P, Haefke C, Politis DN, White H. Subsampling the distribution of diverging statistics with applications to finance. Journal of Econometrics. 2004;120(2):295-326
  29. Hornik K, Stinchcombe M, White H, Auer P. Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives. Neural Computation. 1994;6(6):1262-1275
  30. Barron AR. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory. 1993;39(3):930-945
  31. Ossen A, Rüger SM. An analysis of the metric structure of the weight space of feedforward networks and its application to time series modeling and prediction. In: Proceedings of the 4th European Symposium on Artificial Neural Networks (ESANN 1996); 24–26 April 1996. Bruges, Belgium; 1996. pp. 315-322
  32. Baxt WG, White H. Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction. Neural Computation. 1995;7(3):624-638
  33. White H, Racine J. Statistical inference, the bootstrap, and neural-network modeling with application to foreign exchange rates. IEEE Transactions on Neural Networks. 2001;12(4):657-673
  34. Giordano F, La Rocca M, Perna C. Input variable selection in neural network models. Communications in Statistics - Theory and Methods. 2014;43(4):735-750
  35. Stathakis D. How many hidden layers and nodes? International Journal of Remote Sensing. 2009;30(8):2133-2147
  36. Gao P, Chen C, Qin S. An optimization method of hidden nodes for neural network. In: 2010 Second International Workshop on Education Technology and Computer Science; 6–7 March 2010. Wuhan, China; 2010. pp. 53-56
  37. Bergmeir C, Benítez JM. On the use of cross-validation for time series predictor evaluation. Information Sciences. 2012;191:192-213
  38. Bergmeir C, Hyndman RJ, Koo B. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis. 2018;120:70-83
  39. Fukuchi JI. Subsampling and model selection in time series analysis. Biometrika. 1999;86(3):591-604
  40. Györfi L, Härdle W, Sarda P, Vieu P. Nonparametric Curve Estimation from Time Series. Vol. 60. Berlin: Springer; 2013
  41. Teräsvirta T, Lin CF, Granger CW. Power of the neural network linearity test. Journal of Time Series Analysis. 1993;14(2):209-220
  42. Lee TH, White H, Granger CW. Testing for neglected nonlinearity in time series models: A comparison of neural network methods and alternative tests. Journal of Econometrics. 1993;56(3):269-290
