Open access peer-reviewed chapter

Applications of the H-Principle of Mathematical Modelling

Written By

Agnar Höskuldsson

Submitted: 13 April 2016 Reviewed: 04 October 2016 Published: 26 April 2017

DOI: 10.5772/66153

From the Edited Volume

Advances in Statistical Methodologies and Their Application to Real Problems

Edited by Tsukasa Hokimoto


Abstract

Traditional statistical test procedures are briefly reviewed. It is pointed out that significance testing may not always be reliable. The author has formulated a modelling procedure, the H-principle, for how mathematical modelling should be carried out in the case of uncertain data. Here it is applied to linear regression. Using this procedure, the author has developed a common framework for carrying out linear regression. Six regression methods are analysed within this framework: two stepwise methods, principal component regression, ridge regression, PLS regression and an H-method. The same algorithm is used for all methods. It is shown how model validation and graphic analysis, which is popular in chemometrics, apply to all the methods. Validation of the methods is carried out by using numerical measures, cross-validation and test sets. Furthermore, the methods are tested by a blind test, where 40 samples have been excluded. It is shown how procedures in applied statistics and in chemometrics both apply to the present framework.

Keywords

  • linear regression
  • common framework for linear regression
  • stepwise regression
  • ridge regression
  • PLS regression
  • H-methods

1. Introduction

Regression analysis is the most studied subject within theoretical statistics. Numerous books have been published. Advanced program packages have been developed that make it easy for users to develop and study advanced models and methods.

Advanced program packages such as SAS and SPSS have been used by students since the 1970s and have become very advanced. The user fills out a ‘menu’ similar to those at restaurants. The program then carries out the analysis as requested, giving users the possibility of carrying out highly advanced analyses of the data. Users rely on being able to interpret the output in the way they learned at school. For example, if a variable/factor is significant, its presence in the model improves the predictions derived from the modelling results.

Today measurement instruments are becoming more and more advanced, e.g. optical instruments are becoming popular in industry. They may give thousands of values each time a sample is measured. In applied research, the tendency is to include as much information as possible in order to cover possible alternatives of the situation. When adding interactions or non-linear terms, we also see data having hundreds or thousands of variables.

There are three basic challenges in applied data analysis, which are as follows: (1) Data in applied sciences and industry are typically generated by instruments that produce many variables. In these cases, the data have latent structure, which means that the data values are geometrically located in a low-dimensional space. However, mathematical models and methods usually assume that the X-data have full rank. In these cases, the models are often incorrect and may give imprecise results.

(2) Typically, scientists develop solutions that are optimal or unbiased. This is a natural approach, because the derived solutions have important properties when data satisfy the given model. Simulations based on the model confirm the optimality, but in practice the data often do not satisfy the assumptions of the model. Forcing an optimal solution on the data may give results that have bad or no prediction ability. A better solution may then be obtained by relaxing the optimality or unbiasedness of the solution.

(3) It is a tradition to base results of regression analysis on testing the significance of the variables in the model. However, the results may not always be reliable. The influence of a variable can be so small that it is of no importance, see Section 3.3. Within professional organizations, this is considered to be a serious problem for researchers in the field.

The H-principle is a prescription for how a solution to a mathematical model should be generated when data are uncertain. It proposes that the solution be determined in steps, computing at each step the improvement in the solution and the associated precision. It is suggested that the solution at each step should be an optimal balance between the improvement and the associated precision. Determination of the solution continues as long as it is supported by the given (uncertain) data.

The author has developed a framework for linear regression, which is inspired by the H-principle. Using this framework, the same algorithm is applied for carrying out different types of regression analysis. Most regression methods based on linear algebra can be carried out within this framework. Here, we carry out the analysis of six different methods. Numerical and graphic analyses of the results are the same for all the regression methods.

In Section 2, we specify the notation used and briefly describe the data. In Section 3, we consider some basic issues in modelling data. The background for the H-principle is treated. In Section 4, we consider the latent variable regression model. In Section 5, we present the H-principle and show some examples of its usage. In Section 6, we present a common framework for linear regression. In Section 7, we discuss model validation. This is an important topic, but we only consider essential aspects that are used in the analysis of the six methods. In Section 8, we present the results for six different regression methods: (1) stepwise (forward) regression maximising covariance, (2) stepwise (forward) regression maximising R2, (3) principal component regression, (4) ridge regression, (5) PLS regression and (6) an H-method.

Section 9 briefly discusses the results presented. In Section 10, we mention application of the H-principle to multi-block and path modelling, non-linear modelling and extension to multi-linear algebra. Section 11 presents conclusions.


2. Notation, data and scaling

Matrices and vectors are denoted by bold letters; matrices by upper case and vectors by lower case. In order to facilitate the reading of the equations, different types of indices are used for the steps and for the matrices/vectors. The letters a and b are related to the steps in the algorithm. The letters i, j and k are used for the indices within a matrix/vector. It is assumed that instrumental data X, an N × K matrix, and response data Y, an N × M matrix, are given. The regression data are denoted by (X, Y). In some cases, only one y-variable, y, is used. This is done to simplify the equations. It is assumed that data are centred, which means that average values are subtracted from each column of X and Y. This also makes it easier to read the equations. xj is the jth column of X and xi is the ith row of X = (xij). A latent variable τ is a linear combination of the original measured variables,

\tau = a_1 x_1 + a_2 x_2 + \cdots + a_K x_K \qquad (1)

The data used here are from an optical instrument. The instrument gives 1200 values at each measurement. However, technical knowledge of the instrument suggests that only 40 values should be used for determining the substance in question, the y-values. Two hundred samples are measured. Forty samples are put aside for a blind test. Thus, the calibration analysis is based on 160 samples. This gives X as a 160 × 40 matrix and y as a vector of length 160. These data are challenging and represent a common situation when working with optical instruments (FTIR, NIR, fluorescence etc.).

It is sometimes important to determine if data should be scaled or not. Scaling can be obtained by multiplying X and Y by diagonal matrices. If C1 and C2 are diagonal matrices, the scaling of X is done by the transformation X ← (XC1) and of Y by Y ← (YC2). The linear least squares solution for (X, Y) is given by B = (X^T X)^{-1} X^T Y. This solution can be obtained from the linear least squares solution for scaled data, B1, as follows,

B = C_1\left[\{(XC_1)^T(XC_1)\}^{-1}(XC_1)^T(YC_2)\right]C_2^{-1} = C_1 B_1 C_2^{-1} \qquad (2)

This shows that when computing the linear least squares solution, we can work with the scaled data. The original solution is obtained by scaling ‘back’ as shown in the equation. We use this property also when we compute the approximate solution. The effect of scaling is better numerical precision. Scaling is necessary for the present data. If data are not scaled (e.g. to unit variance), numerical results beyond a dimension of around 15 may not be reliable. The reason is that Eqs. (20) and (23) are sensitive to small numerical values. Scaling is much debated among researchers. When scaling is used, one must ensure that all variables have values above the ‘noise’ level of the instrument. This is a difficult topic, which is not considered further here. If the original data follow a normal distribution, the scaled ones will not. However, for a large number of samples like that given here, the differences will be negligible.
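As an illustration of Eq. (2), the following minimal sketch (Python with NumPy, not part of the original chapter) scales centred data to unit variance, computes the least squares solution on the scaled data and scales the coefficients back. The function name and the choice of scaling to unit variance are illustrative assumptions.

```python
import numpy as np

def lsq_on_scaled_data(X, Y):
    """Minimal sketch of Eq. (2): least squares on scaled data, scaled back.
    X (N x K) and Y (N x M) are assumed to be centred already."""
    c1 = 1.0 / X.std(axis=0, ddof=1)               # diagonal of C1 (scaling to unit variance)
    c2 = 1.0 / Y.std(axis=0, ddof=1)               # diagonal of C2
    Xs, Ys = X * c1, Y * c2                        # X <- X C1, Y <- Y C2
    B1, *_ = np.linalg.lstsq(Xs, Ys, rcond=None)   # least squares solution for the scaled data
    return (c1[:, None] * B1) / c2[None, :]        # scale back: B = C1 B1 C2^-1
```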


3. Linear regression

3.1. Traditional regression model

It is assumed that there are given instrumental data X and response data y. A linear regression model is given by

y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + \varepsilon \qquad (3)

The x-variables are called independent variables and the y-variable is the dependent one. When the parameters β have been estimated as b, the estimated model is now

y = b_1 x_1 + b_2 x_2 + \cdots + b_K x_K \qquad (4)

When a new sample x0 = (x10, x20, …, xK0) is given, the estimated or predicted value is y0 = b1 x10 + … + bK xK0. There can be many estimates for the regression coefficients. Therefore, Greek letters are used for the theoretical parameters and Roman letters for the estimated values. It is common to use the linear least squares method for estimating the parameters. It is based on minimizing the residuals, (y − Xβ)^T(y − Xβ) → minimum, with respect to β. The solution is given by

b = (X^T X)^{-1} X^T y. \qquad (5)

It is common to assume that the residuals in Eq. (3) are normally distributed. This is often written as y ~ N(Xβ, σ^2 I). It means that the expected value of y is E(y) = Xβ and the variance is Var(y) = σ^2 I, where I is the identity matrix. The linear least squares procedure coincides with the maximum likelihood method in case the data follow a normal distribution. Assuming the normal distribution, the parameter estimate b has the variance given by

Var(b) = \sigma^2 (X^T X)^{-1}. \qquad (6)

Here σ2 is estimated by

s^2 = \sum_{i=1}^{N} e_i^2 / (N - K) \qquad (7)

where the residuals, ei’s, are computed from e = y – Xb.

3.2. Assumptions and properties

It is assumed that the matrix XTX has an inverse. Furthermore, it is assumed that the samples are independent of each other. In case the model (Eq. (3)) is correct and data follow the normal distribution, the estimate (Eq. (5)) has some important properties compared with other possible estimates:

  a. It is unbiased, E(b) = β.

  b. The estimate (Eq. (5)) has among all linear unbiased estimates of β the smallest variance. It means that if b1 is another unbiased estimate of β, then Var(b1) – Var(b) is a positive semi-definite matrix.

There is general agreement that the variance matrix Var(b) should be as small as possible. The matrix (X^T X)^{-1}, the precision matrix, shows how precise the estimates b are. The properties (a) and (b) state that, assuming the linear model (Eq. (3)), the solution (Eq. (5)) is, from a theoretical point of view, the best possible one. It often occurs that the model (Eq. (3)) contains many variables. Interpretation of the estimation results (Eq. (4)) is an important part of the statistical analysis. Therefore, it is common to evaluate the parameters with the aid of a significance test. If a variable is not significant, it may be excluded. A parameter associated with a variable is commonly evaluated by a t-test. A t-test of a parameter bi is given by

t = \frac{b_i}{\sqrt{\widehat{Var}(b_i)}} = \frac{b_i}{\sqrt{s^2 s_{ii}}} \qquad (8)

Here s^2 is given by Eq. (7) and s_{ii} is the ith diagonal element of (X^T X)^{-1}.
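A minimal sketch of Eqs. (5)–(8), assuming centred data in NumPy arrays, is given below: the least squares estimate, the residual variance and the t-values. The function name is illustrative.

```python
import numpy as np

def ols_with_t_values(X, y):
    """Sketch of Eqs. (5)-(8): least squares estimate, residual variance, t-values.
    X is an N x K centred matrix, y a centred vector of length N."""
    N, K = X.shape
    S_inv = np.linalg.inv(X.T @ X)         # the precision matrix (X'X)^-1
    b = S_inv @ X.T @ y                    # Eq. (5)
    e = y - X @ b                          # residuals
    s2 = (e @ e) / (N - K)                 # Eq. (7)
    t = b / np.sqrt(s2 * np.diag(S_inv))   # Eq. (8)
    return b, s2, t
```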

This is the motivation that program packages in statistics such as SAS and SPSS compute the linear least squares solution as the initial solution in a linear regression analysis. Significance of the parameters in Eq. (4) is shown using Eq. (8).

The problem in using the least squares solution appears when the precision matrix becomes close to singular. If the computer program (SAS or SPSS) detects that the computation of b may be imprecise due to near-singularity of the precision matrix, the user is informed that the estimates b may be imprecise. However, this information depends on the program’s judgement of the situation and the numerical precision of the computer. There are practical issues long before the question of the numerical uncertainty of the estimates, b, arises.

Consider two examples. First, suppose that X is N × 2 and that x1 = c x2 + δ, for some constant c, where |δ| < 10^{-5}. We may be able to compute (X^T X)^{-1}. However, inference on the model (Eq. (4)) may be uncertain or incorrect. In the second example, suppose that X is N × 40, but that the practical rank is 15. Assume that X^T X = X_1^T X_1 + X_2^T X_2, where X_2 = (x_{2,1} x_{2,2} … x_{2,40}). If all x_{2,i}, i = 1,…,40, are small, say |x_{2,i}| < 10^{-5}, inference from Eq. (4) may be uncertain or incorrect. The first example may not be realistic. However, the second one is realistic. It is common that data in applied sciences and industry are of reduced practical rank. In these cases, the model (Eq. (3)) is incorrect and leads to uncertain or incorrect results.

3.3. Stepwise linear regression

We shall consider the procedure of stepwise linear regression more closely. We do that by using the Cholesky factorization X^T X = F F^T, where F is lower triangular. Then we can write the columns of the data matrix X as

x_1 = F_{11} t_1
x_2 = F_{21} t_1 + F_{22} t_2
\;\;\vdots
x_K = F_{K1} t_1 + F_{K2} t_2 + \cdots + F_{KK} t_K \qquad (9)

The vectors (ti) are mutually orthogonal and of length 1, tiTtj = δij. The significance of xK, when x1, x2, …, xK–1 are given, is computed from Eq. (8) with

b_K = \frac{y^T (F_{KK} t_K)}{F_{KK}^2} = \frac{y^T t_K}{F_{KK}} \quad\text{and}\quad s_{KK} = 1/F_{KK}^2 \qquad (10)

This gives

t = \frac{y^T t_K}{s} \qquad (11)

This shows that the t-test for the significance of x_K is independent of the size of F_{KK}. When the selection of a new variable is carried out among many variables (e.g. 500), there is a considerable risk that F_{KK} is so small that the marginal effect of x_K is of no importance, although it is declared statistically significant. The issue is that users of program packages are not informed whether a significant variable improves the prediction of the response variable or not. A small numerical sketch of Eqs. (9)–(11) is shown below.
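The sketch below illustrates Eqs. (9)–(11) numerically, assuming centred NumPy arrays: the Cholesky factor F of X^T X gives the orthonormal vectors t_1, …, t_K, and the t-value of the last variable only involves y^T t_K and s, not the size of F_{KK}. Function and variable names are illustrative.

```python
import numpy as np

def t_value_of_last_variable(X, y):
    """Sketch of Eqs. (9)-(11): express X through the Cholesky factorization
    X'X = F F' and compute the t-value of the last variable x_K."""
    N, K = X.shape
    F = np.linalg.cholesky(X.T @ X)              # lower triangular F
    T = X @ np.linalg.inv(F).T                   # orthonormal vectors t_1, ..., t_K
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s = np.sqrt((e @ e) / (N - K))               # residual standard deviation
    return (y @ T[:, -1]) / s                    # Eq. (11): does not involve F_KK
```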

3.4. Variance of regression coefficients

It is instructive to study more closely the variance of the regression coefficients in the linear least squares model, Eq. (6). It can be written as

\widehat{Var}(b) = \left[y^T y - y^T X (X^T X)^{-1} X^T y\right] \times \left[(X^T X)^{-1}\right] / (N - K) \qquad (12)

An important objective of the modelling task is to keep the value of Eq. (12) as small as possible. Equation (12) is a product of two parts. For these two parts it can be shown that, when a new variable is added to the model

  i. the residual error, [y^T y − y^T X (X^T X)^{-1} X^T y], always decreases

  ii. the precision matrix, (X^T X)^{-1}, always increases

(In theory these measures can be unchanged, but in practice these changes always occur.) Note that the two terms, (i) and (ii), are equally important and appear in a symmetric way in Eq. (12). If the data are normally distributed, it can be shown that

  a. the residual error, [y^T y − y^T X (X^T X)^{-1} X^T y],

  b. the precision matrix, (X^T X)^{-1},

are stochastically independent. This means that knowledge of one of them gives no information on the other one. If, for instance, a certain fit has been obtained, there is no information on (b). In order to know the quality of predictions, we must compute (b) and use Eq. (12). Knowing (a), the results concerning the precision matrix can be good or bad. Program packages only use the t-test or some equivalent measure to test the significance of the variables/components. There is no information on the precision matrix. We must compute the precision matrix together with the significance testing in order to find out how well the model is performing. A modelling procedure must include both (a) and (b) in order to secure small values of Eq. (12). The procedure must handle both the decrease of (i) and the increase of (ii) during the model estimation.


4. Latent variable model

The measurement data in applied sciences and industry often represent a ‘system’, where variables depend on each other through chemical equilibrium, action-reaction in physics and physical and technical balance. Geometrically, the samples are located in a low-dimensional space. In these cases, the model (Eq. (3)) is incorrect (apart from simple, designed laboratory experiments) and the nice theory above may not be applicable. Why is this the case? When a new sample is available, we cannot automatically insert it into Eq. (4) and compute the y-value. We must ensure that the new sample is located geometrically along the samples used for the analysis, the rows of X. A correct model for this kind of data is

y = \alpha_1 \tau_1 + \alpha_2 \tau_2 + \cdots + \alpha_A \tau_A + \varepsilon. \qquad (13)

Here (τa) are latent variables and (αa) are the regression coefficients on the latent variables. The regression task is both to determine the latent variables, τ’s, and to compute the associated regression coefficients. The resulting Eq. (13) is then converted into Eq. (4) using Eq. (26). The model (Eq. (13)) is called latent structure linear regression. It is sometimes difficult to make a ‘good’ interpretation of the variables when using latent variables. However, by using an appropriate modelling procedure, it is possible to obtain a fairly good interpretation of the variables (see Section 5.2).


5. The H-principle of mathematical modelling

5.1. Background

In the 1920s, there were extensive discussions on the measurement aspects of quantum mechanics. W. Heisenberg pointed out that there are certain magnitudes that cannot be determined exactly at the same time. His famous uncertainty inequality states that there is a lower limit to how well conjugate magnitudes can be determined at the same time. The position and momentum of an elementary particle are an example. For that example, the inequality is

\Delta(\text{position}) \times \Delta(\text{momentum}) \ge \text{constant} \qquad (14)

The lower limit of the inequality is related to Planck’s constant. These considerations are based on physical theory. In practice, it means that there are some restrictions on the outcome of an experiment. The results may depend on the instrument and the phenomenon being studied. The restrictions are detected by the application of the measurement instrument. The uncertainty inequality and the associated theory are a kind of guidance when carrying out experiments.

5.1.1. The H-principle

When modelling data, there is an analogous situation. Instead of a measurement instrument we have a mathematical method. The conjugate magnitudes are ΔFit and ΔPrecision and they cannot be controlled at the same time. It is recommended to carry out the modelling in steps and at each step evaluate the situation as prescribed by the uncertainty inequality. At each step, it is assumed that there is given a weight vector. The recommendations of the H-principle of mathematical modelling are the following (expressed in the case of regression analysis):

  1. Determine the solution in steps. At each step, you specify how you want to look at the data by formulating how the weights are computed.

  2. At each step compute

    1. expression for improvement in fit, Δ(Fit)

    2. the associated precision, Δ(Precision)

  3. Compute the solution that minimises the product Δ(Fit) × Δ(Precision)

  4. In case the computed solution improves the prediction abilities of the model, the solution is accepted. If the solution does not provide this improvement, the modelling stops.

  5. The data are adjusted for what has been selected; restart at 1.

Note that the H-principle is a recommendation of how to proceed in determining the solution of a mathematical problem when data are uncertain. It is not like maximum likelihood, where a certain function is to be maximized. However, it is necessary to compute the improvement in the solution at each step and the associated precision. The optimal balance is described in Section 3, and at each step the situation is evaluated to find out whether the data support continuing.

5.1.2. Application to linear regression

Let us consider more closely how this applies to linear regression. The task is to determine a weight vector w according to this principle. For the score vector t = Xw we have

  a. Improvement in fit: |Y^T t|^2/(t^T t)

  b. Associated variance: Σ/(t^T t)

If S = X^T X, the adjustment (Eq. (20)) gives orthogonal score vectors, t’s. Therefore, the variance, (b), is used for the precision. Treating Σ as a constant, the task is to maximize

\left[\frac{|Y^T t|^2}{t^T t}\right] \times 1\bigg/\left[\frac{1}{t^T t}\right] = w^T X^T Y\, Y^T X\, w \qquad (15)

This is a maximization task because improvement in fit is negative. The maximization is carried out under the restriction that w is of length 1, |w|=1. By using the Lagrange multiplier method, it can be shown that the task is an eigenvalue task,

X^T Y Y^T X w = \lambda w \qquad (16)

In case there is only one y-variable, the eigenvalue task has a direct solution

w = \frac{X^T y}{|X^T y|} = \frac{C}{|C|} \qquad (17)

These are the solutions used in PLS regression. Thus, we can state that PLS regression is consistent with the H-principle. H-methods take the PLS solution as a starting point and determine how the prediction aspect of the solution can be improved.
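A minimal sketch of one step of this computation is given below, assuming centred NumPy arrays: it returns the weight vector of Eq. (17) for a single y and the dominant eigenvector of Eq. (16) when there are several y-variables. The function name is an assumption made for the example.

```python
import numpy as np

def h_principle_weight(X, Y):
    """Sketch of Eqs. (16)-(17): weight vector maximizing w'X'YY'Xw with |w| = 1."""
    C = X.T @ Y                                  # covariance matrix X'Y
    if C.ndim == 1 or C.shape[1] == 1:           # single y-variable: Eq. (17)
        w = C.ravel() / np.linalg.norm(C)
    else:                                        # several y-variables: Eq. (16)
        eigvals, eigvecs = np.linalg.eigh(C @ C.T)
        w = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    return w, X @ w                              # weight vector and score vector t = Xw
```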

5.2. Interpretation of the modelling results

The fit obtained by the score vector t is

\frac{(y^T t)^2}{t^T t} = \frac{(C^T C)^2}{C^T S C} = \left[c_1^2 f + c_2^2 f + \cdots + c_K^2 f\right]^2 \qquad (18)

where f = (C^T S C)^{-1/2}.

The regression coefficient can be written similarly as

\frac{y^T t}{t^T t} = \frac{(C^T C)^{1.5}}{C^T S C} = \left[c_1^2 g + c_2^2 g + \cdots + c_K^2 g\right]^{1.5} \qquad (19)

where g = (C^T S C)^{-1/1.5}.

At each step, we can see how much each variable contributes to the fit and to the regression coefficient.


6. A common framework for linear regression

6.1. Background: views on the regression analysis task

There are many ways to carry out a regression analysis. Here we present a general framework for carrying out linear regression that includes most methods based on linear algebra. The basic idea is to separate the computations into two parts. The first part is concerned with how it is preferable to look at the data. In practice, there can be different emphases in a regression analysis. Sometimes it is important to determine important variables. In other cases, it may be the derived predictions that are important. The other part is the numerical algorithm that computes the solution vector and associated measures. For the first part, it is assumed that a weight matrix W = (w1, …, wK) is given. Each weight vector should be of length one, although this is not necessary. The weight vectors (wa) specify how one wants to look at the data. They may be determined by some optimization or significance criteria. They may also be determined by a standard regression method such as PLS regression. In Section 6.4, we show the choices for the six regression methods. The role of the weight vectors is only to compute the loading vectors. The second part of the computations is a numerical algorithm, which is the same for all choices of W. There is no restriction on the weight vectors except that they may not give a zero loading vector, p_a ≠ 0.

6.2. Decomposition of data

The starting point is a variance matrix S and a covariance matrix C. S can be any positive semi-definite matrix, but C is assumed to be the covariance, C = XTY. The algorithm is formulated for multiple y’s.

Initially S0 = S and C0 = C and B = 0. For a = 1, …, K:

\text{Loading vector:}\quad p_a = S_{a-1} w_a \qquad (20)
\text{Scaling constant:}\quad d_a = 1/(w_a^T p_a) \qquad (21)
\text{Loading vector:}\quad q_a = C_{a-1}^T w_a \qquad (22)
\text{Loading weight vector:}\quad v_a \qquad (23)

The loading weight vectors are computed by, see reference [1],

v_1 = w_1, \quad v_a = w_a - d_1 (p_1^T w_a) v_1 - \cdots - d_{a-1} (p_{a-1}^T w_a) v_{a-1}, \quad a = 2, 3, \ldots, K \qquad (24)

S is adjusted by the loading vector pa and similarly for C,

S_a = S_{a-1} - d_a p_a p_a^T \qquad (25)
C_a = C_{a-1} - d_a p_a q_a^T \qquad (26)

The adjustment of S is a rank-one reduction. S also reduces in size by

\mathrm{tr}(d_a p_a p_a^T) = d_a p_a^T p_a = \frac{w_a^T S^2 w_a}{w_a^T S w_a} > 0. \qquad (27)

The loading weight matrix V satisfies V^T P = D^{-1}, see reference [1].

6.3. Expansion of matrices

At a = K we get SK = 0. Expanding S and C we get

S = d_1 p_1 p_1^T + \cdots + d_A p_A p_A^T + \cdots + d_K p_K p_K^T = P D P^T \qquad (28)
C = X^T Y = d_1 p_1 q_1^T + \cdots + d_A p_A q_A^T + \cdots + d_K p_K q_K^T = P D Q^T \qquad (29)

Inserting appropriate matrices we get

S^{-1} = d_1 v_1 v_1^T + \cdots + d_A v_A v_A^T + \cdots + d_K v_K v_K^T = V D V^T \qquad (30)
B = S^{-1} X^T Y = d_1 v_1 q_1^T + \cdots + d_A v_A q_A^T + \cdots + d_K v_K q_K^T = V D Q^T \qquad (31)
\hat{Y}^T \hat{Y} = d_1 q_1 q_1^T + \cdots + d_A q_A q_A^T + \cdots + d_K q_K q_K^T = Q D Q^T \qquad (32)

If S = XTX, we can expand X and Y in a similar way. Compute a score matrix T by T = XV. This gives

X = d_1 t_1 p_1^T + \cdots + d_A t_A p_A^T + \cdots + d_K t_K p_K^T = T D P^T \qquad (33)
\hat{Y} = d_1 t_1 q_1^T + \cdots + d_A t_A q_A^T + \cdots + d_K t_K q_K^T = T D Q^T \qquad (34)

The score vectors are mutually orthogonal. This follows from T^T T = D^{-1}. Generally, only A terms of the expansions are used, because it is verified that the modelling task cannot be improved beyond A terms. For a proof of the geometric properties of the vectors in these expansions, see reference [1]. A sketch of the decomposition is shown below.
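The following sketch, not taken from reference [1], combines the decomposition of Section 6.2 with the expansion of the regression coefficients in Eq. (31). The argument weight_fn is a placeholder for any of the weight choices in Section 6.4, and all names are illustrative.

```python
import numpy as np

def framework_decomposition(X, Y, weight_fn, A):
    """Minimal sketch of Eqs. (20)-(26) and (31). weight_fn(S, C) returns the
    weight vector w_a for the current reduced S and C; it stands for any of the
    choices in Section 6.4. X (N x K) and Y (N x M) are assumed centred."""
    S, C = X.T @ X, X.T @ Y                  # S_0 and C_0
    K, M = C.shape
    B = np.zeros((K, M))
    P, Q, V, d = [], [], [], []
    for a in range(A):
        w = weight_fn(S, C)
        p = S @ w                            # Eq. (20), X-loading vector
        da = 1.0 / (w @ p)                   # Eq. (21), scaling constant
        q = C.T @ w                          # Eq. (22), Y-loading vector
        v = w.copy()                         # Eq. (24), loading weight vector
        for db, pb, vb in zip(d, P, V):
            v = v - db * (pb @ w) * vb
        S = S - da * np.outer(p, p)          # Eq. (25)
        C = C - da * np.outer(p, q)          # Eq. (26)
        B = B + da * np.outer(v, q)          # Eq. (31), B = V D Q'
        P.append(p); Q.append(q); V.append(v); d.append(da)
    return B, np.array(P).T, np.array(Q).T, np.array(V).T, np.array(d)
```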

It can be recommended to use a test set, (Xt, Yt), when carrying out a regression analysis. Centring (and scaling if used) is done on the test set by using the corresponding values from (X, Y). The estimated y-values are computed as Ŷt = Xt B and the score vectors for the test set as Tt = Xt V. It is often useful to study the plots, showing columns of Y against columns of T (in stepwise regression, the plots are called added variable plots). Similarly, we can plot the score vectors of the test set, the columns of Tt, against columns of Yt.

6.4. Choices of weight vectors

The weight vectors for the six regression methods are as follows:

  (i) Stepwise regression, maximize covariance:

    w_a(i) = 1 \;\text{for}\; |C_i| = \max_{i=1,\ldots,K} |C_i|, \;\text{and}\; w_a(i) = 0 \;\text{otherwise} \qquad (35)

  (ii) Stepwise regression, maximize R2:

    w_a(i) = 1 \;\text{for}\; \frac{(y^T x_i)^2}{x_i^T x_i} = \max_{i=1,\ldots,K} \frac{(y^T x_i)^2}{x_i^T x_i}, \;\text{and}\; w_a(i) = 0 \;\text{otherwise} \qquad (36)

  (iii) and (iv) Eigenvector of S:

    \text{For } S = U E U^T, \quad w_a = u_a \qquad (37)

  (v) PLS regression:

    w_a = C_a / |C_a| \qquad (38)

    where C_a is the reduced covariance.

  (vi) H-method:

The weight vector of PLS regression is sorted with the largest element first, (w_1^{(s)}, w_2^{(s)}, \ldots, w_K^{(s)}). Columns of X are rearranged to match this sorting, giving X^{(s)}.

w_{a,m} = (w_1^{(s)}, w_2^{(s)}, \ldots, w_m^{(s)}, 0, \ldots, 0) \qquad (39)

The index m is chosen which gives the best explained variation,

t_i = X^{(s)} w_{a,i}, \quad \frac{(y^T t_m)^2}{t_m^T t_m} = \max_{i=1,\ldots,K} \frac{(y^T t_i)^2}{t_i^T t_i} \qquad (40)

For further details of this method, see reference [2]. It has been applied to different bio-assay studies, where there can be many variables (3000 or more). Comparisons with several other methods have shown that this method is preferable to work with in bio-assay studies, see references [3, 4]. Several other methods to improve the PLS solution have been developed. A sketch of two of the weight choices above is shown below.
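As an illustration, the sketch below gives two of the weight choices in a form that can be plugged into the decomposition sketch shown after Section 6.3: Eq. (35) for stepwise regression with maximal covariance and Eq. (38) for PLS regression, both written for a single y-variable. For ridge regression, the same framework is used with S = X^T X + kI as the starting variance matrix. Function names are assumptions made for the example.

```python
import numpy as np

def w_stepwise_max_cov(S, C):
    """Eq. (35): unit weight on the variable with maximal |covariance| (single y)."""
    w = np.zeros(C.shape[0])
    w[np.argmax(np.abs(C[:, 0]))] = 1.0
    return w

def w_pls(S, C):
    """Eq. (38): the reduced covariance scaled to unit length (single y)."""
    c = C[:, 0]
    return c / np.linalg.norm(c)
```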


7. Model validation

7.1. Numerical measures

The Mallows’ Cp value and Akaike’s information measure are commonly presented as results in a regression analysis. Mallows’ Cp value is given by

C_p = \frac{|y - \hat{y}_A|^2}{\sigma^2} + 2A - N \qquad (41)

As σ^2 we use the residual variance, s_A^2, at the maximal number of steps in the algorithm. C_p is an estimate of ‘(total mean squared error)/σ^2’. The interpretation of C_p is

  1. it should be as small as possible.

  2. its value should be as close to A as possible

  3. deviations of C_p from A suggest bias.

Akaike’s information measure is given by

AIC_A = N(\log(s_A^2) + 1) + 2(A + 1) \qquad (42)

It is an information measure that states the discrepancy between the correct model and the one obtained at step A. The number of components, A, is chosen that gives the smallest value of Eq. (42).

Both Cp and AIC have the property that they are not dependent on the given linear model, [5, 6]. Therefore, they can be used for all six methods considered here.

A t-value for the significance of a regression coefficient is given by

t\text{-value} = \frac{y^T t_A}{s_A} \qquad (43)

Here t_A is the Ath score vector, scaled to unit length. The significance, the p-value, of the t-value can be computed using the t-distribution. Although the assumptions of a t-test are not valid for any of the six methods, it is useful to look at the significance. Theoretically, this value should be larger than about 2 in order to be significant.

When comparing methods, it is useful to look at the estimate for the variance, Var(b), of the regression coefficients. We compute

(\mathrm{trace}(Var(b_A)))^{1/2} = \left(s_A^2 \sum_{a=1}^{A} d_a (v_a^T v_a)\right)^{1/2} \qquad (44)

We cannot use this measure to determine the dimension. However, it is useful in comparing different methods.
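A sketch of Eqs. (41)–(44) is given below, assuming that the scaling constants d_a and the loading weight vectors v_a from the framework are available as NumPy arrays (d as a 1-D array, V with v_a as columns). All argument names are illustrative.

```python
import numpy as np

def validation_measures(y, y_hat_A, A, sigma2, s_A, t_A, d, V):
    """Sketch of Eqs. (41)-(44). sigma2 is the residual variance at the maximal
    number of steps, s_A the residual standard deviation at step A, t_A the A-th
    score vector of unit length."""
    N = len(y)
    Cp = np.sum((y - y_hat_A) ** 2) / sigma2 + 2 * A - N        # Eq. (41)
    AIC = N * (np.log(s_A ** 2) + 1) + 2 * (A + 1)              # Eq. (42)
    t_value = (y @ t_A) / s_A                                   # Eq. (43)
    sd_b = np.sqrt(s_A ** 2 * np.sum(d[:A] * np.sum(V[:, :A] ** 2, axis=0)))  # Eq. (44)
    return Cp, AIC, t_value, sd_b
```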

7.2. The covariance

Eq. (15) is equivalent to the singular value decomposition of the covariance C. Therefore, it is suggested that the modelling of data should continue as long as the covariance is not zero. The dimension of a model can be determined by finding when C = X^T y ≈ 0 for the reduced matrices. One procedure is to study the individual terms, (x_i^T y). Assume that data can be described by a multivariate normal distribution with a covariance matrix Σ_xy. Then, it is shown in reference [7] that the sample covariances (x_i^T y)/(N − 3) are approximately normally distributed. If σ_{xi,y} = 0, it is shown in reference [7] that approximate 95% limits for the residual covariance, (x_i^T y)/(N − 3), are given by

\pm 1.96\sqrt{N}\,\sigma_{x_i}\sigma_y/(N-3) \approx \pm 1.96\sqrt{N}\,s_{x_i}s_y/(N-3) \qquad (45)

Thus, when modelling stops, it is required that all residual covariances should be within these limits. If σ_{xi,y} = 0, the distribution of the residual covariance approaches the normal distribution quickly by the central limit theorem. Therefore, it is a reliable measure for judging if the covariances have become zero or close to zero.

Another approach is to study the total value, y^T X X^T y = Σ_i (x_i^T y)^2. If the covariance matrix Σ_xy is zero, Σ_xy = 0, the mean, µ = E{y^T X X^T y}, and variance, σ^2 = Var{y^T X X^T y}, can be computed [8]. If the covariance is not zero, Σ_xy ≠ 0, it can be shown that E{y^T X X^T y} > µ. The upper 95% limit of a normal distribution N(µ, σ^2) is µ + 1.65σ. This is used with the computed mean and variance. In the analysis, it is checked whether y^T X X^T y is below the upper 95% limit (a one-sided test)

y^T X X^T y < \mathrm{trace}(X^T X)(y^T y)/N + 1.65\sqrt{2\,\mathrm{trace}(X^T X X^T X)(y^T y)^2/N^2} \qquad (46)

When this inequality is satisfied, there is an indication that modelling should stop.

In the analysis in Section 8, we use the p-value,

p\text{-value} = P\{u > y^T X X^T y\}, \quad\text{where } u \sim N(\mu, \sigma^2) \qquad (47)

This analysis has been found useful for different types of regression analysis.
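A sketch of this check is given below. It computes the mean and variance under zero covariance as in Eq. (46) and the one-sided p-value of Eq. (47); the use of scipy.stats.norm for the normal tail probability and the function name are implementation choices, not part of the chapter.

```python
import numpy as np
from scipy.stats import norm

def covariance_p_value(X, y):
    """Sketch of Eqs. (46)-(47): one-sided p-value for y'XX'y under the
    hypothesis of zero covariance. X and y are the reduced, centred data."""
    N = X.shape[0]
    S = X.T @ X
    observed = (X.T @ y) @ (X.T @ y)                       # y'XX'y
    mu = np.trace(S) * (y @ y) / N                         # mean under zero covariance
    sigma = np.sqrt(2.0 * np.trace(S @ S) * (y @ y) ** 2 / N ** 2)
    return 1.0 - norm.cdf(observed, loc=mu, scale=sigma)   # Eq. (47)
```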

7.3. Cross-validation and test sets

In stepwise regression, a search is carried out to find the next best variable. In PLS regression, a weight vector w is determined that gives the maximal size of the y-loading vector q. When search or optimization procedures are carried out, there is a considerable risk of overfitting. Evaluation of the results by standard statistical significance tests may show high significance, while other measures may show that the results are not significant. The uncertainties of predictions of new samples depend on the location of the samples in question. The further away from the centre (average values), the larger the uncertainties are. On the other hand, it is important to have large values, both large x-samples and y-values, in order to obtain stable estimates. In chemometrics, these special features of prediction are well known. A 10-fold cross-validation is often used. The samples are divided randomly into 10 groups. Nine groups are used for calibration (modelling) and the results applied to the 10th group. This is repeated for all groups so that all y-values are computed by a model, yc, that uses 90% of the samples. Experience has shown that this procedure may not always function well. As an example, one can mention the case where the y-values have a very skew distribution. There can be a big difference in the results from cross-validation depending on how well large y-values are represented in the groups. It may be necessary that each group has a similar ‘profile’ to the total set of samples.

For the present data, the y-values show a very skew distribution (log(log(y-values)) shows a normal distribution). It may be necessary that each group is representative of the whole data. This can be achieved by ordered cross-validation. Here, the samples are ordered in some way that reflects the variation or sizes in the data. In a 10-fold ordered cross-validation, the first group of samples is number 1, 11, 21, etc. of the ordered samples. The second group of samples is number 2, 12, 22, etc. In this way, each of the 10 groups of samples is representative with respect to the chosen ordering. The cross-validation is now carried out in the usual way.

As a result of cross-validation, we compute

D_a = 1 - |y - y_c|^2 / |y|^2 \quad\text{for } a = 1, 2, \ldots, A \qquad (48)

D_a is computed for each dimension. In the analysis below, sorted y-values are used in the ordered cross-validation. A test set for the analysis is also selected in a similar way. Samples are ordered according to the first PLS score vector. 15% of the samples, or 160 × 0.15 = 24 samples, are used in the test set (Xt, yt). Thus, the analysis in Section 8 is based on 136 samples. Equation (48) is also used when applying results to a test set (y = yt and yc = Xt × b). A sketch of the ordered cross-validation is shown below.
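The following sketch illustrates the ordered cross-validation split and Eq. (48); ordering by y-values, as used in the analysis, is one possible choice of ordering, and the function names are illustrative.

```python
import numpy as np

def ordered_cv_groups(y, n_groups=10):
    """Ordered cross-validation: sort the samples (here by y-value) and let
    group g consist of samples g, g + n_groups, g + 2*n_groups, ... of the ordering."""
    order = np.argsort(y)
    return [order[g::n_groups] for g in range(n_groups)]

def cv_measure(y, y_c):
    """Eq. (48): D_a = 1 - |y - y_c|^2 / |y|^2."""
    return 1.0 - np.sum((y - y_c) ** 2) / np.sum(y ** 2)
```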


8. Results for six different methods

8.1. Preliminary remarks

We have chosen here data that are challenging to work with. They are representative of many situations that industry and research projects find themselves in when working with optical instruments. It is expected that a large dimension is needed to model the data. The aim of the present work is to show how commonly used measures in applied statistics and popular chemometric procedures fit within the present framework. Therefore, we have chosen data where there is not a big difference between the six methods chosen here. We show that the different measures/procedures do not agree on what dimension should be used. Thus, the data analyst chooses one set of measures, a choice that will typically be based on experience. It is part of the industry standard for developing a calibration method, see reference [9], that when the development work is finished, a blind test is carried out for 40 new samples. The data analyst should not know the 40 new reference values, the y-values. The estimates for the new 40 samples should be evaluated by another person. This procedure is used here: the last 40 samples are put aside and the calibration analysis is based on the other samples. When the analysis has been completed for the six methods, we apply the results to the 40 samples. Optical instruments may give thousands of values for each measurement that is carried out (the one used here gives 1200 values). Experts often point out which part of the data should be used. Here it has been proposed to work with 40 variables, which all show significant correlation with the y-values. An upper limit for the dimension is chosen here to be 20. There is not a numerical problem in computing the precision matrix. The ratio between the largest and the smallest eigenvalue of S is λ_1/λ_40 = 9.0 × 10^6. However, experience suggests that the dimension should be less than 20.

8.2. Modelling results for six methods

In order to make the following tables easier to read, we use the following column headings:

  (i) Cp, Eq. (41), where σ^2 is the residual variance at dimension 20. For dimension 20, we get Cp = 19.

  (ii) AIC, Eq. (42), where s_A^2 is the residual variance at dimension A.

  (iii) the t-value, Eq. (43), where s_A is the residual standard deviation at dimension A.

  (iv) two-sided p-value for the t-test.

  (v) the size of the covariance, |X^T y|, at dimension A.

  (vi) one-sided p-value for the significance of (v), see Eq. (47).

  (vii) the estimate (Eq. (44)) for √trace(Var(b_A)).

  (viii) ordered cross-validation, samples ordered by y-values, Eq. (48).

  (ix) test set results, samples ordered by values of the first PLS score vector, Eq. (48).

The improvement in fit, size of score vectors, etc. are not shown in the tables below. Only dimensions from 10 to 20 are shown. Dimensions from 1 to 9 are all significant for all methods.

8.2.1. Forward stepwise regression, maximal covariance

Here a variable is selected at each step that has maximal covariance |(xiTy)| for the reduced data (Table 1).

No.   (i)    (ii)      (iii)   (iv)     (v)      (vi)    (vii)   (viii)   (ix)
10    90.5   –1571.9   12.65   0.0000   0.0518   0.000   0.23    0.9577   0.9761
11    88.0   –1572.2   –1.78   0.0765   0.0186   0.033   0.24    0.9609   0.9747
12    81.4   –1575.7    2.52   0.0127   0.0121   0.000   0.27    0.9640   0.9744
13    62.1   –1590.0    4.10   0.0001   0.0111   0.000   0.31    0.9631   0.9760
14    29.9   –1618.3   –5.61   0.0000   0.0070   0.000   0.43    0.9626   0.9804
15    18.3   –1629.1    3.64   0.0003   0.0053   0.002   0.46    0.9658   0.9786
16    19.1   –1627.3   –1.09   0.2772   0.0022   0.617   0.50    0.9662   0.9797
17    18.4   –1627.1    1.63   0.1045   0.0017   0.664   0.61    0.9663   0.9804
18    20.1   –1624.4   –0.60   0.5525   0.0012   0.744   0.64    0.9657   0.9804
19    17.4   –1626.5    2.18   0.0307   0.0013   0.462   0.78    0.9652   0.9813
20    19.0   –1623.8   –0.59   0.5539   0.0006   0.806   0.97    0.9656   0.9811

Table 1.

Nine measures at stepwise selection of variables having maximal covariance.

The 15th variable selection has the smallest AIC value and a highly significant t-value. Therefore, it seems reasonable to choose 15 variables. Cross-validation and test set results have reached their high level at dimension 10 or 11. This indicates some overfitting by choosing 15 variables.

8.2.2. Forward stepwise regression, maximal R2-value

Here a variable is selected at each step that has maximal size of (xiTy)2/(xiTxi) for the reduced data (Table 2).

No.   (i)    (ii)      (iii)   (iv)     (v)      (vi)    (vii)   (viii)   (ix)
10    87.4   –1591.7    3.64   0.0004   0.0411   0.058   0.34    0.9668   0.9815
11    75.4   –1599.5   –3.24   0.0014   0.0419   0.033   0.33    0.9662   0.9814
12    60.2   –1610.5   –3.69   0.0003   0.0301   0.000   0.33    0.9645   0.9789
13    49.7   –1618.4   –3.24   0.0014   0.0052   0.622   0.36    0.9643   0.9798
14    33.8   –1632.0    4.01   0.0001   0.0088   0.028   0.86    0.9645   0.9807
15    30.3   –1634.3   –2.26   0.0249   0.0073   0.117   2.25    0.9663   0.9794
16    26.2   –1637.4    2.40   0.0175   0.0056   0.365   2.25    0.9677   0.9780
17    23.8   –1638.9    2.05   0.0422   0.0076   0.023   2.28    0.9701   0.9778
18    22.7   –1639.1    1.76   0.0807   0.0026   0.541   2.38    0.9698   0.9789
19    20.2   –1640.9    2.10   0.0370   0.0011   0.821   3.33    0.9698   0.9823
20    19.0   –1641.3    1.79   0.0752   0.0021   0.541   4.72    0.9684   0.9836

Table 2.

Nine measures at stepwise selection of variables having maximal increase in R2.

It seems appropriate to choose 16 variables. The seventeenth is at the boundary of also being significant. Cp for 16 variables indicates some bias present. Cross-validation and test set results have reached high values at step 10, which indicates some overfitting.

8.2.3. Principal component regression

Here score vectors are selected in the order of the associated eigenvalues (Table 3).

No.   (i)     (ii)      (iii)    (iv)     (v)      (vi)    (vii)   (viii)   (ix)
10    854.5   –1250.7    –2.40   0.0175   0.0586   0.000   0.15    0.7984   0.9176
11    452.2   –1348.6   –11.08   0.0000   0.0555   0.000   0.16    0.8722   0.9521
12    421.2   –1356.5     3.24   0.0014   0.0322   0.000   0.20    0.8783   0.9498
13    171.2   –1464.5   –11.72   0.0000   0.0302   0.000   0.19    0.9334   0.9682
14    172.8   –1461.7    –0.48   0.6294   0.0108   0.000   0.23    0.9292   0.9674
15     97.6   –1509.7    –7.32   0.0000   0.0108   0.000   0.27    0.9421   0.9718
16     66.2   –1533.4     5.12   0.0000   0.0071   0.000   0.32    0.9508   0.9729
17     27.3   –1568.8     6.21   0.0000   0.0053   0.000   0.35    0.9566   0.9729
18     29.2   –1565.8     0.36   0.7170   0.0023   0.000   0.43    0.9563   0.9721
19     31.1   –1562.8     0.25   0.7991   0.0023   0.000   0.54    0.9553   0.9722
20     19.0   –1574.8     3.75   0.0002   0.0023   0.000   0.67    0.9579   0.9743

Table 3.

Nine measures at principal component regression.

Score vector no. 14 is not significant and perhaps should be excluded. It seems appropriate to choose dimension 17 here. It is common to remove score vectors that are not significant. However, it may not always be a good practice. The t-test is not a valid test, and the score vectors beyond dimension 20 are so small that they have no practical importance.

8.2.4. Ridge regression

Here score vectors are selected that correspond to the associated eigenvalues of S = XTX + kI. The value of k is estimated by leave-one-out cross-validation, see reference [1]. The optimal value of k is k = 0.0002 (Table 4).

No.   (i)     (ii)      (iii)    (iv)     (v)      (vi)    (vii)   (viii)   (ix)
10    782.7   –1250.6    –2.39   0.0176   0.0586   0.000   0.15    0.7984   0.9176
11    410.6   –1347.7   –11.03   0.0000   0.0555   0.000   0.16    0.8722   0.9520
12    382.1   –1355.4     3.22   0.0015   0.0322   0.000   0.20    0.8783   0.9497
13    151.7   –1461.5   –11.59   0.0000   0.0302   0.000   0.19    0.9334   0.9681
14    153.3   –1458.6    –0.48   0.6345   0.0108   0.000   0.23    0.9292   0.9673
15     85.8   –1504.0    –7.10   0.0000   0.0108   0.000   0.28    0.9421   0.9717
16     57.9   –1525.8     4.93   0.0000   0.0071   0.000   0.32    0.9508   0.9731
17     23.7   –1557.7     5.90   0.0000   0.0053   0.000   0.35    0.9566   0.9734
18     25.6   –1554.8     0.34   0.7345   0.0023   0.000   0.43    0.9563   0.9727
19     27.5   –1551.7     0.23   0.8168   0.0023   0.000   0.53    0.9553   0.9728
20     19.0   –1560.0     3.24   0.0014   0.0023   0.000   0.65    0.9579   0.9744

Table 4.

Nine measures at ridge regression.

It also seems appropriate here to choose dimension 17. The value of k is based on the full model. However, there is no practical difference between dimension 17 and a full model.

8.2.5. PLS regression

When working with PLS regression, each step is analysed more closely. The score vectors associated with the test set are computed as Tt = Xt × V. The correlation coefficient between the response values of the test set, Yt, and the 12th score vector is –0.058 and the associated p-value is 0.197. Therefore, we can conclude that the 12th score vector does not contribute to the modelling task and that the dimension should be 11 (Table 5).

No.   (i)     (ii)      (iii)   (iv)    (v)      (vi)    (vii)   (viii)   (ix)
10    207.0   –1514.5   7.14    0.000   0.0296   0.000   0.11    0.9426   0.9738
11    137.0   –1553.0   6.59    0.000   0.0116   0.000   0.14    0.9498   0.9735
12    107.8   –1570.9   4.55    0.000   0.0100   0.000   0.17    0.9568   0.9725
13     67.3   –1600.5   5.75    0.000   0.0101   0.000   0.20    0.9587   0.9744
14     58.0   –1607.1   3.02    0.003   0.0024   0.563   0.26    0.9628   0.9754
15     39.7   –1622.5   4.22    0.000   0.0040   0.000   0.35    0.9656   0.9762
16     26.6   –1634.4   3.78    0.000   0.0024   0.088   0.43    0.9654   0.9775
17     22.4   –1637.7   2.43    0.016   0.0023   0.013   0.47    0.9671   0.9804
18     23.5   –1635.6   0.96    0.340   0.0005   0.849   0.52    0.9673   0.9804
19     21.1   –1637.3   2.08    0.039   0.0004   0.822   0.85    0.9679   0.9826
20     19.0   –1638.7   2.02    0.045   0.0006   0.641   1.12    0.9674   0.9816

Table 5.

Nine measures at PLS regression.

8.2.6. H-method, maximal R2 value

Here, it can be recommended to use 13 components. When 13 have been selected, there is no covariance left (see columns (v) and (vi)) (Table 6).

No.   (i)     (ii)      (iii)   (iv)    (v)      (vi)    (vii)   (viii)   (ix)
10    243.3   –1488.5   8.86    0.000   0.0334   0.000   0.13    0.9499   0.9731
11    116.1   –1559.0   9.13    0.000   0.0168   0.000   0.18    0.9556   0.9759
12     75.4   –1587.5   5.65    0.000   0.0096   0.005   0.21    0.9576   0.9757
13     68.3   –1591.9   2.65    0.009   0.0086   0.001   0.22    0.9592   0.9767
14     38.1   –1617.4   5.33    0.000   0.0027   0.546   0.50    0.9615   0.9753
15     30.9   –1623.2   2.90    0.004   0.0040   0.020   0.56    0.9637   0.9764
16     26.5   –1626.6   2.45    0.015   0.0022   0.362   0.59    0.9654   0.9798
17     22.5   –1629.8   2.42    0.016   0.0018   0.275   0.63    0.9671   0.9823
18     20.8   –1630.7   1.91    0.058   0.0011   0.633   0.70    0.9674   0.9825
19     19.0   –1631.7   1.94    0.054   0.0008   0.691   0.81    0.9674   0.9815
20     19.0   –1630.8   1.41    0.161   0.0006   0.744   0.97    0.9672   0.9816

Table 6.

Nine measures at H-method determining maximal R2 value along the covariance.

8.3. Evaluation of modelling results

When the modelling has been carried out, industry standards recommend applying the model to 40 new samples, (Xn, yn). The estimated values are ŷn = Xn b, where Xn are the new X-values and b is the vector of regression coefficients from the method. Results are shown in Table 7.

                            Stepw. max cov   Stepw. max R2   PCR      Ridge regression   PLS regression   H-method max R2
1 − |yn − ŷn|²/(yn^T yn)    0.9346           0.9346          0.9330   0.9333             0.9472           0.9391
√trace(Var(b))              0.46             2.28            0.35     0.35               0.14             0.22

Table 7.

Comparison of results from six methods.

From Table 7, it can be seen that PLS regression is slightly better than the other methods. The table also shows the estimate of the standard deviations of the regression coefficients. It shows that PLS regression has a much smaller value than the other methods. In conclusion, PLS regression can be recommended for determining y-values for future samples.

Note that the values in the first row of Table 7 are smaller than those obtained for cross-validation and test sets during the calibration analysis. It indicates that the new 40 samples deviate in some way from the 160 samples that were used in the analysis. This is not explored further here.

8.4. Confidence interval for regression coefficients

Approximate confidence intervals for the regression coefficients can be obtained by the procedure presented in reference [10]. Let e_i be the ith residual obtained by a regression method. Define

h_i = x_i S_A^{-1} x_i^T \quad\text{and}\quad r_i = \frac{e_i}{\sqrt{1 - h_i}}, \quad i = 1, 2, \ldots, N. \qquad (49)

A new set of residuals e_i* is defined by randomly sampling, with replacement, from the modified residuals r_1, …, r_N. New response values are formed from the fitted values and the resampled residuals, and a new set of regression coefficients is obtained. This can be repeated, say, 200 times to get confidence intervals for the regression coefficients. A sketch of this procedure is given below.
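The following sketch of the residual bootstrap is built on Eq. (49) and reference [10], assuming a generic fitting function from the framework; the use of the pseudo-inverse for S_A^{-1}, the number of replicates and all names are illustrative assumptions.

```python
import numpy as np

def bootstrap_coefficients(X, y, fit_fn, n_boot=200, seed=0):
    """Sketch of the residual bootstrap of Eq. (49) and reference [10].
    fit_fn(X, y) is any regression method of the framework returning b."""
    rng = np.random.default_rng(seed)
    b = fit_fn(X, y)
    y_hat = X @ b
    e = y - y_hat
    S_inv = np.linalg.pinv(X.T @ X)                    # S_A^-1 (pseudo-inverse, assumption)
    h = np.einsum('ij,jk,ik->i', X, S_inv, X)          # h_i = x_i S_A^-1 x_i'
    r = e / np.sqrt(1.0 - h)                           # modified residuals, Eq. (49)
    boot = []
    for _ in range(n_boot):
        e_star = rng.choice(r, size=len(y), replace=True)   # resample residuals
        boot.append(fit_fn(X, y_hat + e_star))              # refit on y* = y_hat + e*
    return np.array(boot)                              # percentiles give the intervals
```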


9. Discussion

Interpretation of Tables 1–6 reflects personal experience. Others may select the dimension in a different way. The main issue is that the dimension for all the selected methods is less than 20. If a full model including all 40 variables is estimated, we are overfitting, because dimensions above 20 are included. The regression coefficients become large and inference from the estimation is not reliable.

The advantage of working with latent variables is that the first ones collect a large amount of the variation. This is well known from principal component analysis. Interpretation of variables cannot be done directly in latent variable models. However, using Eqs. (18) and (19) we can set up equations that show how individual variables contribute to the fit and the regression coefficients.


10. Applications of the H-principle

The H-principle has been extended to many areas of applied mathematics. It has in general been successful due to the presence of latent structure in data. The implementation of the H-principle is different from area to area. Furthermore, ‘household’ administration may be needed in order to keep the number of variables low (cf. Mallows theory). In many cases, it has opened up for new mathematics.

In reference [11], it has been extended to multi-block and path modelling, giving new methods for carrying out modelling of organized data blocks. The importance of these methods is that regression analysis is extended so that regression models are computed between data blocks. Methods of regression analysis can be used to evaluate relationships between data blocks. Complicated assumptions like those we see in structural equation modelling are not needed.

Linear latent structure regression can be viewed as determining a low-dimensional hyperplane in a high-dimensional space. In reference [12], it is extended to finding low-dimensional second, third and higher order surfaces in latent variables. Deviations from linearity often appear as curvature for low and high sample values, which can be handled by these surfaces. In reference [13], it has been applied to non-linear estimation that may give good low-rank solutions, where full-rank regularized solutions do not give convergence.

In reference [14], it is extended to multi-linear algebra, where there are many indices in data, e.g. X = (xijk) and y = (yij). The basic issue in multi-linear algebra is defining the inverse. This is solved by defining directional inverses for each dimension. It makes it possible to extend methods of ordinary matrix analysis to multi-linear algebra. These methods have been successfully applied to multi-linear data and to growth models.

The H-principle has been applied to several areas of applied statistics. Here we briefly mention a few.

In time series analysis and dynamic systems, the objective of modelling is both to describe the data and also to obtain good forecasts. Thus, the requirement to a latent variable is both that it describes X and also gives good forecasts. Traditional models only focus on the description of X. By requiring that the latent variables also should give good forecast, better models are obtained.

In pattern recognition and classification, the objective of modelling is both to obtain a good description of each group of data (detected or given a priori) and to get a low error rate of classification. Different ways to implement the H-principle in these areas have been developed. Applications show that these methods are superior to those based on statistical theory (e.g. linear and quadratic discriminant analysis based on the normal distribution, principal component analysis of each group of data).

11. Conclusion

We have presented a short review of standard regression analysis. Theoretically, it has important properties, which make it a standard approach in popular program packages. However, we show that the results obtained may not always be reliable when there is a latent structure in data. This is a serious problem in industry and applied sciences, because data typically have latent structure.

The H-principle is formulated in close analogy to the Heisenberg uncertainty inequality. It suggests that in the case of uncertain data, the computation of the mathematical solution should be carried out in steps, where at each step an optimal balance between the fit and the associated precision should be obtained. A general framework is presented for linear regression. Any set of weight vectors can be used that does not give loading vectors of zero size. Using the framework, six different regression methods are carried out. It is shown that the methods give low-rank solutions. The same type of numerical and graphic analysis can be carried out for any type of regression analysis within this framework. Traditional analysis in applied statistics and graphic analysis, which is popular in chemometrics, can be carried out for each method that uses the framework.

The algorithm can be viewed as an approximation to the full rank solution. Modelling stops, if further steps are not supported by data. Dimension measures, cross-validation and test sets are used to identify, when the steps are supported by data.

Acknowledgments

The cooperation with Clinical Biochemistry, Holbæk Hospital, Denmark, in the Sime project is highly appreciated. The author appreciates the use of the data from the project in the present article.

References

  1. Höskuldsson A. A common framework for linear regression. Chemometrics and Intelligent Laboratory Systems 146 (2015) 250–262. DOI: 10.1016/j.chemolab.2015.05.022
  2. Reinikainen SP, Höskuldsson A. COVPROC method: strategy in modeling dynamic systems. Journal of Chemometrics 17 (2003) 130–139. DOI: 10.1002/cem.770
  3. McLeod G, et al. A comparison of variate pre-selection methods for use in partial least squares regression: a case study on NIR spectroscopy applied to monitoring beer fermentation. Journal of Food Engineering 90 (2009) 300–307. DOI: 10.1016/j.jfoodeng.2008.06.037
  4. Tapp HS, et al. Evaluation of multiple variate methods from a biological perspective: a nutrigenomics case study. Genes Nutrition 7 (2012) 387–397. DOI: 10.1007/s12263-012-0288-4
  5. https://en.wikipedia.org/wiki/Mallows’s_Cp
  6. https://en.wikipedia.org/wiki/Akaike_information_criterion
  7. Siotani M, Hayakawa T, Fujikoshi Y. Modern Multivariate Analysis: A Graduate Course and Handbook. American Science Press: Columbus, Ohio, 1985.
  8. Höskuldsson A. Prediction Methods in Science and Technology, Vol. 1. Thor Publishing: Copenhagen, 1996. ISBN 87-985941-0-9
  9. Clinical and Laboratory Standards Institute, http://shop.clsi.org/chemistry-documents/
  10. Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge University Press: Cambridge, New York, 1997.
  11. Höskuldsson A. Modelling procedures for directed network of data blocks. Chemometrics and Intelligent Laboratory Systems 97 (2009) 3–10. DOI: 10.1016/j.chemolab.2008.09.002
  12. Höskuldsson A. The Heisenberg modelling procedure and applications to nonlinear modelling. Chemometrics and Intelligent Laboratory Systems 44 (1998) 15–30. DOI: 10.1016/S0169-7439(98)00111-7
  13. Höskuldsson A. H-methods in applied sciences. Journal of Chemometrics 22 (2008) 150–177. DOI: 10.1002/cem.1131
  14. Höskuldsson A. Data analysis, matrix decompositions and generalized inverse. SIAM Journal on Scientific Computing 15 (1994) 239–262. DOI: 10.1137/0915018
