
Fitting Models to Data: Residual Analysis, a Primer

By Julia Martin, David Daffos Ruiz de Adana and Agustin G. Asuero

Submitted: December 9th, 2016 | Reviewed: February 21st, 2017 | Published: July 5th, 2017

DOI: 10.5772/68049


Abstract

The aim of this chapter is to show how to check the underlying assumptions of a regression analysis (the errors are independent, have zero mean and constant variance, and follow a normal distribution), mainly when fitting a straight-line model to experimental data, via residual plots. Residuals play an essential role in regression diagnostics; no analysis is complete without a thorough examination of residuals. The residuals should show a trend that tends to confirm the assumptions made in performing the regression analysis, or at least should not show a trend that denies them. Although there are numerical statistical means of verifying observed discrepancies, statisticians often prefer a visual examination of residual graphs as a more informative and certainly more convenient methodology. When dealing with small samples, the use of graphical techniques can be very useful. Several examples taken from scientific journals and monographs are selected, dealing with linearity, calibration, heteroscedastic data, errors in the model, transforming data, time-order analysis and non-linear calibration curves.

Keywords

  • least squares method
  • residual analysis
  • weighting
  • transforming data

“Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity”.

(Box, G.E.P. Science and Statistics, J. Am. Stat. Ass. 1976, 71, 791–796)

1. Introduction

The purpose of this chapter is to provide an overview of checking the underlying assumptions of a regression analysis (errors normally distributed with zero mean and constant variance σi², and independent of one another) via the use of basic residual plots, such as plots of residuals versus the independent variable x. Compact formulae for the weighted least squares calculation of the a0 (intercept) and a1 (slope) parameters and their standard errors [1, 2] are shown in Table 1. The similarity with simple linear regression is obvious: it is recovered simply by making the weighting factors wi = 1. A number of selected examples taken from scientific journals and monographs are the subject of study in this chapter. No rigorous mathematical treatment is given to this interesting topic. Emphasis is mainly placed on a visual examination of residuals to check for model adequacy [3–7] in regression analysis. The role of residuals in regression diagnostics is vital; an analysis cannot be considered complete without their thorough examination [8–10].

Equation: $\hat{y}_i = a_0 + a_1 x_i$ | Slope: $a_1 = S_{XY}/S_{XX}$
Weights: $w_i = 1/s_i^2$ | Intercept: $a_0 = \bar{y} - a_1 \bar{x}$
Explained sum of squares: $SS_{Reg} = \sum w_i(\hat{y}_i - \bar{y})^2$ | Weighted residuals: $w_i^{1/2}(y_i - \hat{y}_i)$
Residual sum of squares: $SSE = \sum w_i(y_i - \hat{y}_i)^2$ | Correlation coefficient: $r = S_{XY}/\sqrt{S_{XX} S_{YY}}$
Means: $\bar{x} = \sum w_i x_i / \sum w_i$; $\bar{y} = \sum w_i y_i / \sum w_i$
Sums of squares about the mean: $S_{XX} = \sum w_i(x_i - \bar{x})^2$; $S_{YY} = \sum w_i(y_i - \bar{y})^2$; $S_{XY} = \sum w_i(x_i - \bar{x})(y_i - \bar{y})$
Standard errors: $s_{y/x}^2 = \dfrac{SSE}{n-2} = \dfrac{S_{YY} - a_1^2 S_{XX}}{n-2}$
$s_{a_0}^2 = \dfrac{s_{y/x}^2 \sum w_i x_i^2}{S_{XX}\sum w_i}$
$s_{a_1}^2 = \dfrac{s_{y/x}^2}{S_{XX}}$
$\mathrm{cov}(a_0, a_1) = -\bar{x}\, s_{y/x}^2 / S_{XX}$

Table 1.

Formulae for calculating statistics for weighted linear regression (WLR).
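As a minimal illustration of the Table 1 formulae, the sketch below (Python with NumPy assumed; the numerical example data are purely illustrative) computes the weighted slope, intercept and their standard errors; the unweighted case is recovered by setting all weights to 1.

```python
import numpy as np

def weighted_line_fit(x, y, w):
    """Weighted least squares straight-line fit using the Table 1 formulae."""
    x, y, w = map(np.asarray, (x, y, w))
    xbar = np.sum(w * x) / np.sum(w)            # weighted mean of x
    ybar = np.sum(w * y) / np.sum(w)            # weighted mean of y
    Sxx = np.sum(w * (x - xbar) ** 2)
    Syy = np.sum(w * (y - ybar) ** 2)
    Sxy = np.sum(w * (x - xbar) * (y - ybar))
    a1 = Sxy / Sxx                              # slope
    a0 = ybar - a1 * xbar                       # intercept
    n = len(x)
    s2_yx = (Syy - a1 ** 2 * Sxx) / (n - 2)     # regression variance
    s_a1 = np.sqrt(s2_yx / Sxx)                 # standard error of the slope
    s_a0 = np.sqrt(s2_yx * np.sum(w * x ** 2) / (Sxx * np.sum(w)))  # SE of the intercept
    r = Sxy / np.sqrt(Sxx * Syy)                # correlation coefficient
    return a0, a1, s_a0, s_a1, r

# Illustrative data; unweighted regression corresponds to unit weights
x = [0, 1, 2, 3, 4]
y = [0.1, 2.2, 3.9, 6.1, 8.0]
print(weighted_line_fit(x, y, np.ones(len(x))))
```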

The residuals are geometrically the distances calculated in the y‐direction [1, 2, 11, 12] (vertical distances) between the points and the regression line (error free in the independent variable)

$r_i = y_i - \hat{y}_i$ (1)

The calculated regression line

$\hat{y}_i = a_0 + a_1 x_i$ (2)

corresponds to the model

$y_i = \alpha_0 + \alpha_1 x_i + \varepsilon_i$ (3)

where εi is the random error (a0 and a1, Eq. (2), are the estimates of the true values α0 and α1), leading to a minimum sum of squared residuals

$Q_{min} = \left[\sum r_i^2\right]_{min}$ (4)

Note that the model error is given by

$\varepsilon_i = y_i - E(y_i)$ (5)

where E(yi) is the expected value of yi [6]. Thus, the residuals ri may be viewed as the differences between what is really observed and what is predicted by the model equation (i.e. the amount that the regression equation is not able to explain). The residuals ri may be thought of as the observed errors [10] in correct models. The residuals reveal the existing asymmetry [13] between the roles of the response and the independent variable in regression problems. A number of assumptions concerning the errors [14, 15] have to be made when performing a regression analysis, for example, normality, independence, zero mean and constant variance (homoscedasticity property).

An assumption that the errors are normally distributed is not required to obtain the parameter estimates by the least squares method. However, for inferences and estimates (standard errors, t- and F-tests, confidence intervals) to be made about the regression, it is necessary to assume that the errors are normally distributed [11]. The assumption of normality is nevertheless plausible, as in many real situations errors tend to be normally distributed due to the central limit theorem. The assumption that no residual term is correlated with another, combined with the normality assumption, means [10] that the errors are also independent. Constructing a normal probability plot of the residuals [16–18] is a way to verify the assumption of normality. Residuals are ordered and plotted against the corresponding percentage points of the standardized normal distribution (normal quantiles). If the residuals then fall along a straight line, the assumption of normality is satisfied.
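A minimal sketch of such a normal probability plot is given below (NumPy, SciPy and Matplotlib assumed; the straight-line data are synthetic and purely illustrative).

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 15)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2, size=x.size)   # synthetic calibration data

a1, a0 = np.polyfit(x, y, 1)                 # straight-line fit
residuals = y - (a0 + a1 * x)                # residuals, Eq. (1)

# probplot orders the residuals and pairs them with normal quantiles;
# points lying close to the reference line support the normality assumption
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
```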

A standardized residual is the residual divided by the standard deviation of the regression line

$e_{r_i} = \dfrac{r_i}{s_{y/x}}$ (6)

The standardized residuals are normally distributed with a mean value of zero and (approximately) unity variance [10, 19]

$\mathrm{Var}(r_i) = \mathrm{Var}(y_i) - \mathrm{Var}(\hat{y}_i) = \sigma^2 - \sigma^2\left(\dfrac{1}{\sum w_i} + \dfrac{(x-\bar{x})^2}{S_{XX}}\right) = \sigma^2\left(1 - \dfrac{1}{\sum w_i} - \dfrac{(x-\bar{x})^2}{S_{XX}}\right) = \sigma^2(1 - h_{ii})$ (7)
$\mathrm{Var}(e_{r_i}) = 1 - h_{ii}$ (8)

The hii term may be regarded as measuring the leverage of the data point (xi, yi) (see below). The estimated residuals are correlated [10], but this correlation does not invalidate the residual plot when the number of points is large in comparison with the number of parameters estimated by the regression. As pointed out by Behnken and Draper [20]: ‘In many situations little is lost by failing to take into account the differences in variances’. Standardized residuals are useful in looking for outliers. They should take random values, with about 95% of them falling between −2 and +2 for a normal distribution.
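A minimal sketch of Eqs. (6)–(8) for the unweighted straight-line case (wi = 1, so 1/Σwi reduces to 1/n), using the first points of Table 5 purely as an illustration (NumPy assumed):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([0.2, 3.6, 7.5, 11.5, 15.0, 17.0])   # first rows of Table 5, as an example

n = x.size
a1, a0 = np.polyfit(x, y, 1)
r = y - (a0 + a1 * x)                       # ordinary residuals, Eq. (1)
s_yx = np.sqrt(np.sum(r ** 2) / (n - 2))    # standard deviation of the regression

Sxx = np.sum((x - x.mean()) ** 2)
h = 1.0 / n + (x - x.mean()) ** 2 / Sxx     # leverages h_ii (unweighted case)

e_std = r / s_yx                            # standardized residuals, Eq. (6)
e_int = r / (s_yx * np.sqrt(1 - h))         # residuals scaled by sqrt(1 - h_ii), variance close to 1
print(np.round(e_std, 2), np.round(e_int, 2))
```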

The tendencies followed by the residuals should confirm the assumptions we have previously made [21], or at least not deny them. Remember the sentence of Fisher [22, 23]: ‘Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis’. Conformity checks of the assumptions inherent to regression rest mainly on the examination of the residual pattern. Although there are statistical ways of numerically measuring some of the observed discrepancies [24], graphical methods play an important role in data analysis (Table 2) [25–31]. A quick examination of the residuals often yields more information than significance tests of some limited null hypothesis. Nevertheless, an objective, unambiguous determination should be based on standard statistical methodology. This chapter is mainly focused on residual plots rather than on formulas or hypothesis testing. As we will see in the selected examples, the plots easily reveal violations of the assumptions when they are severe enough to warrant correction.

Sentence | Author(s)/reference
‘Most assumptions required for the least squares analysis of data using the general linear model can be judged using residuals graphically without the need for formal testing’ | Darken [25]
‘Graphical methods have an advantage over numerical methods for model validation because they readily illustrate a broad range of complex aspects of the relationship between the model and the data’ | NIST/SEMATECH [26]
‘There is no single statistical tool that is as powerful as a well-chosen graph’ | Huber [27]
‘Although there are statistical ways of numerically measuring some of the observed discrepancies, statisticians themselves prefer a visual examination of the residual plots as being more informative and certainly more convenient’ | Belloto and Sokoloski [28]
‘Eye-balling can give diagnostic insights no formal diagnostic will ever provide’ | Chambers et al. [29]
‘Graphs are essential to good statistical analysis’ | Anscombe [30]
‘One picture says more than a thousand equations’ | Sillen [31]

Table 2.

Sentences of some authors about the use of graphical methods.

The main forms of representation [10] of residuals are: (i) overall (global); (ii) in time sequence, if the order is known; (iii) against the fitted values ŷ; (iv) against the independent variables xji for j = 1, 2, …, k; and (v) in any way that is sensitive to the problem under analysis.

The following points can be verified in the representation of the residuals: (i) the form of the representation; (ii) the numbers of positive and negative residuals should be equivalent, with a vanishing median; (iii) the sequence of residual signs must be randomly distributed between + and −; and (iv) it is possible to detect spurious results (outliers), whose magnitudes are greater than those of the rest of the residuals.

Residual plots appear more and more frequently [32–39] in papers published in analytical journals. In general, these plots, as well as those discussed in this chapter, are very basic and can be subject to some criticism. For example, the residuals are not totally distributed independently of x, since [10, 19] the substitution of the parameters by their estimates introduces some dependence. However, more sophisticated methods have been developed [40–44] based on standardized, studentized, jack-knife, predictive, recursive residuals, and so on (Table 3). In spite of their importance, they are considered beyond the scope of this contribution.

Symbol | Name | Formula | Comments
$e_i$ | Classical residuals | $e_i$ |
$e_{Ni}$ | Normalized residuals | $e_i/s_{y/x}$ |
$e_{Si}$ | Standardized residuals | $e_i/(s_{y/x}\sqrt{1-h_{ii}})$ | Identification of heteroscedasticity
$e_{Ji}$ | Jack-knife residuals | $e_{Si}\sqrt{\dfrac{n-m-1}{n-m-e_{Si}^2}}$ | Identification of outliers
$e_{Pi}$ | Predicted residuals | $e_i/(1-h_{ii})$ |
$e_{Ri}$ | Recursive residuals | | Identification of autocorrelations

Table 3.

Types of residuals and suitability for diagnostic purposes [42–44].

Despite the frequency with which the correlation coefficient is referred to in the scientific literature as a criterion of linearity, this assertion is not free from reservations [1, 45–49], as evidenced several times throughout this chapter.

The study of linearity implies not only a graphical representation of the data. It is also necessary to carry out a statistical check, for example, analysis of variance [50–54], which requires repeated measurements. This implies the fulfilment of two requirements: the homogeneity (homoscedasticity) of the variances and the normality of the residuals. Incorporating replicates into the calibration estimation offers the possibility of looking at the calibration not only in the context of fitting but also of the uncertainty of measurements [15]. However, if replicate measurements are not made, and an estimate of the mean square error (replication variance) is not available, the regression variance

$s_{y/x}^2 = \dfrac{\sum (y_i - \hat{y}_i)^2}{n-2} = \dfrac{\sum r_i^2}{n-2}$ (9)

may be compared with the estimated variance around the mean of the yi values [55]

$s_y^2 = \dfrac{\sum (y - \bar{y})^2}{n-1}$ (10)

by means of an F-test. The goodness of fit of non-linear calibration curves can be improved by raising the degree of the fitting polynomial, performing then an F-test (the quotient of the residual variances for the kth and (k + 1)th polynomial degrees) [56]. A suitable test can also be carried out according to Mandel [56, 57]. However, this contribution is essentially devoted to the use of basic graphical plots of residuals, a simple and at the same time powerful diagnostic tool, as we will have occasion to show throughout this chapter.
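One way to carry out the comparison of Eqs. (9) and (10) is sketched below (NumPy and SciPy assumed; the data are synthetic and purely illustrative).

```python
import numpy as np
from scipy import stats

# Illustrative data (not taken from the chapter tables)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.8])

n = x.size
a1, a0 = np.polyfit(x, y, 1)
res = y - (a0 + a1 * x)

s2_yx = np.sum(res ** 2) / (n - 2)              # regression variance, Eq. (9)
s2_y = np.sum((y - y.mean()) ** 2) / (n - 1)    # variance about the mean, Eq. (10)

F = s2_y / s2_yx                                # a large F indicates the line explains most of the spread
p = stats.f.sf(F, n - 1, n - 2)
print(f"F = {F:.1f}, p = {p:.2e}")
```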

Several examples taken from scientific journals and monographs are selected in order to illustrate this chapter: (1) linearity of calibration methods: fluorescence data [58] as an example; (2) calibration graphs: the question of intercept [59] or no intercept; (3) errors are not in the data, but in the model: the CO2 vapour pressure [59] versus temperature dependence; (4) heteroscedastic data: high-performance liquid chromatography (HPLC) calibration assay [60] of a drug; (5) transforming data: preliminary investigation of a dose-response relationship [61, 62], the microbiological assay of vitamin B12; (6) the variable that has not yet been discovered: the solubility of diazepam [28] in propylene glycol; and (7) no model is perfect: nickel(II) by atomic absorption spectrophotometry.

2. Linearity in calibration methods: fluorescence data as example

Calibration is a crucial step, an essential part, the key element, the soul of every quantitative analytical method [38, 40, 63–69], and influences significantly the accuracy of the analytical determination. Calibration is an operation that usually relates an output quantity to an input quantity for a measuring system under given conditions (The International Union of Pure and Applied Chemistry (IUPAC)). The topic has been the subject of a recent review [67] focused on purely practical aspects and leaving aside the mathematical and metrological aspects. The main role of calibration is to transform the intensity of the measured signal into the analyte concentration in a way as accurate and precise as possible. Guidelines for calibration and linearity are shown in Table 4 [70–81].

Scientific association or agency | Reference
Calibration
International Union of Pure and Applied Chemistry (IUPAC) | Guidelines for calibration in analytical chemistry [70]
International Organization for Standardization (ISO) | ISO 8466-1:1990 [70]; ISO 8466-2:2001 [71]; ISO 11095:1996 [72]; ISO 28037:2010 [73]; ISO 28038:2014 [74]; ISO 11843-2:2000 [75]; ISO 11843-5:2008 [76]
LGC Standards Proficiency Testing | LGC/VAM/2003/032 [77]
Linearity
International Conference on Harmonization (ICH) | Guideline Q2A [78]
Clinical Laboratory Standards Institute (CLSI) | EP6-A [79]
Association of Official Analytical Chemists (AOAC) | AOAC Guidelines 2002 [80]
European Union | EC 2002/657 [81]

Table 4.

Scientific organizations that approve calibration guidelines [70–81].

Linearity is the basis of many analytical procedures. It has been defined as [78] the ability (within a certain range) to obtain test results that are directly proportional to the concentration (amount) of analyte in the sample. Linearity is one of the most important characteristics for the evaluation of accuracy and precision in assay validation, and as a calibration curve is seldom perfectly linear, it is crucial to assess linearity during method validation. Such evaluation is also recommended in regulatory guidelines [78–81]. Although it may seem that everything has been said on the subject of linearity, it is still an open question and subject to debate. It is therefore not surprising that proposals are made from time to time to resolve this issue [54, 82–92].

However, in calibration, statistical linearity tests between the variables are rarely performed in analytical studies. When dealing with regression models, the most convenient way of testing linearity beside a visual assessment is plotting the residual sequence in the concentration domain. A simple nonparametric statistical test for linearity, known as ‘the sign test’ [9, 16, 28], is based on the examination of the residuals (ri) sign sequence.

The residuals should be distributed in a random way. That is, the numbers of plus and minus residual signs should be equal, with the error symmetrically distributed (the null hypothesis for the test), when the variables are connected through a true linear relationship. The probability of getting a random pattern of residual signs is related to the number of runs in the sequence of signs. Intuitively and roughly speaking, the more randomly distributed these changes are [93], the better the fit. A run is a sequence of the same sign, independent of its length. A pattern of residual signs of the kind [+ − − + + − + − + − +], from independent measurements, is considered random, whereas a pattern like [− − − + + + + + + − −] is not. Though a formal statistical test may be carried out [94] with the information afforded by the residual plot, it requires a number of points greater than is usual in calibration measurements.
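A minimal sketch of such a sign (runs) count, applied to the two patterns quoted above, is given below (NumPy assumed); the expected number of runs is the usual Wald-Wolfowitz expression and is included only as an illustrative yardstick, not as the formal test referred to in [94].

```python
import numpy as np

def runs_in_signs(residuals):
    """Count runs in the sequence of residual signs (zeros are ignored)."""
    signs = np.sign(residuals)
    signs = signs[signs != 0]
    runs = 1 + np.sum(signs[1:] != signs[:-1])
    n_plus = np.sum(signs > 0)
    n_minus = np.sum(signs < 0)
    n = n_plus + n_minus
    expected = 1 + 2 * n_plus * n_minus / n     # Wald-Wolfowitz expected number of runs
    return runs, expected

# The two sign patterns quoted in the text
random_like = np.array([+1, -1, -1, +1, +1, -1, +1, -1, +1, -1, +1])
suspect     = np.array([-1, -1, -1, +1, +1, +1, +1, +1, +1, -1, -1])
print(runs_in_signs(random_like))   # many runs, at or above the expected value: random-looking
print(runs_in_signs(suspect))       # few runs relative to expectation: suggests a systematic trend
```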

The fluorescence in arbitrary units of a series of standards is shown in Table 5. To these data that appear to be curved, a straight line may be fitted (Figure 1, top) which results in an evident lack of fit, though the correlation coefficient (R) of the line is equal to 0.995 2. A plot of the resulting residuals ri against the x‐values (reduced residuals on the secondary axis) is also shown in Figure 1 (top), and allows checking for systematic deviations between data and model.

Concentration (μM) | Fluorescence (arbitrary units) | Concentration (μM) | Fluorescence (arbitrary units)
0 | 0.2 | 6 | 20.4
1 | 3.6 | 7 | 22.7
2 | 7.5 | 8 | 25.9
3 | 11.5 | 9 | 27.6
4 | 15 | 10 | 30.2
5 | 17 | |

Table 5.

Calibration fluorescence data [58].

Figure 1.

Fitting a straight line (top), a quadratic function (middle) and a cubic function (bottom) to fluorescence data compiled in Table 5.

The pattern of the signs of the residuals indicates that fitting the fluorescence data by a straight-line equation is inadequate; higher-order terms should possibly be added to account for the curvature. Note that the straight-line model is not adequate even though the reduced residuals are less than 1.5 in all cases. When an erroneous equation is fitted to the data [95–97], the information contained in the form of the residual plot is a valuable tool, which indicates how the model equation must be modified to describe the data in a better way. A curved calibration line may be fitted to a power series. The use of a quadratic (second-degree) equation is enough in this case to obtain a good fit: the scattering of the residuals above and below the zero line is similar, as shown in Figure 1 (middle). Then, when no obvious trends in the residuals are apparent, the model may be considered an adequate description of the data. The simplest model, or the model with the minimum number of parameters that adequately fits the data in question, is usually the best choice. ‘Non sunt multiplicanda entia praeter necessitatem’ (Occam’s razor) [98]. In fact, the order of the polynomial (k) must not rise above 2, since $[s_{y/x}]_{k=2} = 0.3994 < [s_{y/x}]_{k=3} = 0.4142$.
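A minimal sketch of this comparison with the Table 5 data is given below (NumPy assumed); the printed $s_{y/x}$ values for the quadratic and cubic fits should be of the order of those quoted above, and the residual sign strings make the curvature of the straight-line fit visible.

```python
import numpy as np

conc = np.arange(0, 11, dtype=float)                      # Table 5 concentrations, uM
fluo = np.array([0.2, 3.6, 7.5, 11.5, 15.0, 17.0,
                 20.4, 22.7, 25.9, 27.6, 30.2])           # fluorescence, arbitrary units

for degree in (1, 2, 3):
    coeffs = np.polyfit(conc, fluo, degree)
    residuals = fluo - np.polyval(coeffs, conc)
    s_yx = np.sqrt(np.sum(residuals ** 2) / (conc.size - degree - 1))
    signs = "".join("+" if r > 0 else "-" for r in residuals)
    print(f"degree {degree}: s(y/x) = {s_yx:.4f}  residual signs: {signs}")
```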

In summary, when a correct relationship between the response and the independent variable(s) is assumed, the residual plot should resemble that of Figure 2 (left). All residuals should fall into the shaded area, with no discernible pattern, that is, at random. If the representation of the residuals resembles that of Figure 2 (right), where curvature is apparent, the model can probably be improved by adding a quadratic term or higher-order terms, which should better describe the required curvature.

Figure 2.

Residuals in a random (left) and parabolic (right) form.

Calibration curves with a non-linear shape also appear in analytical chemistry [99–104]. When the data in the x-range (calibration) vary greatly, as they do in many real problems, the response becomes non-linear (Table 6) [101, 105–107] at sufficiently large x-values. The linear range of liquid chromatography-tandem mass spectrometry (LC-MS/MS) is typically about three orders of magnitude. The analyst’s usual response in this case is sometimes to restrict the concentration range for the purpose of using a linear response, thus introducing biases in the determination, since the choice of the ‘linear region’ is usually done in an arbitrary way. The use of a wider range in the standard curve is preferred in order to avoid sample dilution, saving time and labour [108]. An acceptable approach to extend the dynamic range of the standard curve is the use of quadratic regression. Among the possible causes of standard curve non-linearity are saturation at high concentration during ionization, the formation of dimer/multimer/cluster ions, or detector saturation. It has been established that when the analyte concentration is above ∼10−5 M, its response starts to saturate, providing a non-linear response.

Response function | Name | Technique
$y = a_0 + a_1 x$ | Beer law | Absorption spectrophotometry
$y = A + B\log x$ | Nernst equation | Electrochemistry
$y = ax^n$ ($\log y = n\log x + \log a$) | Scheibe-Lomakin | Emission spectroscopy; ESI-MS; ELSD; CAD
$y = ax^n + a_0$ ($0 < y < y'$); $y = kx + y_0$ ($y > y'$) | | TLC-densitometry
$b_n y^n + \dots + b_1 y = x$ | | DAD
$by + b_0 = x$ | | ESI-MS
$y = a_0 + a_1 x + a_2 x^2$ | Wagenaar et al. | Atomic absorption spectrophotometry, liquid chromatography/MS/MS
$\log y = a_0 + a_1\log x + a_2(\log x)^2$ | | CAD
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$ | | Ion-Trap-MS
$y = A + B(1 - \exp(-Cx))$ | Andrews et al. | Atomic absorption spectrophotometry
$y = \dfrac{A - D}{1 + (x/C)^B} + D$ | Rodbard | Radioimmunoassay

Table 6.

Response functions used in instrumental analysis [105107].

ESI‐MS: electrospray ionization mass spectrometry; TCD: thermal conductivity detector; TLC‐densitometry: TLC with detection by optical densitometry; ELSD: evaporative light scattering detector; CAD: charged aerosol detection.


Quadratic curve-fitting of calibration data is more appropriate [104, 109–117] than straight-line linear regression in the case of some quantification methods. Matrix-related non-linearity is typical of methods such as LC-MS/MS. In order to provide an appropriate validation strategy for such cases, the straight-line fit approximation has been extended to quadratic calibration functions. When such quadratic terms are included [10, 118–120], precautions should be taken because of the consequent multicollinearity problems.

However, the use of quadratic regression model is considered as less appropriate or even viewed with suspicion by some regulatory agencies and, as a result, not often used in regulated analysis. In addition, the accuracy around the upper limit of quantitation (ULOQ) can be affected if the curve range is extended to the point where the top of the curve is flat.

Statistical tests may also be considered for proving linearity, like Mandel’s test [57], which compares the residual errors of the quadratic and linear regressions by means of an F-test at a given significance level, or like the lack-of-fit test by analysis of variance (ANOVA), or testing homoscedasticity (the homogeneity of the residual variances).
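A minimal sketch of Mandel's fitting test, in the form in which it is commonly quoted (the drop in residual variance on going from the linear to the quadratic fit is compared with F(1, n − 3)), is given below, applied to the Table 5 data (NumPy and SciPy assumed).

```python
import numpy as np
from scipy import stats

def mandel_test(x, y, alpha=0.05):
    """Mandel's fitting test: does a quadratic improve significantly on a straight line?"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    res1 = y - np.polyval(np.polyfit(x, y, 1), x)
    res2 = y - np.polyval(np.polyfit(x, y, 2), x)
    s2_1 = np.sum(res1 ** 2) / (n - 2)          # residual variance, linear model
    s2_2 = np.sum(res2 ** 2) / (n - 3)          # residual variance, quadratic model
    ds2 = (n - 2) * s2_1 - (n - 3) * s2_2       # variance difference
    F = ds2 / s2_2
    F_crit = stats.f.ppf(1 - alpha, 1, n - 3)
    return F, F_crit, F > F_crit                # True -> the quadratic model is preferred

# Fluorescence data of Table 5: the curvature should be detected
x = np.arange(11)
y = [0.2, 3.6, 7.5, 11.5, 15.0, 17.0, 20.4, 22.7, 25.9, 27.6, 30.2]
print(mandel_test(x, y))
```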

3. Calibration graphs: the question of intercept or non‐intercept

Absorption spectrometry is an important analytical technique, and, to be efficient, the calibration must be accomplished with known samples. Data for the calibration of an iron analysis, in which the iron is complexed with thiocyanate, are shown in Table 7. The absorption of the iron complex is measured and depicted versus iron concentration in ppm. The standard deviation of the regression line, sy/x, obtained from the experimental data, quantifies the quality of fit and the accuracy of the model to predict the values of y (the measured quantity), for a given value of x (the independent variable).

Concentration, Fe (ppm) | Absorbance | Concentration, Fe (ppm) | Absorbance
0.3644 | 0.0268 | 3.644 | 0.248
0.7288 | 0.0506 | 7.288 | 0.495
1.083 | 0.0783 | 32.8 | 1.52

Table 7.

Absorbance data for Fe3+ calibration (as Fe‐SCN2+ complex) [59].

The regression line is first computed by forcing it to pass through the coordinate origin (a0 = 0), since the absorbance should be directly proportional to the concentration (at zero concentration, a zero absorbance might be expected). However, the adjustment thus obtained is not very good. The representation of the residuals shows the pattern of signs + + + + + − (Figure 3, lower left). If we compute the regression line with intercept (Figure 3, upper right), the correlation coefficient increases from 0.990 8 to 0.994 7, but the pattern of non-random signs persists, that is, − − − + + − (Figure 3, lower right). What is a reasonable explanation? If the highest concentration point (32.8 ppm) is discarded, all the other points appear to fall on a straight line. However, this point cannot be discarded on the basis of its deviation from the best-fit line, because it is closer to the line than other calibration points in the series. As a matter of fact, the last point (32.8 ppm) defines where the line must pass: being so distant, it has a great influence (leverage) and pulls the least squares regression line in its direction. This non-robust behaviour is a general weakness of the least squares criterion. Very bad points [59] have a great influence just because their deviation from the true line, which enters squared, is very large. One has to be aware of dubious data, correcting obvious copying errors and adopting actions coherent with the identified causes. Improper recording of data (e.g. misreading responses or interchanging digits) is frequently a major component of the experimental error [121].

Figure 3.

Calibration curve and residuals for the Fe‐SCN2+ system without intercept (left) and with intercept (right) included in the model for all data compiled in Table 7.

A few high or low points [8] can alter the value of the correlation coefficient to a great extent. Larger deviations present at larger concentrations tend to influence (weight) the regression line more than smaller deviations associated with smaller concentrations, and thus the accuracy in the lower end of the range is impaired. It is therefore very convenient [122–124] to analyse the plotted data and to make sure that they cover uniformly (approximately equally spaced) the entire range of signal response from the instrument [85]. Data should be measured in random order (to avoid confusing non-linearity with drift). The individual solutions should be prepared from the same stock solution, thus avoiding the introduction of random errors from weighing small quantities for individual standards. Depending on the location of the outliers, the correlation coefficient may increase or decrease. In fact, a strategically situated point can make the correlation coefficient vary practically between −1 and +1 (Figure 4), so precautions should be taken when interpreting its value. However, points of influence (e.g. leverage points and outliers) (Table 8) are rejected only when there is an obvious reason for their anomalous [125] behaviour. The effect of outliers is greater as the sample size decreases. Duplicate measurements, careful scrutiny of the data while collecting them and testing discrepant results with available samples may help to solve problems [28] with outliers.

Gross errors | Caused by outliers in the measured variable or by extreme leverage points
Golden points | Specially chosen points which have been very precisely measured to extend the prediction capability of the system
Latently influential points | Consequence of a poor regression model
According to data location:
Outliers | Differ from the other points in the values of the y-axis
Leverage points | Differ from the other points in the values of the x-axis or in a combination of these quantities (in the case of multicollinearity)

Table 8.

Influential points [44].

Figure 4.

Influence of an anomalous result on the least squares method (solid line) and on the correlation coefficient.

If the regression analysis is made without the 32.8 ppm influential point, forcing the line to pass through the origin, the correlation coefficient reaches the value 0.999 91 (Figure 5, top left). The 32.8 ppm point was not considered because its deviation from the new line amounts to many standard deviations. Perhaps the problem observed with the 32.8 ppm point is due to the fact that sulphocyanide (thiocyanate) is not in enough excess to complex all the iron present. However, the inspection of the residuals (+ + + + −) shows systematic, non-random deviations (Figure 5, bottom left), which may indicate an incorrect or inadequate model. Systematic errors of analysis translate into (systematic) deviations from the fitted equation (negative residuals correspond to low estimated values, and positive residuals to high ones). An erroneous omission of the intercept term in the model may be the cause of this effect. The standard deviation of the regression line improves notably, from 0.0026 to 0.0017, when the intercept is introduced (Figure 5, top right) into the model (the correlation coefficient equals 0.999 97), the residual pattern now being random (− − + − +) (Figure 5, bottom right). The calibration is then appropriate and linear, at least up to 8 ppm. However, the intercept value, 0.0027, is of the same order of magnitude as the standard deviation, sy/x, of the regression line. A (minor) calibration problem may be apparent, for example, the spectrophotometer was not properly set to zero or the cuvettes were not conveniently matched.

Figure 5.

Calibration curve and residuals for the Fe‐SCN2+ system without intercept (left) and with intercept (right) included in the model (data in Table 7 with the exception of the 32.8 ppm point).

Residual analysis of small sample sizes has [126] some complications. Firstly, residuals are not totally distributed in an independent way from x, because the substitution of the parameters by the estimators introduces [10, 19] certain dependence. Secondly, a few points far from the bulk of the data may eventually condition the estimators, residuals and inferences.

4. Error is not in the data, but in the model: the CO2 vapour pressure versus temperature case

The linear variation of physical quantities is not a universal rule, although it is often possible to find a coordinate transformation [18] that converts non-linear data into linear ones. The vapour pressure, P, in atmospheres, of (liquid) carbon dioxide as a function of temperature, T, in kelvin, is not linear (Table 9). Carbon dioxide has found use in chemical analysis as a supercritical fluid for extracting caffeine from coffee. We may expect, on the basis of the Clausius-Clapeyron equation, to fit the data compiled in Table 9 to an equation of the form

Temperature (K) | Vapour pressure (atm) | Temperature (K) | Vapour pressure (atm)
216.5500 | 5.11023 | 266.4944 | 28.70169
222.0500 | 6.44393 | 272.0500 | 33.39684
227.6056 | 8.04301 | 277.6056 | 38.63636
233.1611 | 9.92107 | 283.1611 | 44.47469
238.7167 | 12.09853 | 288.7167 | 50.93903
244.2722 | 14.62303 | 294.2722 | 58.07022
249.8278 | 17.50817 | 299.8278 | 65.91589
255.3833 | 20.78797 | 304.1611 | 72.76810
260.9389 | 24.51007 | |

Table 9.

CO2 vapour pressure versus temperature data [59].

$\ln P = A + B/T$ (11)

This requires a transformation of the data. If we define

$Y = \ln P$ (12)

and

$X = 1/T$, (13)

this form is linear,

$Y = A + BX$ (14)

The resulting graph (Figure 6, middle, solid line), when examined, appears to be fine, as do the calculated statistics, and so there is at first no reason to suspect any problem. The results lead to a correlation coefficient of 0.999 988 76. This almost perfect fit is nevertheless very poor when attention is paid to the potential quality of the fit, as shown by the sinusoidal pattern of the residuals [+ + − + − + + + + + − −], which are incorporated in the figure alongside the resulting least squares regression line. As the details of the measurements are unknown, it is not possible to test for systematic error in the experiments. The use of an incorrect or inadequate model is the reason which explains in this case the systematic deviations. The Clausius-Clapeyron equation does not exactly describe the phenomenon when the temperature range is wide. Results similar to those shown in Figure 6 are also obtained by applying weighted linear regression using the weighting factors defined by [6, 7, 127–129]

Figure 6.

Top: CO2 vapour pressure against temperature data (top). Middle: representation of ln P against the reciprocal of temperature (Clausius‐Clapeyron equation) including the residuals plot. Bottom: CO2 vapour pressure as a function of temperature according to the expanded (MLR) model.

$w_i = \dfrac{1}{\left(\dfrac{\partial Y_i}{\partial y_i}\right)^2} = \dfrac{1}{\left(\dfrac{\partial \ln P_i}{\partial P_i}\right)^2} = P_i^2$ (15)

on the basis of the transformation used.

The error does not lie in the data then, but in the model. We may try to improve the latter by using a more complete form of the equation

$\ln P = A + B/T + C\ln T + DT$ (16)

The results now obtained (analysis by multiple linear regression) depicted in Figure 6 (bottom) are better than those obtained by using the single linear regression equation, with the residuals randomly distributed. Values of ln P may be calculated with an accuracy of 0.001 (or an accuracy level of 0.1%), as suggested by the standard deviation of the regression line obtained. In addition, as T is used as a variable, instead of its inverse, interpolation calculations are carried out in an easier way.
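A minimal sketch of the two fits, Eqs. (11) and (16), with the Table 9 data is given below, using ordinary least squares on the transformed variable ln P (NumPy assumed); the residual sign strings reproduce the contrast between the sinusoidal pattern of the two-parameter model and the more random pattern of the expanded model.

```python
import numpy as np

# CO2 vapour pressure data of Table 9 (T in K, P in atm)
T = np.array([216.5500, 222.0500, 227.6056, 233.1611, 238.7167, 244.2722,
              249.8278, 255.3833, 260.9389, 266.4944, 272.0500, 277.6056,
              283.1611, 288.7167, 294.2722, 299.8278, 304.1611])
P = np.array([5.11023, 6.44393, 8.04301, 9.92107, 12.09853, 14.62303,
              17.50817, 20.78797, 24.51007, 28.70169, 33.39684, 38.63636,
              44.47469, 50.93903, 58.07022, 65.91589, 72.76810])

# Two-parameter Clausius-Clapeyron model, Eq. (11): ln P = A + B/T
A2 = np.column_stack([np.ones_like(T), 1.0 / T])
coef2, *_ = np.linalg.lstsq(A2, np.log(P), rcond=None)
res2 = np.log(P) - A2 @ coef2

# Expanded model, Eq. (16): ln P = A + B/T + C ln T + D T
A4 = np.column_stack([np.ones_like(T), 1.0 / T, np.log(T), T])
coef4, *_ = np.linalg.lstsq(A4, np.log(P), rcond=None)
res4 = np.log(P) - A4 @ coef4

print("2-term residual signs:", "".join("+" if r > 0 else "-" for r in res2))
print("4-term residual signs:", "".join("+" if r > 0 else "-" for r in res4))
print("s(y/x):", np.sqrt(np.sum(res2 ** 2) / (T.size - 2)),
      np.sqrt(np.sum(res4 ** 2) / (T.size - 4)))
```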

The moral of this section is that there are not perfect models [130, 131], but models that are more appropriate than others.

5. The heteroscedastic data: HPLC calibration assay of a drug

In those cases in which some of the experimental data to be used in a given analysis are more reliable than others [6, 61, 63], gross errors may be involved when the conventional method of least squares is applied in a direct way. The assumption of uniform or regular variance of y may not be correct when the experimental measurements cover a wide range of x-values. There are two possible solutions to this non-constant, irregular, non-uniform or heteroscedastic variance problem: data transformation or weighted least squares regression analysis.

The weighted sum of squared residuals [132, 133, 64]

$Q_{min,w} = \left[\sum w_i r_i^2\right]_{min}$ (17)
$w_i = \dfrac{1}{\sigma_i^2}$ (18)

is minimized in the weighted least squares procedure. The idea underlying weighted least squares is to attribute the greatest worth [2, 40, 132, 66, 101, 135, 136] to the most precise data. The greater the deviation from homoscedasticity, the greater the profit that can be extracted from the use of the weighted least squares procedure. The homoscedasticity hypothesis is usually justified in analytical chemistry in the framework of calibration. However, when the range of abscissa values (concentrations) covers several orders of magnitude, for example, in the study of (calibration) drug concentrations in urine or in other biological fluids, the accuracy of the y-values is strongly dependent on the x ones. In those cases, the homoscedasticity requirement implied in simple linear regression is violated, making the introduction of weighting factors mandatory. Some typical cases of heteroscedasticity appear [137, 138] involving a constant relative standard deviation (Figure 7)

Figure 7.

Left: hypothetical HPLC response versus concentration for a typical serum. Right: examples of relationships between concentration of analyte (x), standard deviation (SD), and coefficient of variation (CV).

$RSD = \dfrac{\sigma_i}{\bar{x}}$ (19)

or a constant relative variance (radioactive counts, Poisson distribution). Photometric absorbances, which by Beer’s law cover a wide concentration range, like chromatographic analyses in certain conditions, tend to be heteroscedastic. Inductively coupled plasma emission spectrometry coupled to mass spectrometry (ICP-MS) requires weighted least squares estimates even when the calibration covers a relatively small concentration range. The standard deviation (absolute precision) of the measurements, σi, usually increases with the concentration xi, whereas the relative standard deviation (relative precision, RSD) decreases instead.

It is possible to derive [138] relationships between precision and concentration through the concentration range assayed, since chemical methods are applied to analytes found at varying concentrations. A number of different relationships [2, 139–142] have been proposed (Table 10) by different authors, and ISO 5725 gives [143] indications to assist in obtaining a given C = f(σi) relationship.

$\sigma_c = pC^k$ ($k = 0.5, 1, >1, \dots$)
$\sigma_c = p(C+1)^k$
$\sigma_c = p\,e^{qC}$
$\sigma_c = pC^k + q$
$\sigma_c = a_0 + a_1 C^q$ ($q = 1, 2$)
$\sigma_c = a_0 + a_1 C + a_2 C^2$
$\sigma_c = p y^k$
$\sigma_c = a_0 + a_1 y + a_2 y^2$

Table 10.

Relationship types between the standard deviation and the concentration of the analyte.

The advantages of the least squares method, despite it being a powerful tool, may be impaired if the appropriate weights are not included in the equations. The least squares criterion is highly sensitive to outliers, as we have seen in Figure 4. An undesirable paradox may often occur, consisting in the fact that the experimental data of worst quality contribute most to the estimation of the parameters. Although replication may be severely limited [15, 132], it has the advantage of providing a certain kind of robust regression [144]. The most common method of performing a weighted regression is to use weights reciprocal to the corresponding variances, that is,

$w_i = \dfrac{1}{s_i^2}$ (20)

where si² is the experimental estimate of σi². Eq. (20) warrants that, when replication is used, the lower weights correspond to the yi outliers. The incorporation of heteroscedasticity into the calibration procedure is recommended [145, 146] by several international organizations such as ISO 9169 and ISO/CD 13-752. The International Union of Pure and Applied Chemistry (IUPAC) includes the heteroscedasticity [147, 148] or non-constant variance topic in the calculation of the limits of detection and quantification.

The assumption of constant variance in the physical sciences may be erroneous [34, 149–157]. The data from a calibration curve (Table 11) relating the readings of an HPLC assay to the drug concentration in ng/mL in blood [60] are shown in Figure 8. A reasonable regression model for the mean values is y = α0 + α1x, as a first approximation. However, the variability of the response increases in a systematic way with increasing concentration level. This indicates that the assumption of constant variance of the response through the range of concentrations assayed does not hold. In fact, at the highest level of concentration, a very large response value is produced. There is no physical justification that allows excluding this value from the others in the analysis. Assuming in the first instance constant variance, least squares is used to obtain the estimated parameters a0 = 0.0033 and a1 = 0.0014, with sy/x = 0.0265. The representation of the residuals versus the values of x should show variability within a constant band if the model were appropriate. Note that, in Figure 9 (bottom left), the funnel or trumpet-shaped pattern indicates that the measurement error increases as the mean response does. The assumption of constant variance is thus not satisfied. On the other hand, the intercept value, 0.0033, is not in good agreement with the mean response value (13 replicates) at zero dose, 0.0021, which poses an additional problem. The result of ignoring the non-constant variance in this case is a poor fit of the model. The weighted linear regression straight-line model led instead (Figure 10, top) to the equation y = 0.0015x + 0.0021, the band of residuals now being rectangular (Figure 10, bottom).

Doses: 0 | 5 | 15 | 45 | 90
Response:
0.0016 | 0.0118 | 0.0107 | 0.106 | 0.106
0.0019 | 0.0139 | 0.0670 | 0.026 | 0.158
0.0002 | 0.0092 | 0.0410 | 0.088 | 0.272
0.0030 | 0.0033 | 0.0087 | 0.078 | 0.121
0.0042 | 0.0120 | 0.0410 | 0.029 | 0.099
0.0006 | 0.0070 | 0.0104 | 0.063 | 0.116
0.0006 | 0.0025 | 0.0170 | 0.097 | 0.117
0.0011 | 0.0075 | 0.0320 | 0.066 | 0.105
0.0006 | 0.0130 | 0.0310 | 0.052 | 0.098
0.0013 | 0.0050 | | |
0.0020 | 0.0180 | | |
0.0050 | | | |
0.0050 | | | |

Table 11.

Calibration data for an HPLC blood concentration (ng/mL) assay of a drug [60].
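A minimal sketch of the weighted fit described above, taking the weights from the replicate variances at each dose of Table 11 (Eq. (20)), is given below (NumPy assumed); the printed values should be close to those quoted in the text, although the original authors may have modelled the variances slightly differently.

```python
import numpy as np

# HPLC calibration data of Table 11 (dose in ng/mL, replicate responses)
data = {
    0:  [0.0016, 0.0019, 0.0002, 0.0030, 0.0042, 0.0006, 0.0006,
         0.0011, 0.0006, 0.0013, 0.0020, 0.0050, 0.0050],
    5:  [0.0118, 0.0139, 0.0092, 0.0033, 0.0120, 0.0070, 0.0025,
         0.0075, 0.0130, 0.0050, 0.0180],
    15: [0.0107, 0.0670, 0.0410, 0.0087, 0.0410, 0.0104, 0.0170, 0.0320, 0.0310],
    45: [0.106, 0.026, 0.088, 0.078, 0.029, 0.063, 0.097, 0.066, 0.052],
    90: [0.106, 0.158, 0.272, 0.121, 0.099, 0.116, 0.117, 0.105, 0.098],
}

x = np.concatenate([np.full(len(v), k, dtype=float) for k, v in data.items()])
y = np.concatenate([np.asarray(v) for v in data.values()])

# Weights from the replicate variances at each dose, w_i = 1/s_i^2 (Eq. (20))
var_by_dose = {k: np.var(v, ddof=1) for k, v in data.items()}
w = np.array([1.0 / var_by_dose[xi] for xi in x])

def line_fit(x, y, w):
    xbar = np.sum(w * x) / np.sum(w)
    ybar = np.sum(w * y) / np.sum(w)
    a1 = np.sum(w * (x - xbar) * (y - ybar)) / np.sum(w * (x - xbar) ** 2)
    return ybar - a1 * xbar, a1                 # intercept, slope

print("unweighted:", line_fit(x, y, np.ones_like(x)))   # compare with a0 = 0.0033, a1 = 0.0014
print("weighted:  ", line_fit(x, y, w))                 # compare with y = 0.0015x + 0.0021
```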

Figure 8.

Calibration curve obtained by single linear regression (top) and corresponding residuals plot (bottom).

Figure 9.

Residuals in the form of funnel (left) and ascending (right).

Figure 10.

Response versus concentration obtained by weighted linear regression (top) and residuals plot for this model (bottom).

The weighted least squares method requires a higher number of replicates than the conventional least squares method. The estimate of the minimum number of replicates varies between six and 20, according to different authors. In practice, it is often difficult to reach such a high level of replication [2, 15] for different reasons, such as the cost or availability of calibration standards and reagents, the time demands of previous operations, or the recording of the chromatograms.

In order to apply weighted least squares analysis, it is mandatory to assign weighting factors to the corresponding observations. In fact, the weighting factor is related to the information contained in the yi value, being proportional to the reciprocal of the variance of yi. The results of a single-trial assay without additional information seldom contain enough information to model the variance in a satisfactory way. Fortunately, the values of the independent variable can usually be chosen by the researcher, and the corresponding values of the dependent variable thus replicated.

The general phenomenon of non-constancy of variance is called, as we have previously seen, heteroscedasticity. It can be addressed [2, 15] by the weighted least squares method. A second method of dealing with heteroscedasticity is to transform the response [18] so that the variance of the transformed response is constant, proceeding then in the usual way, as in the following.

6. Transforming data: preliminary investigation of a dose‐response relationship

The non-linear relationship between two variables may sometimes be handled as linear by means of [158] a transformation. A transformation consists in the application of a mathematical function to a set of data. The transformation finally leading to a straight-line fit to the data can be carried out on one variable or on both. The transformation of data is sometimes dismissed as a device which statisticians use, a conviction founded on the preconceived idea that the natural scale of measurement [159] is somehow sacrosanct. This is not the case, and in fact some measurements, for example, those of pH, are actually logarithmic, transformed values [160].

As much as the analyst wants the mould of nature to be linear, the truth is often simply found in curves [118, 160]. Real-world systems sometimes do not fulfil the essential requirements for a rigorous or even an approximate validity of the method of analysis. In many cases, a transformation (change of scale) can be applied to the experimental data [18] in order to carry out a conventional analysis. Although it may seem that the best way to estimate the coefficients of a non-linear equation is the direct use of a non-linear regression (NLR) program, NLR itself [161] is not without drawbacks and problems.

The data of turbidimetric measurements of the growth response of Lactobacillus leichmannii to vitamin B12 [61] provide a good illustration of a preliminary investigation of dose‐response relationships. Table 12 shows the responses to the eight different doses of vitamin B12 measured in six independent tubes per dose, which are depicted in Figure 11.

Doses (ng/tube): 0.23 | 0.35 | 0.53 | 0.79 | 1.19 | 1.78 | 2.67 | 4
Response:
0.15 | 0.28 | 0.36 | 0.51 | 0.68 | 0.85 | 1.06 | 1.21
0.14 | 0.20 | 0.36 | 0.53 | 0.63 | 0.80 | 0.91 | 1.22
0.19 | 0.23 | 0.34 | 0.54 | 0.64 | 0.71 | 1.09 | 1.29
0.19 | 0.25 | 0.37 | 0.45 | 0.61 | 0.85 | 0.93 | 1.24
0.17 | 0.23 | 0.33 | 0.57 | 0.65 | 0.94 | 1.09 | 1.18
0.16 | 0.23 | 0.38 | 0.49 | 0.68 | 0.83 | 1.12 | 1.24

Table 12.

Microbiological assay of vitamin B12 [61, 62].

Figure 11.

Representation of the dose‐response data for the microbiological assay of vitamin B12.

The transformation [62]

$z = \log x$ (21)

can be used (Figure 12). The inspection of Figure 12, however, suggests the existence of a marked curvature. The graph of the residuals, the deviation of each point from the model, indicates that the straight line is incorrect, because of the observed systematic pattern. There is a tendency towards curvature, as the residuals are not randomly distributed around zero. It should be assumed that the model is susceptible to improvement, requiring either higher-order additional terms or a transformation of the data.

Figure 12.

Top: fitting a straight line to response versus logarithm of dose (microbiological assay of vitamin B12). Bottom: plot of the residuals against the logarithm of the doses for the straight‐line model.

If a second-degree polynomial is fitted to the response data as a function of the logarithm of the dose, the fit seems adequate to the naked eye (Figure 13, top). The representation of the residuals as a function of the abscissa values (Figure 13, bottom), however, adopts a funnel shape. The non-random pattern of residuals carries the message that the assumption of homogeneous (regular or constant) variance is not satisfied, which would require the application of the weighted least squares method rather than simple linear regression.

Figure 13.

Top: fitting a second-degree polynomial to the response versus logarithm of dose (microbiological assay of vitamin B12). Bottom: plot of the residuals against the logarithm of the dose for the second-degree polynomial model.

The shape of Figure 13 (top) suggests a simple possibility, that of a transformation [43] also in the response

$u = \sqrt{y}$ (22)

A simple inspection of Figure 14 (top) now shows that the linear regression is valid throughout the entire range. Transformations to achieve homogeneity of variance and to achieve normality (Tables 13 and 14) go together (hand in hand), and fortunately both postulates are then often (almost) simultaneously fulfilled on applying an adequate transformation.
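A minimal sketch of the two fits discussed above, a straight line in y versus log x and in √y versus log x, using the Table 12 data, is given below (NumPy assumed).

```python
import numpy as np

# Microbiological assay of vitamin B12 (Table 12): 8 doses x 6 replicate tubes
doses = np.array([0.23, 0.35, 0.53, 0.79, 1.19, 1.78, 2.67, 4.0])
resp = np.array([
    [0.15, 0.28, 0.36, 0.51, 0.68, 0.85, 1.06, 1.21],
    [0.14, 0.20, 0.36, 0.53, 0.63, 0.80, 0.91, 1.22],
    [0.19, 0.23, 0.34, 0.54, 0.64, 0.71, 1.09, 1.29],
    [0.19, 0.25, 0.37, 0.45, 0.61, 0.85, 0.93, 1.24],
    [0.17, 0.23, 0.33, 0.57, 0.65, 0.94, 1.09, 1.18],
    [0.16, 0.23, 0.38, 0.49, 0.68, 0.83, 1.12, 1.24],
])

z = np.log10(np.tile(doses, (6, 1))).ravel()      # Eq. (21): z = log x
y = resp.ravel()
u = np.sqrt(y)                                    # Eq. (22): u = sqrt(y)

for label, response in (("y vs log x", y), ("sqrt(y) vs log x", u)):
    a1, a0 = np.polyfit(z, response, 1)
    res = response - (a0 + a1 * z)
    s_yx = np.sqrt(np.sum(res ** 2) / (z.size - 2))
    print(f"{label}: slope {a1:.3f}, intercept {a0:.3f}, s(y/x) {s_yx:.4f}")
```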

Data type | Transformation
Poisson (counts), $y$ | $\sqrt{y}$
Small counts, $y$ | $\sqrt{y+1}$ or $\sqrt{y} + \sqrt{y+1}$
Binomial ($0 < P < 1$) | $\arcsin\sqrt{P}$
Variance = (mean)$^2$ | $\ln y$
Correlation coefficient | $0.5[\ln(1+r) - \ln(1-r)]$

Table 13.

Transformations to correct for homogeneity and approximate normality [18, 162].

Estimated relationship | $\alpha$ | $\lambda = 1 - \alpha$ | Transformation
$s = k\hat{\bar{y}}^{2}$ | 2 | −1 | Reciprocal
$s = k\hat{\bar{y}}^{3/2}$ | 3/2 | −1/2 | Inverse square root
$s = k\hat{\bar{y}}$ | 1 | 0 | Logarithmic
$s = k\hat{\bar{y}}^{1/2}$ | 1/2 | 1/2 | Square root
$s = k$ | 0 | 1 | Without transformation

Table 14.

Transformations to stabilize variance [18, 158]: $W = (y^\lambda - 1)/\lambda$ ($\lambda \neq 0$); $W = \ln y$ ($\lambda = 0$).

Figure 14.

Top: fitting a straight line to the transformed data, square root of the response against the logarithm of the dose (microbiological assay of vitamin B12). Bottom: plot of the residuals versus the logarithm of the doses for the straight‐line model applied to the data transformed in both axes.

The stabilization of variance usually takes precedence over improving normality [160]. As stated by Acton [86]: ‘The gods who favour statisticians have frequently ordained that the world be well behaved, and so we often find that transformation to obtain one of these desiderata in fact achieves them all (well, almost achieves them!)’.

Linear regression is a linear (in the parameters) modelling process. However, non-linear terms may be introduced into the linear mathematical context by performing a transformation [162, 163] of the variables (Table 15). Note that when a transformation is used, a transformation-dependent weight (Table 16) should be used (in addition to any weight based on replicate measurements). When a non-linear function is capable of being transformed into a linear one, it is called ‘intrinsically linear’. Non-linear functions that cannot be transformed into linear ones are instead called ‘intrinsically non-linear’.

Function | Formula | Transformation | Linear form
Power function | $y = \alpha x^{\beta}$ | $y' = \log y$; $x' = \log x$ | $y' = \log\alpha + \beta x'$
Exponential growth model | $y = \alpha e^{\beta x}$ | $y' = \log y$ | $y' = \log\alpha + \beta x$
Logarithmic | $y = \alpha + \beta\log x$ | $x' = \log x$ | $y = \alpha + \beta x'$
Hyperbolic | $y = \dfrac{x}{\alpha x - \beta}$ | $y' = 1/y$; $x' = 1/x$ | $y' = \alpha - \beta x'$
Logit | $y = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$ | $y' = \log\left(\dfrac{y}{1-y}\right)$ | $y' = \alpha + \beta x$

Table 15.

Linearizable non‐linear functions [18, 162].

Transformation | Weighting factor (*)
$1/y$ | $y^4$
$\ln y$ | $y^2$
$y^2$ | $1/(4y^2)$
$e^y$ | $1/e^{2y}$
logit | $y^2(1-y)^2$

Table 16.

Weighting factors associated with a given transformation [2, 127, 128].

(*) In units of $\sigma_0^2/\sigma_y^2$; $\sigma_0^2$ is a proportionality factor, that is, the variance of a function of unit weight [2].


7. The variable that has not yet been discovered: the solubility of diazepan in propylene glycol

The study of the solubility of diazepam in mixed solvents [28] requires a Beer’s law plot of a set of data corresponding to the solubility of diazepam in propylene glycol. The experimental data are shown in Table 17.

C (mg/mL) | T (min) | A | C (mg/mL) | T (min) | A
16.0760 | 0.00 | 1.799 | 12.8608 | 87.50 | 1.481
3.2152 | 6.00 | 0.335 | 12.8608 | 92.75 | 1.503
6.4304 | 10.50 | 0.700 | 12.8608 | 97.75 | 1.522
12.8608 | 21.50 | 1.487 | 16.0760 | 102.75 | 1.868
6.4304 | 33.50 | 0.670 | 12.8608 | 117.25 | 1.508
9.6500 | 39.25 | 1.068 | 9.6500 | 122.25 | 1.108
16.0760 | 45.75 | 1.840 | 9.6500 | 130.50 | 1.109
9.6500 | 50.75 | 1.088 | 9.6500 | 135.75 | 1.128
16.0760 | 56.75 | 1.842 | 6.4304 | 141.00 | 0.720
3.2152 | 67.25 | 0.358 | 6.4304 | 146.25 | 0.719
6.4304 | 71.75 | 0.703 | 3.2152 | 150.75 | 0.349
16.0760 | 77.75 | 1.869 | 3.2152 | 155.75 | 0.367
3.2152 | 82.50 | 0.345 | | |
3.215282.500.345

Table 17.

Solubility of diazepam in propylene glycol (absorbance as a function of concentration and time) [28].

The relationship obtained between absorbance and concentration is (Figure 15)

Figure 15.

Top: absorbance as a function of concentration for the solubility of diazepam. Middle: plot of the residuals as a function of the concentration. Bottom: plot of the residuals as a function of the measurement time (measurement order).

$A = 0.11767\,C - 0.003568$ (23)

These data can be used to corroborate the previously made statement that the correlation coefficient is not necessarily a measure of the suitability of the model. The R2‐value of the above equation is 0.998 (r = 0.999). Many researchers would settle for this, but they would be wrong.

In spite of the high correlation coefficient (r = 0.999), when the residuals are represented as a function of the numerical order in which the samples were measured, we obtain Figure 15 (bottom). The pattern obtained is not random: the residuals show a trend with a positive slope. This behaviour is indicative of a situation in which the assumption of independence is not satisfied. The slope in a representation of the residuals as a function of the order of measurement (time) indicates that a linear time term must be included in the model.

Figure 16.

Top: residuals as a function of concentration for the extended model including the measurement time. Bottom: residuals as a function of time (order of measurement) for the model with the time included.

When time is included in the model, Eq. (24) results,

$A = -0.070193 + 0.118394\,C + 0.000336936\,t$ (24)

giving rise to a value of R2 equal to 0.999.

When the residuals are calculated for this model and plotted as a function of the concentration, a graph similar to that of Figure 15 (middle) is obtained (Figure 16, top). However, if the residuals of this model are represented as a function of time (which reflects the order in which the samples were measured), the pattern of Figure 16 (bottom) results, in which it is observed that the independence of the error has been accommodated (compare Figure 15, bottom) and the fit has improved, although it could probably be improved even further.
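A minimal sketch of the two fits, Eqs. (23) and (24), using the Table 17 data as listed is given below (NumPy assumed); the residual-versus-time drift of the concentration-only model should largely disappear once the time term is included.

```python
import numpy as np

# Solubility/Beer's law data of Table 17: concentration C (mg/mL), time t (min), absorbance A
C = np.array([16.0760, 3.2152, 6.4304, 12.8608, 6.4304, 9.6500, 16.0760, 9.6500,
              16.0760, 3.2152, 6.4304, 16.0760, 3.2152, 12.8608, 12.8608, 12.8608,
              16.0760, 12.8608, 9.6500, 9.6500, 9.6500, 6.4304, 6.4304, 3.2152, 3.2152])
t = np.array([0.00, 6.00, 10.50, 21.50, 33.50, 39.25, 45.75, 50.75, 56.75, 67.25,
              71.75, 77.75, 82.50, 87.50, 92.75, 97.75, 102.75, 117.25, 122.25,
              130.50, 135.75, 141.00, 146.25, 150.75, 155.75])
A = np.array([1.799, 0.335, 0.700, 1.487, 0.670, 1.068, 1.840, 1.088, 1.842, 0.358,
              0.703, 1.869, 0.345, 1.481, 1.503, 1.522, 1.868, 1.508, 1.108, 1.109,
              1.128, 0.720, 0.719, 0.349, 0.367])

# Model without time, Eq. (23): A = a0 + a1*C
X1 = np.column_stack([np.ones_like(C), C])
b1, *_ = np.linalg.lstsq(X1, A, rcond=None)
res1 = A - X1 @ b1

# Extended model with time, Eq. (24): A = b0 + b1*C + b2*t
X2 = np.column_stack([np.ones_like(C), C, t])
b2, *_ = np.linalg.lstsq(X2, A, rcond=None)
res2 = A - X2 @ b2

# Drift of residuals with measurement order: the slope should shrink
# (essentially to zero for the second model, since t is now a regressor)
print("residual-vs-time slope, C only  :", np.polyfit(t, res1, 1)[0])
print("residual-vs-time slope, C and t :", np.polyfit(t, res2, 1)[0])
```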

The analysis in time order demonstrates the significant fact that a representation of the residuals allows the observation of an effect of time that otherwise would not have been perceived. This is possible in the case of diazepam solubility because the researcher was careful to record the time at which the samples were measured.

The appearance of a pattern in the residuals as a function of time in a study of Beer’s law could indicate that some contaminant is interfering, that the light source is decaying, or perhaps that it has not yet warmed up. The pattern of the residuals indicates whether there is a time-dependent variable, but not the reason for that dependency, which must be ascertained separately.

8. Nickel by atomic absorption: all models are wrong

Nickel(II) nitrate hexahydrate analytical-grade reagent (Merck) is used to prepare a standard solution of 1 g/L Ni. An amount of 5.0058 g of the salt is weighed on the analytical balance and transferred to a 1-L volumetric flask with ultrapure water. From this solution containing 1000 mg/L, a working solution containing 125 mg/L is prepared. Appropriate volumes of this solution (in triplicate) are added to 25-mL volumetric flasks to obtain the calibration curve, diluting with ultrapure water. The measurements are carried out in an ‘Analyst 200 Atomic Spectrometer’ operating in absorption mode with a Cu-Fe-Ni multi-element Lumma lamp (Perkin Elmer), at 232 nm, with an air-acetylene flame. The absorbances obtained, given below, are higher than those described by Perkin Elmer [164]. The measurements depend, for example, on the flow of the nebulizer system, which differs in each case.

Absorbance data (in arbitrary units) for triplicate aqueous solutions of Ni2+ in mg/L (ppm) are compiled in Table 18. An attempt has been made to fit Eq. (2) as well as third-degree and fourth-degree polynomial models (Figure 17; left panels show the mean values and right panels the individual values), observing that as the degree of the polynomial increases, the goodness of fit increases, although the residuals still show a pattern. There are no perfect models, but models more appropriate than others [165, 166]. It is possible to use rational-form polynomials with the SOLVER function of Excel. Even so, the residuals show a pattern similar to that presented when a fourth-degree polynomial is fitted to the data.

Ni2+ (ppm) | Absorbance (arbitrary units) | Ni2+ (ppm) | Absorbance (arbitrary units)
2.5 | 0.217, 0.207, 0.226 | 17.5 | 0.743, 0.742, 0.744
5.0 | 0.399, 0.396, 0.389 | 20.0 | 0.767, 0.767, 0.771
7.5 | 0.523, 0.513, 0.519 | 22.5 | 0.787, 0.786, 0.789
10.0 | 0.618, 0.615, 0.612 | 25.0 | 0.808, 0.813, 0.807
12.5 | 0.672, 0.664, 0.664 | 27.5 | 0.820, 0.821, 0.824
15.0 | 0.713, 0.715, 0.707 | 30 | 0.835, 0.835, 0.831

Table 18.

Atomic absorption spectrophotometry calibration data of nickel(II).
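The rational-form fit mentioned above can also be sketched outside Excel; the particular form y = (a + bx)/(1 + cx) below is an illustrative assumption (not necessarily the function used by the authors), fitted with SciPy’s curve_fit to the Table 18 data and compared with polynomials of increasing degree.

```python
import numpy as np
from scipy.optimize import curve_fit

# Triplicate absorbances of Table 18 (Ni2+ in ppm)
conc = np.repeat([2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0, 22.5, 25.0, 27.5, 30.0], 3)
absorb = np.array([0.217, 0.207, 0.226, 0.399, 0.396, 0.389, 0.523, 0.513, 0.519,
                   0.618, 0.615, 0.612, 0.672, 0.664, 0.664, 0.713, 0.715, 0.707,
                   0.743, 0.742, 0.744, 0.767, 0.767, 0.771, 0.787, 0.786, 0.789,
                   0.808, 0.813, 0.807, 0.820, 0.821, 0.824, 0.835, 0.835, 0.831])

def rational(x, a, b, c):
    """Illustrative rational form y = (a + b*x)/(1 + c*x); an assumption, not the authors' model."""
    return (a + b * x) / (1 + c * x)

popt, _ = curve_fit(rational, conc, absorb, p0=(0.0, 0.1, 0.1))
res_rat = absorb - rational(conc, *popt)

# Compare with polynomial fits of increasing degree
for k in (1, 2, 3, 4):
    res_poly = absorb - np.polyval(np.polyfit(conc, absorb, k), conc)
    print(f"degree {k}: signs {''.join('+' if r > 0 else '-' for r in res_poly)}")
print("rational :", "".join("+" if r > 0 else "-" for r in res_rat))
```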

Figure 17.

Atomic absorption spectrophotometry nickel(II) calibration data.

9. Final comments

Calibration is an essential part of every quantitative analytical method, with the exception of the primary methods of analysis (isotope dilution mass spectrometry, coulometry, gravimetry, titrimetry and a group of colligative methods). The correct performance of calibration is a vital part of method development and validation. Parameter estimation models are often employed to obtain information concerning chemical systems, forming in this way a fundamental part of analytical chemistry. In those cases in which a wrong equation is fitted to the data, the form of the residual plot contains useful information which helps to modify and improve the model in order to get a better explanation of the data. Examples extracted from the literature show how residual plots reveal any violation of the assumptions severe enough to deserve correction. As a matter of fact, some authors [12, 25, 28, 59, 96] are in favour of using residuals graphically to evaluate the assumptions inherent in the least squares method.

If there is a true linear relationship between the variables, with the error symmetrically distributed, the signs of the residuals should be distributed at random between plus and minus, with an equal number of each. A plot of residuals allows checking for systematic deviations between data and model. Systematic deviations may indicate either a systematic error in the experiment or an incorrect or inadequate model. A curvilinear pattern in the residuals plot shows that the equation being fitted should possibly contain higher-order terms to account for the curvature. A systematic linear trend (descending or ascending) may indicate that an additional term in the model is required. The ‘fan-shaped’ residual pattern shows that the experimental error increases with the mean response (heteroscedasticity), so the constant variance assumption is inappropriate. This phenomenon may be approached by the weighted least squares method or by transforming the response. Time-order analysis sometimes reveals the noteworthy fact that a residual plot permits the observation of a time effect that otherwise might not have become known. However, note that there are no perfect models, but models that are more suitable than others.

Many more sophisticated methods have been devised (standardized, studentized, jack‐knife, predicted and recursive residuals). However, in spite of their worth and importance they are considered beyond the scope of this chapter, devoted to a primer on residuals. The analyses presented in this chapter were mainly done using an Excel spreadsheet.

© 2017 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
