8 Application of Principal Components Regression for Analysis of X-Ray Diffraction Images of Wood

We report on the use of Principal Component Regression (PCR), a least squares calibration method employing Principal Component Analysis (PCA) for the analysis of large sets of images obtained by wide angle X-ray diffractometry (WAXD) of wood. PCR multivariate models were created for prediction of microfibril angle (MFA), an important wood structural parameter. Multivariate regression techniques have previously been applied to Xray diffraction data for a variety of analytical problems. These include analysis of complex mineral mixtures (Karstang and Eastgate 1987) and for automated high-throughput analysis of powder diffraction profiles obtained by azimuthal averaging of 2D diffraction patterns (Gilmore, Barr et al. 2004). Multivariate classification from X-ray diffraction data has also been used for discrimination of insulin microcrystals (Norrman, Stahl et al. 2006) and brucite particle morphology (Matos, Xavier et al. 2007). The models created in this study used pixels from across the full area of the diffraction image as predictor variables, a technique which has not been found applied to diffraction data in the literature. A related technique, multivariate analysis of congruent images (MACI), was introduced by Eriksson, Wold et al. (2005) for discrimination and quantification of events within video image sequences. MACI uses wavelet analysis to decompose images and obtain predictive information which is correlated with frame-by-frame differences and that can describe properties such as the movement of an object or change in composition of a feature.


Introduction
We report on the use of Principal Component Regression (PCR), a least squares calibration method employing Principal Component Analysis (PCA) for the analysis of large sets of images obtained by wide angle X-ray diffractometry (WAXD) of wood.PCR multivariate models were created for prediction of microfibril angle (MFA), an important wood structural parameter.Multivariate regression techniques have previously been applied to Xray diffraction data for a variety of analytical problems.These include analysis of complex mineral mixtures (Karstang and Eastgate 1987) and for automated high-throughput analysis of powder diffraction profiles obtained by azimuthal averaging of 2D diffraction patterns (Gilmore, Barr et al. 2004).Multivariate classification from X-ray diffraction data has also been used for discrimination of insulin microcrystals (Norrman, Stahl et al. 2006) and brucite particle morphology (Matos, Xavier et al. 2007).The models created in this study used pixels from across the full area of the diffraction image as predictor variables, a technique which has not been found applied to diffraction data in the literature.A related technique, multivariate analysis of congruent images (MACI), was introduced by Eriksson, Wold et al. (2005) for discrimination and quantification of events within video image sequences.MACI uses wavelet analysis to decompose images and obtain predictive information which is correlated with frame-by-frame differences and that can describe properties such as the movement of an object or change in composition of a feature.This chapter describes four predictive PCR models, two that are species-specific, and two that incorporate multiple species.Differences between these models, and their relative accuracy, will be discussed.SilviScan will be described and the issues relating to management of the large amounts of data collected by the system will be examined.

Multivariate calibration
Principal component regression is a widely used multivariate calibration technique for creating predictive models by relating multivariate sets of measurements (X) to sets of dependent variables (y) (Hoskuldsson 1995).PCR uses the properties of principal 146 component analysis to reduce the dimensionality of the original data to form linearly independent factors (T) required for least squares regression analysis.In the case of X-ray diffraction analysis, modelling of a dependent variable using PCR relies on the the proportionality that exists between diffracted X-ray energy and the quantity of a compound or structural feature present within a sample.If there is a linear relationship with the predictand, then PCR will be a good choice for modelling the variable using X-ray diffraction images.A linear relationship between a dependent variable and diffraction data is valid providing the diffraction experimental parameters are within prescribed limits of a materials density (these limits are dependent on X-ray wavelength) and providing all other experimental conditions are kept constant (Tripp and Conrad 1972;Cave and Robinson 1998).
PCR involves three main steps.
Step 1 is data reduction using PCA to obtain scores (T) and normalised eigenvectors (P) (Equation 1 and 2).
Step 2 uses T as the independent variables matrix in the inverse least squares (ILS) calibration (Kramer 1998) (Equation 3 and 4), and step 3 produces the regression coefficients (in terms of the original X variables) required for predicting the properties () y of unknown samples (Equation 5).PCA is represented by Equation 1 (Brereton 2003).The eigenvector and score pairs are termed principal components (PCs) and are numbered from PC1 to PCn, in decreasing order of the variance associated with the vectors in the original data matrix.The matrix R is the residual data matrix which is the variance not associated with the principal components chosen, and which would generally describe noise.Equation 2 represents the projection of the eigenvectors onto the original data to obtain the reduced data (scores) T matrix: where m is the number of samples, n is the number of variables -in our case the total number of pixels in the image, c is the number of components and ' indicates the transpose.For analysis involving images, the term 'eigenimage' is used to denote the 2D version of the eigenvector that is obtained from PCA.
The equations required for the inverse least squares regression calibration step are: where y m,1 is the vector of dependent variables, b and b 1,c are the true and best estimates for the least squares coefficients, β 1,n are the regression coefficients in terms of the original X variables, ŷ are the predicted (best estimates of) y values from any other set of data and est y is the best estimation of the average y data.
The ILS calibration requires that the number of factors used (c) is less than or equal to the number of samples (m).Once the calibration is performed, y data may be predicted from unknown samples (X) using the regression coefficients β of the original X variables and the best estimation ( est y ) of the average y data.The est y value can be obtained using an added column vector of 1's in the T matrix and will equal the average of the y dataif the X matrix was mean centred prior to the PCA step.
The scores obtained from PCA are orthogonal and rule out modelling techniques that use cross-terms (T m,c1 × T m,c2 ).However, both cross and quadratic (X m,n × X m,n ) terms may be introduced into the initial X data, which in the case of large data sets comes at the risk of excessive memory use.Variables can be scaled prior to the PCA step by one of a number of techniques such as mean centering (subtracting the mean of a variable), variance scaling (equalising variance) and range scaling (normalising data to lie within a nominated range).Scaling techniques are used to improve the predictive capability of the data (Berg, 2006) and can highlight low variance variables that may otherwise be left in the R matrix during the PCA step.Vector normalisation (VN) is another scaling technique that normalises between samples (rather than variables) and involves the following transformation: where xvn is the normalised vector, x the original vector and xi the ith variable of the sample.
Removal of high leverage samples from the model data set before inclusion in the model training data is good practice as they influence the model in a disproportionate fashion.The leverages for each sample are the values of the diagonal of the 'hat' matrix (H) as seen in equation 8 (Dobson, 2002).
A useful measure of the accuracy of a multivariate regression model is the root mean square standard error of prediction (RMSEP) (Faber, Song et al. 2003): and is a measure of the average error between a test set of reference samples and a models predicted values ( ŷ ).This measure is biased high when using imprecise reference values (y) and can be corrected if the standard deviation of the reference data is known (Bro, Rinnan and Faber, 2005).Tests are available for determination of the per-sample prediction error, based on the leverage of the sample (see Bro, Rinnan and Faber, 2005).
The optimum number of components to keep in a model can be determined from the minimum in RMSEP versus the number of components (Martens and Naes 1989).Another common measure of prediction error is RMSEP from cross calibration.Cross calibration requires generation of multiple models with exclusion of a small number of samples (usually one) from each model and calculating the error as the average of the RMSEP of the excluded samples for all models.This technique is not practical or necessary when data sets are large and in our case where there would be a large number of models to be made (each potentially requiring PCA on the subset of data), the computational requirements would become excessive.

X-ray diffraction of wood and the SilviScan system
Important factors affecting the diffraction patterns of wood samples include cell shape and microfibril angle (MFA) (Evans 1999;Lichtenegger, Reiterer et al. 2001), cellulose crystallite dimensions and cellulose crystallinity (Wada, Okano et al. 1997;Washusen and Evans 2001;Andersson, Serimaa et al. 2003;Garvey, Parker et al. 2005).These properties have been shown to influence many of the bulk properties of wood including modulus of elasticity (MoE) for structural timber (Evans 2006) and important properties such as pulp yield, a useful metric for production capacity in the pulping industry (Downes, Evans et al. 2003).The fast and accurate estimation of these attributes is useful in large scale surveys of wood properties for genetic selection trials and forestry management (Lindstrom, Harris et al. 2004;Thumma, Nolan et al. 2005).
Wide angle x-ray diffractometry has been developed as an integral part of the highthroughput instrumentation known as SilviScan, now in its 3rd generation.SilviScan consists of three complimentary techniques, transverse (radial-tangential) surface microscopy and image analysis, X-ray densitometry and WAXD.These three techniques are used to analyse 2mm by 7mm radial sections, typically cut from 12mm diameter increment cores taken from standing trees (Fig. 1).The interaction of an X-ray beam with the wood fibre structure gives diffraction images that contain information about the helical orientation of the fibres (MFA).A geometrical relationship between MFA and the variance of the observed 002 azimuthal peaks (Φ direction, Fig. 1) is currently used by SilviScan software to determine MFA (Evans 1999).In addition, 3D cellular orientation is estimated (roll and pitch relative to the x-ray beam).Roll is calculated from the rotation of the diffraction pattern around the beam direction and pitch is calculated from the distortion of equatorial reflections towards the meridional axis.Pitch (fibre tilt in the direction of the X-ray beam) is important as it causes asymmetric peak broadening and an overestimation of MFA (Stuart and Evans 1995;Evans, Stringer et al. 2000), which has to be corrected for during SilviScan analysis.This work demonstrates the applicability of multivariate statistical calibration methods for determination of MFA from WAXD images obtained on SilviScan.The same method can be used to extract information on fibre orientation, crystallite size and any other property that affects the images.Occasionally, when MFA is very high, the diffraction patterns have almost no azimuthal intensity variation, making them very difficult to analyse by any method that relies on the determination of azimuthal peak widths.The consequences of this problem are data 'dropouts' and an example of them are seen later in the results section.Thus overcoming this problem through calibrated multivariate techniques for estimation of MFA is of interest.

SilviScan
The SilviScan 3 instrument scans samples in three stages.First, automated microscopy and image analysis of the polished top transverse surface (radial-tangential plane) gives growthring boundary orientation, ray cell orientation, and average radial and tangential fibre diameters binned in 25 micron intervals.Second, automated X-ray scanning densitometry (customised 1392×1040 pixel camera, Photonic Science, UK) gives density variation binned in 25 micron intervals and normalised to the actual average sample density obtained by independent gravimetric analysis.Third, automated scanning X-ray diffractometry (customised 1392×1040 pixel intensified camera, Photonic Science, UK) using a nickelfiltered Cu source and 200 micron diameter capillary-focussed beam, gives microfibril angle (MFA) and fibre axis orientation in 100 micron steps.MFA was corrected for fibre orientation in the beam direction.

Wood samples
All wood samples (two softwoods and one hardwood) were taken from previous projects for which the SilviScan data was available.The samples were in the form of 2mm wide (tangential dimension) and 7mm high (longitudinal dimension) pith-to-bark (radial) strips.Samples varied in length from 20 mm -200 mm.30-year-old Pinus caribaea var.hondurensis ( P. caribaea) from Beerburrum, in south-east Queensland, Australia.Twelve radial samples from an original 520 (65 trees, 8 heights per tree) were selected for high resolution analysis on SilviScan-3 at the 4m sampling height.18486 diffraction images were taken of these samples, 10000 used for creating the models and 6925 images were used for model verification.
23-year-old Pinus radiata (P .radiata)from Tallaganda, Canberra, Australia were taken from a set of previously well characterised samples as 2mm tangential, 7 mm longitudinal and 20 mm radial strips (Schimleck 2002).The resulting diffraction dataset consisted of 11822 images from 116 paired ends of the same sample.
Eucalyptus delegatensis (E.delegatensis) from East Gippsland, Victoria, Australia were taken as an example of a hardwood (Evans and Ilic 2001).The resulting diffraction image dataset consisted of 10638 images from 102 paired ends of the same sample.

Data analysis
Principal component analysis was undertaken using in-house software developed using C++ and OpenCL and making use of high end graphics processing units (Bowden 2010).Images were reduced to 128 by 128 pixels by selecting 1 out of 8 pixels in each dimension.This resulted in a 16384 element vector for each image in the data sets.Inverse least squares software was written in-house and the combined system used hierarchical data format (HDF) files for data storage and management.HDF Viewer version 2.7 and Microsoft Excel 2003 was used for visualisation and data filtering (HDF 2000).
The modelling procedures involved the following: Two datasets were used to create the models, one from the P. caribaea data set (10000 images, ~7 wood samples) and another from a combination of P. radiata (5000 images), E. delegatensis (5000 images) and P. caribaea (10000 images), making a 20000 image data set.These are referred to as the 'single-species' and 'multi-species' data sets.From each data set, models were made using mean centred image data of both vector normalised (referred to as 'VN' models) and unmodified data.Models were then tested with 6925 P. caribaea and 5000 samples of either P. radiata or E. delegatensis.RMSEP model verification was carried out on a per species basis with models from both data sets.

Results
Fig. 2 shows two diffraction images at opposite ends of the range of MFA values seen in P. caribaea samples.Fig. 2a is a low MFA sample (9°), with very narrow azimuthal peaks (101, 101 and 002 in order from the centre) close to the equator.Fig. 2b is an example of a high MFA sample (30°) and the azimuthal peaks are much broader.The arrows (Fig. 2a) on the 002 reflections indicate the direction of azimuthal broadening in high MFA samples.Fig. 3 shows the first 18 eigenvectors/eigenimages from PCA of the single-species dataset.The eigenimages show various symmetry features involving rotation, reflection and inversion.These symmetry features are characteristics of the diffraction pattern derived from wood samples and result from the major sources of variance in the diffraction images.For example, the first eigenimage can be interpreted as the primary contribution to variation in equatorial intensity with complementary variation in MFA (azimuthal peak width).
The eigenimage PC2 arises from differential absorption of equatorially scattered X-rays.This differential is caused mainly by rotation of the samples in the horizontal plane and strong horizontal variations in wood density (annual rings) in the radial direction.PC3 is the result of changes due to tilting (roll) of the fibre around the X-ray beam, causing the main diffraction peaks to be rotated away from the equator and described by rotational symmetry combined with a lack of reflection symmetry.Some higher order PCs such as PC7 do not show rotational symmetry and probably arise from fibre tilt in the direction of the X-ray beam (pitch) and rotation around the fibre axis.As discussed in section 2 above, the ILS modelling procedure incorporates the eigenimage vectors into a predictive model of the dependent variable using the proportionality found within the T matrix (Equation 4).The regression coefficients (β) such as those seen in Fig. 5 are also generated.These coefficients show a distinct pattern of strongly negative values at the equatorial positions and strongly positive values moving into the equatorialmeridian arc.Fig. 5. Regression coefficients obtained from the single-species model using 17 PCs.The image shows distinct positive bands placed diagonally across the image and extending to the limit of the 002 peak 2θ value.For these images both the intensity and the sign of the values is important.This pattern follows the behaviour of the original data (Fig. 2), where high MFA samples exhibit broad, low intensity equatorial peaks and low MFA samples show more intense, narrow peaks.Thus the model can predict the MFA of unknown samples by summing the product of each pixel value in the unknown image and the regression coefficient at that pixel position, and adding the best estimate for the mean MFA of the calibration set.It is understood that the unknown image is collected on a similar X-ray camera, with matching effective experimental geometry and pre-processed in the same way as the calibration images.Fig. 6 shows that the RMSEP of the P. caribaea test data is reduced only when specific eigenimages are added to the model.The contributing PCs are seen to be PC1, 4, 5, 6, 10, 14, 16, and 17.When referenced against the eigenimages in Fig. 3, the influential PCs are seen to have a dominant characteristic of two orthogonal mirror planes, one through the meridian and another at the equator.This symmetry arises from the vertically aligned wood fibres and the horizontal scattering of x-rays from the crystalline cellulose c-axis planes which, on average are aligned with the fibre axis.Fig. 6.RMSEP derived from the 10000 image single-species model using 6925 P. caribaea verification samples shows that vector normalisation gives consistently better prediction than non-vector normalised (mean centred only) data.The vector normalised samples reach a minimum on the addition of PC17 to the model.Fig. 7 shows the scatter plots of predicted MFA versus the reference data MFA for the three species as obtained using the single-species data set.It is seen that the single species model predicts P. caribaea reference data well, with the predicted data following the line of perfect correlation adequately (RMSEP = 2.7° at 17 PCs for VN samples).The prediction of P. radiata samples is also found to be accurate (RMSEP = 1.7° using 19 PCs for the VN samples), with a deviation in the slope producing an overestimation of high MFA samples.The E. delegatensis samples do not fair so well, with a consistent shift in the predicted MFA values over the reference values (RMSEP = 5.4° using 37 PCs in the VN samples).Analysis shows that much of the error in the E. delegatensis samples is due to the y value obtained from the P. caribaea (17.1°) not being a good for the E. delegatensis y value (11.9°).Correcting for this difference and using it in the model (i.e.offsetting the E. delegatensis predicted data by 17.1-11.9)results in a reasonable RMSEP of 2.1° using 18 PCs.Fig. 8. Plot of predicted MFA versus the reference MFA data for a selected region of high MFA within the early growth wood of a P. caribaea sample.Model is from 17 PCs using the VN single-species data only.X axis represents sample number moving along a wood sample.
The data in Fig. 8 shows the advantage of using models such as these as a verification tool for data measured by other procedures.Three data 'dropouts' can be observed in a region of high MFA in the reference MFA data in Fig. 8.These problems are possibly due to high fibre pitch, which distorts the diffraction patterns and thus cause the direct determination of MFA to fail occasionally.It is seen however, that the multivariate model has predicted MFA to follow a profile under the reference data and does not have the problem of dropouts.Fig. 9 shows that improvements can be made to the predictive potential of models by using multiple species in the calibration data.Fig. 9a shows that the multi-species model marginally increases the accuracy of prediction of P. radiata samples, notably drawing the high MFA samples onto the line of perfect correspondence (RMSEP = 1.2° using 37 PCs, VN samples).Fig. 9b shows that the E. delegatensis samples are still not well modelled (RMSEP = 4.2° using 18 PCs).The offset is predominately due to the difference in the average MFA between the multi-species dataset (15.2°) and the E. delegatensis reference samples (12.1°).This lack of fit indicates that the average values used in offsets of the models must be carefully monitored and that it may be necessary to generate species specific models.

Discussion
The PCA modelling procedure reduced the dimensionality of the image dataset from the 16384 pixels present in the original images to a sum of contributions of a smaller number of principal components.Each component found had a characteristic (and mutually orthogonal) distribution of intensities over the 2D image plane.The PCA technique selects the features of these components in order of contributing variance to the (pre-processed) data set.Due to the broad nature of the diffraction features in wood samples, PCA was able to obtain PCs that describe most of the variance in the images.The inverse least squares procedure then used the newly formed variables as predictors to produce a least squares model which can accurately predict MFA.Vector normalised image data produced more accurate models in all experiments than unprocessed, mean centred only data.An R 2 of 0.96 was obtained for P. caribaea samples in both single-species and mixed-species models.P. radiata R 2 reached a maximum around 0.93 for both models and E. delegatensis R 2 increased from 0.84 to 0.91 when the model was produced from the multi-species data set.
Although normalisation procedures are effective, use of regression coefficients with data from other X-ray systems will suffer from errors due to between system differences.These problems are typical of PCR modelling techniques that are sensitive to the precise instrumental system being calibrated.
A disadvantage of this analysis method for large image datasets is the computer memory requirements needed in creation of the models.A single wood sample could generate up to 3000 1.4 Mpixel images, and there may be hundreds of such samples in a project.These difficulties can be overcome through use of high performance cluster-based computing systems and the use of graphics processing units (GPUs) to speed up the PCA data reduction procedure.In the analysis above, images were down-sized to 128 by 128 pixels by discarding pixels.Although not shown, the use of more pixels in the analysis or use of averaging of the pixels should improve the resulting accuracy of models produced.The data analysis system found benefit from use of the HDF data format as it allowed both data and www.intechopen.comPrincipal Component Analysis -Engineering Applications 156 results to be grouped, with each experiment recorded and similar experiments able to be filed together.The HDF format allows data importation into multiple other popular data analysis programs and thus aides data analysis by researchers with differing software access.
The computational burden of the final prediction part of the procedure is relatively low compared with mapping Φ-θ images into orthogonal axes and segmentation of the image as is presently used by the geometric method of MFA prediction within the SilviScan system.Thus as a technique to increase the robustness of a high throughput system, whole image multivariate property prediction has potential.
A further advantage of multivariate techniques is that standard methods for outlier detection can be used and thus samples that lie outside the calibration can be highlighted and scrutinized by an operator.The ability to emphasize unusual samples is of interest in a high throughput system, where the amount of data collected often precludes a thorough assessment of all results obtained.The technique should tolerate noisy images and allow faster sample throughput.

Conclusions
The calibration techniques used here open the way for rapid determination of difficult to obtain wood properties.Measurements from instruments that can accurately measure properties of interest, albeit more slowly can be used for calibration and subsequently used in the high throughput instrument.Models will have to be carefully designed and tested such that they will be valid for a species of interest.These models can be incorporated into the instrument software for checking data integrity.In addition, the signal-to-noise advantage inherent in employing the whole image as a predictor could be used to obtain accurate values at a faster rate.

Fig. 1 .
Fig. 1.Typical geometry of wood samples used in relation to the tree and the diffraction pattern produced.Miller indices associated with the observed reflections are indicated.The commonly used 040 meridional reflection is outside the range of the camera and is unreliable when fibre orientation varies as it does in core samples.

Fig. 4
Fig.4shows a logarithmic plot of eigenvalue (the dot product of each scores vector with itself) for successive PCs from the mean-centred, vector-normalised single-species data.The eigenvalues rapidly decrease by over 3 orders of magnitude.

Fig. 7 .
Fig. 7. Reference versus predicted scatter plots of MFA from the single-species model using 17 PCs and VN samples for the predictive model.a. P. caribaea, b.P. radiata and c.E. delegatensis reference samples.

Fig. 9 .
Fig. 9. Reference versus predicted scatter plots of MFA being predicted from the 200000 image multi-species model a. Prediction of P. radiata samples MFA from model using 37 PCs and b.Prediction of E. delegatensis samples using 17 PCs in the model.