Evaluating Abiotic Factors Related to Forest Diseases: Tool for Sustainable Forest Management

Sustainable forest management (SFM) is not a new concept. However, its popularity has increased in the last few decades because of public concern about the dramatic decrease in forest resources. The implementation of SFM is generally achieved using criteria and indicators (C&I) and several countries have established their own sets of C&I. This book summarises some of the recent research carried out to test the current indicators, to search for new indicators and to develop new decision-making tools. The book collects original research studies on carbon and forest resources, forest health, biodiversity and productive, protective and socioeconomic functions. These studies should shed light on the current research carried out to provide forest managers with useful tools for choosing between different management strategies or improving indicators of SFM.

weather stations or local climate models may provide more useful data since they have a greater level of detail. Such higher-resolution data may significantly improve risk assessment (Krist et al., 2008).

Developing the database
Developing risk models requires both forest health condition and abiotic factors to be combined in a geographic information system (GIS). In this chapter, tools from ArcView 3.3 and ERDAS software are described, but newer GIS software offers similar tools. In addition, researchers around the world are developing free GIS software, which now or in the future will probably provide the same tools. The database must include information from training sites, i.e. geo-located forest patches whose health condition and abiotic factors are known. Training sites can be selected by field checking (La Manna et al., 2012) or from a map of species distribution and health condition (La Manna et al., 2008b). Each patch should have a homogeneous health condition, and training sites should include both diseased and healthy patches or only diseased ones, depending on model requirements. The selection of training sites requires a proper sampling method that covers the range of host and abiotic conditions in order to minimize bias. Stratified random sampling or simple random sampling should be applied, and the extension Table Select deluxe tools v.1.0 of ArcView can be useful for the selection.

The abiotic factors should be mapped over the whole study area: the environmental features of the training sites are needed to build the database, whereas the environmental features over the whole distribution area of the forest species are needed to build the risk map. Figure 1 outlines the process of building the database. Once the environmental layers are complete, the mean value of each site-attribute layer can be extracted for each training site with the Zonal attributes tool of ERDAS. This tool extracts zonal statistics (mean, standard deviation, minimum and maximum) from a vector coverage and saves them as polygon attributes.
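The zonal extraction step can be illustrated outside any particular GIS. The following is a minimal sketch, assuming that the training sites have already been rasterized into a grid of site identifiers aligned cell by cell with an environmental layer. The function name, the toy grids and the use of the population standard deviation are assumptions made for illustration and do not reproduce the ERDAS tool.

```python
from statistics import mean, pstdev


def zonal_statistics(zone_grid, value_grid, nodata=None):
    """Summarize one environmental layer for each training site (zone)."""
    cells_by_zone = {}
    for zone_row, value_row in zip(zone_grid, value_grid):
        for zone_id, value in zip(zone_row, value_row):
            # Skip cells outside every training site and cells without data
            if zone_id is None or value == nodata:
                continue
            cells_by_zone.setdefault(zone_id, []).append(value)
    return {
        zone_id: {
            "mean": mean(values),
            "std": pstdev(values),  # population standard deviation (assumption)
            "min": min(values),
            "max": max(values),
        }
        for zone_id, values in cells_by_zone.items()
    }


# Hypothetical 3 x 4 grids: training-site identifiers and elevation (m)
zones = [
    [1, 1, None, 2],
    [1, 1, 2, 2],
    [None, None, 2, 2],
]
elevation = [
    [800, 820, 900, 1000],
    [810, 830, 1010, 1020],
    [850, 860, 1030, 1040],
]

stats = zonal_statistics(zones, elevation)
for site_id, summary in stats.items():
    print(site_id, summary)
```

Repeating the call for each environmental layer yields one row of attributes per training site, which is the structure the database needs.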

Building risk models
There are different modeling techniques for developing risk models based on abiotic factors, with predictive performance varying according to the focus of the study (Brotons et al., 2004; Manel et al., 2001; Pearson et al., 2006, 2007; Phillips et al., 2006). Data requirements vary between the techniques. While some models require data on both presence and absence of the disease (i.e., diseased and healthy training sites), others need only presence data. The former models are appropriate if absence of the disease is due to environmental restrictions, while the latter approach is appropriate when factors other than environmental variables (e.g. history of spread) explain most of the absences. In some cases, absence data are doubtful; for example, for forest diseases that are manifested earlier in the lower stem and later in the crown, which delays detection by remote sensing. In these cases, the health condition of training sites should be obtained in the field (La Manna et al., 2012), since undetected disease produces false absences (false negatives), which distort the mathematical functions describing habitats. Among the available modeling techniques, three are described in this chapter on the basis of their requirements for disease presence or presence/absence data: Mahalanobis distance (requires only presence data), Maxent (requires only presence data and generates pseudo-absences) and Logistic regression (based on presence/absence data). These methods are inherently flexible, being applicable to a wide range of ecological questions, taxonomic units and sampling protocols, and they have produced useful predictions in other studies (DeVries, 2005; Elith et al., 2006; Hellgren et al., 2007; La Manna et al., 2008b, 2012; Marsden & Fielding, 1999; Pearson et al., 2006; Schadt et al., 2002b).

Mahalanobis distance model

Brief description of the mathematical model
Mahalanobis distance, which requires only presence records, projects the potential distribution of the disease into geographical space without giving weight to observed absence information (Pearson et al., 2006). It was introduced by Mahalanobis (1936) and is the standardized difference between the values of a set of environmental variables describing a site (a rasterized cell or pixel in a GIS) and the mean values of those same variables calculated from points at which the disease was detected (Browning et al., 2005; Rotenberry et al., 2006). Mahalanobis distances are based on both the mean and variance of the predictor variables, plus the covariance matrix of all the variables. Mahalanobis distance is calculated as:

$$D^2 = (x - m)^T \, C^{-1} \, (x - m)$$

where:
$D^2$ = Mahalanobis distance
$x$ = vector of data
$m$ = vector of mean values of independent variables
$C^{-1}$ = inverse covariance matrix of independent variables
$T$ = indicates that the vector should be transposed

The greater the similarity between the environmental conditions at a point and the mean environmental conditions of all training points, the smaller the Mahalanobis distance and the higher the disease risk at that point. Mahalanobis distance has been used in studies employing a GIS to quantify habitat suitability for wildlife and plant species (DeVries, 2005; Johnson & Gillingham, 2005; Hellgren et al., 2007).
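The calculation of D² can be sketched in a few lines. The example below is illustrative only: it assumes two environmental variables (elevation and precipitation) measured at five hypothetical diseased training points, uses the sample covariance (divisor n - 1), and solves the linear system C z = (x - m) by Gaussian elimination instead of inverting C explicitly. All function names and values are assumptions.

```python
def mean_vector(points):
    """Mean of each environmental variable over the training points (m)."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def covariance_matrix(points, m):
    """Sample covariance matrix (C) of the environmental variables."""
    n, k = len(points), len(m)
    return [
        [sum((p[i] - m[i]) * (p[j] - m[j]) for p in points) / (n - 1)
         for j in range(k)]
        for i in range(k)
    ]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][c] * z[c] for c in range(i + 1, k))
        z[i] = (aug[i][k] - tail) / aug[i][i]
    return z


def mahalanobis_sq(x, m, cov):
    """D^2 = (x - m)^T C^-1 (x - m) for one cell."""
    d = [xi - mi for xi, mi in zip(x, m)]
    z = solve(cov, d)  # z = C^-1 (x - m)
    return sum(di * zi for di, zi in zip(d, z))


# Hypothetical diseased training points: (elevation in m, precipitation in mm)
training = [(820, 1500), (790, 1620), (860, 1400), (805, 1580), (840, 1450)]
m = mean_vector(training)
cov = covariance_matrix(training, m)

# A cell close to the mean conditions and a cell far from them
print(mahalanobis_sq((825, 1500), m, cov))
print(mahalanobis_sq((1200, 900), m, cov))
```

The second cell should yield a much larger D², i.e. a lower similarity to the conditions of the diseased sites.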

Applying Mahalanobis distance in a GIS
Since Mahalanobis distance considers points (and not patches), the polygon layer with the diseased training sites, selected in the field or from the map, must be converted to a point layer. This conversion is done by finding the point at the center of each patch with the "Convert shape to centroid" option of the Xtool ArcView extension. The vector of mean values of each site variable and the variance/covariance matrix of the site variables are generated from this point layer (Figure 2). The Mahalanobis distance for each cell of the study area is calculated from this matrix with the Mahalanobis distances extension for ArcView (Jenness, 2003). This extension may be freely downloaded from: http://www.jennessent.com/arcview/mahalanobis.htm. For easier interpretation of results, the Mahalanobis distance statistic can be converted to probability values, rescaled to range from 0 to 1 according to the χ² distribution (Rotenberry et al., 2006).
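The rescaling to probability values can be sketched as the upper-tail probability of a χ² distribution evaluated at D². The code below is a hedged illustration: the number of degrees of freedom is taken here as the number of environmental variables, which is an assumption and should be checked against the documentation of the tool actually used. The function evaluates the regularized incomplete gamma function by a power series, which is adequate for moderate distances but loses precision for very large ones.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, t = df / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma function P(a, t)
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= t / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-t + a * math.log(t) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


# Hypothetical D^2 values for three cells, with 3 environmental variables (assumed df = 3)
for d_squared in (0.5, 3.0, 12.0):
    print(d_squared, round(chi2_sf(d_squared, df=3), 4))
```

Small distances map to probabilities close to 1 (conditions similar to the diseased training sites) and large distances to probabilities close to 0.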

Maximum entropy species distribution modelling (Maxent)

Brief description of the mathematical model
Like Mahalanobis distance, Maxent is a model requiring presence data, but it generates "pseudo-absences" using background data as a substitute for true absences (Phillips & Dudík, 2008). Maxent formalizes the principle that the estimated distribution must agree with everything that is known (or inferred from the environmental conditions at the occurrence localities) but should avoid placing any unfounded constraints. The approach is to find the probability distribution of maximum entropy (i.e., closest to uniform, or most spread out), subject to constraints imposed by the information available regarding the observed distribution of the disease and environmental conditions across the study area. The Maxent distribution belongs to the family of Gibbs distributions and maximizes a penalized log likelihood of the presence sites. The mathematical definition of Maxent and the detailed algorithms are described by Phillips et al. (2006), Phillips & Dudík (2008) and Elith et al. (2011). Maxent has been applied to modeling species distributions and disease risk with good performance (La Manna et al., 2012; Pearson et al., 2007; Phillips & Dudík, 2008; Phillips et al., 2006).

Fig. 2. Schematic representation of Mahalanobis distance procedure. [Diagram labels: diseased training sites covering the diseased forests distribution; environmental layers; Mahalanobis distance for the study area is calculated based on training sites characteristics; Mahalanobis distance values in the area of interest (i.e. forest distribution).]

Applying Maxent in a GIS
Maxent can be freely downloaded and used from: http://www.cs.princeton.edu/~schapire/maxent/ and it is regularly updated to include new capabilities. A friendly tutorial explaining how to use this software is provided on the web page, including a Spanish translation.
To perform a run, a file containing presence localities (i.e. diseased training sites) and a directory containing environmental variables need to be supplied. The implementation of Maxent requires the conversion of the files to the proper formats. The file with the list of diseased training sites must be in CSV format, including their identification name, longitude and latitude. The environmental layers must be saved as ASCII raster grids (i.e. .asc format), and the grids must all have the same geographic bounds and cell size.
Environmental grids can be saved as ASCII files with the "Export data source" tool of ArcView. Maxent must be run following the detailed information included in the tutorial (Phillips et al., 2005). Maxent supports three output formats for model values: the Maxent exponential model itself (raw), cumulative and logistic. The logistic output format, with values between 0 and 1, is easier to interpret and it improves model calibration, so that large differences in output values correspond better to large differences in suitability (Phillips & Dudík, 2008).
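The input formats described above can also be prepared and checked with a short script. The sketch below builds the text of a presence file and of two small ASCII grids, then verifies that the grids share the same dimensions, origin and cell size. The coordinates, layer values, the label "diseased" and the column names are hypothetical; the header keywords follow the common ESRI ASCII grid layout.

```python
import csv
import io


def samples_csv(sites, label="diseased"):
    """Build the presence file: identification name, longitude, latitude."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["species", "longitude", "latitude"])
    for lon, lat in sites:
        writer.writerow([label, lon, lat])
    return buffer.getvalue()


def ascii_grid(rows, xllcorner, yllcorner, cellsize, nodata=-9999):
    """Build the text of an ESRI-style ASCII raster grid (.asc)."""
    header = [
        f"ncols {len(rows[0])}",
        f"nrows {len(rows)}",
        f"xllcorner {xllcorner}",
        f"yllcorner {yllcorner}",
        f"cellsize {cellsize}",
        f"NODATA_value {nodata}",
    ]
    body = [" ".join(str(v) for v in row) for row in rows]
    return "\n".join(header + body) + "\n"


def grid_geometry(text):
    """Read dimensions, origin and cell size from the first five header lines."""
    fields = {}
    for line in text.splitlines()[:5]:
        key, value = line.split()
        fields[key.lower()] = float(value)
    return fields


# Hypothetical inputs: two diseased sites and two environmental layers
presence_text = samples_csv([(-71.50, -42.90), (-71.48, -42.93)])
elevation_asc = ascii_grid([[800, 820], [810, -9999]], -71.6, -43.0, 0.05)
slope_asc = ascii_grid([[5, 12], [8, 3]], -71.6, -43.0, 0.05)

# All layers must share the same geographic bounds and cell size
assert grid_geometry(elevation_asc) == grid_geometry(slope_asc)
print(presence_text)
print(elevation_asc)
```

Checking the grid geometry before a run avoids a frequent source of errors, since layers exported with different extents or cell sizes cannot be combined.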

Logistic regression model

Brief description of the mathematical model
Logistic regression is a generalized linear model used for binomial regression, and it requires presence/absence data. What distinguishes a logistic regression model from a linear regression model is that the dependent variable is binary or dichotomous (Hosmer & Lemeshow, 1989). The binary dependent variable is disease occurrence (i.e., diseased training site; y=1) or disease absence (i.e., healthy training site; y=0). In contrast to the other models described, logistic regression projects the potential distribution of the disease onto geographical space in such a way that information regarding unsuitable conditions resulting from environmental constraints is inherent within the absence data (Pearson et al., 2006). Logistic regression predicts the probability of occurrence of an event by fitting data to a logistic curve, according to the following formula:

$$\operatorname{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 V_1 + \dots + \beta_n V_n$$

where:
$P$ is the probability of disease occurrence
$\beta_0$ is the Y-intercept
$\beta_1 \dots \beta_n$ are the coefficients assigned to each of the independent variables ($V_1 \dots V_n$)

Probability values are calculated with the equation below, where $e$ is the base of the natural logarithm:

$$P = \frac{e^{\operatorname{logit}(P)}}{1 + e^{\operatorname{logit}(P)}}$$

A comprehensive description of logistic regression and its applications is presented by Hosmer & Lemeshow (1989). Figure 3 shows a graphical example of a logistic regression model based on presence/absence data of a disease and a soil feature as the independent variable.
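As an illustration of how the coefficients of such a model can be estimated, the following sketch fits a logistic regression with a single independent variable by plain gradient ascent on the log-likelihood. Statistical packages normally use more efficient procedures, so this is a simplified substitute and not the method of any particular software. The soil-feature values and health conditions are hypothetical.

```python
import math


def fit_logistic(x, y, learning_rate=0.1, n_iterations=5000):
    """Estimate the intercept (b0) and slope (b1) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iterations):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p          # gradient with respect to b0
            g1 += (yi - p) * xi   # gradient with respect to b1
        b0 += learning_rate * g0 / n
        b1 += learning_rate * g1 / n
    return b0, b1


def predict(b0, b1, v):
    """P = e^logit / (1 + e^logit), with logit = b0 + b1 * v."""
    logit = b0 + b1 * v
    return math.exp(logit) / (1.0 + math.exp(logit))


# Hypothetical training sites: a soil feature (V1) and health condition (1 = diseased)
soil_feature = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
diseased = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(soil_feature, diseased)
print("intercept:", round(b0, 3), "coefficient:", round(b1, 3))
print("P(disease | V1 = 4.0):", round(predict(b0, b1, 4.0), 3))
```

A positive coefficient indicates that the probability of disease occurrence increases with the soil feature, which corresponds to the rising shape of the logistic curve.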

Applying logistic regression in a GIS
From the database combining health condition and abiotic factors of the training sites, the logistic regression model can be fitted using common statistical software, such as SPSS, SAS, Infostat, or free software packages. For example, Infostat is a friendly and inexpensive statistical package that offers a version that can be freely downloaded from: http://www.infostat.com.ar. The output of the logistic regression analysis shows the coefficients assigned to each of the environmental variables (V1… Vn), and the probability values for each cell of the study area can then be obtained in the GIS. Calculations can be done with the "Calculate maps" tool of the Grid Analyst extension of ArcView, using site layers in grid format (Figure 4). Thus, a grid with probabilities of disease occurrence is generated according to the logistic model.
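The map calculation amounts to applying the fitted equation cell by cell. A minimal sketch follows, assuming that each environmental layer is available as a grid of equal dimensions; the coefficients and layer values are hypothetical and are not taken from any fitted model.

```python
import math


def probability_grid(intercept, coefficients, layers):
    """Apply P = 1 / (1 + e^-logit) to every cell of aligned environmental grids."""
    n_rows, n_cols = len(layers[0]), len(layers[0][0])
    result = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            logit = intercept + sum(
                b * layer[r][c] for b, layer in zip(coefficients, layers)
            )
            row.append(1.0 / (1.0 + math.exp(-logit)))
        result.append(row)
    return result


# Hypothetical layers (V1: slope in degrees, V2: distance to streams in km)
slope = [[2.0, 10.0], [25.0, 5.0]]
distance_to_streams = [[0.1, 0.5], [2.0, 0.2]]

# Hypothetical coefficients from a logistic regression output
risk = probability_grid(1.5, [-0.12, -0.9], [slope, distance_to_streams])
for row in risk:
    print([round(p, 3) for p in row])
```

The expression 1 / (1 + e^-logit) is algebraically identical to e^logit / (1 + e^logit) and is used here only because it is shorter to write.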

Evaluating abiotic factors. Selecting the most important variables
An advantage of Maxent and logistic regression over Mahalanobis distance is that they make it easy to identify the abiotic factors most closely related to the disease and to choose the best combination of variables. As mentioned above, the environmental variables included a priori in the models depend on the knowledge about the disease. However, not all the variables considered a priori are necessarily equally important for quantifying the disease risk at the landscape scale. Maxent detects which variables matter most by calculating the percent contribution of each environmental variable to the model (Phillips et al., 2005). As an alternative estimate of variable weight, a jackknife test can also be run in Maxent. Figure 5 shows an example of a jackknife test, where the environmental variable "agua_move" appears to have the most useful information by itself (blue bar). The environmental variable that decreases the gain the most when it is omitted is also agua_move (light blue bar), which therefore appears to have the most information that is not present in the other variables. In the case of logistic regression, the best combination of variables can be chosen according to the best subsets selection technique (Hosmer & Lemeshow, 1989), the lowest Akaike information criterion (AIC) (Burnham & Anderson, 1998), the greatest sensitivity (i.e., proportion of correctly predicted disease occurrences) or the stepwise method (Steyerberg et al., 1999).
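Selection by AIC can be sketched as follows. The example assumes that several candidate logistic models have already been fitted and that their predicted probabilities for the training sites are available; AIC is computed as 2k - 2 ln(L), where k is the number of estimated parameters and L the likelihood. The candidate names, parameter counts and probabilities are hypothetical.

```python
import math


def log_likelihood(observed, probabilities):
    """Log-likelihood of binary observations given predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(observed, probabilities)
    )


def aic(observed, probabilities, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood(observed, probabilities)


# Hypothetical health condition of six training sites (1 = diseased)
observed = [1, 1, 1, 0, 0, 0]

# Hypothetical candidate models: (number of parameters, predicted probabilities)
candidates = {
    "slope": (2, [0.7, 0.6, 0.6, 0.4, 0.5, 0.3]),
    "slope + soil": (3, [0.9, 0.8, 0.7, 0.2, 0.3, 0.1]),
    "slope + soil + aspect": (4, [0.9, 0.8, 0.7, 0.2, 0.3, 0.1]),
}

scores = {name: aic(observed, p, k) for name, (k, p) in candidates.items()}
best_model = min(scores, key=scores.get)
for name, value in scores.items():
    print(name, round(value, 2))
print("lowest AIC:", best_model)
```

In this toy case the third model fits no better than the second but uses one more parameter, so AIC penalizes it and the two-variable model is retained.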

Assessment of model performance
The predictive performance of modeling algorithms may be very different (Brotons et al., 2004; Manel et al., 2001; Pearson et al., 2006, 2007; Phillips et al., 2006). Differences could be related to the intrinsic properties of the mathematical functions inherent to each model and to the various assumptions made by each algorithm when extrapolating environmental variables beyond the range of the data used to define the model (Pearson et al., 2006). Further, the data set used for running the models differs depending on whether presence-only or presence/absence data are considered. Receiver operating characteristic (ROC) curves and the Kappa statistic are indices widely used for assessing model performance. The ROC curve procedure is a useful way to evaluate the performance of classification schemes in which there is one variable with two categories by which subjects are classified. The area under the ROC curve (AUC) is the probability of a randomly chosen presence site being ranked above a randomly chosen absence site. This procedure relates the relative proportions of correctly and incorrectly classified predictions over a wide and continuous range of threshold levels (Pearce & Ferrier, 2000). The main advantage of this analysis is that AUC provides a single measure of model performance, independent of any particular choice of threshold. AUC can be calculated with common statistical software. The ROC plot shown in Figure 6 is obtained by plotting all sensitivity values (true positive fraction) on the y axis against their equivalent (1 - specificity) values (false positive fraction) on the x axis. Specificity of a model refers to the proportion of correctly predicted absences. ROC analysis has been applied to a variety of ecological models (Brotons et al., 2004; Hernández et al., 2006; La Manna et al., 2008b; Pearson et al., 2006; Phillips et al., 2006).
AUC values between 0.7 and 0.9 indicate a reasonable discrimination ability considered potentially useful, and values higher than 0.9 indicate very good discrimination (Swets, 1988). If absence data are not available, AUC may also be calculated with presence data and pseudo-absences chosen uniformly at random from the study area (Phillips et al., 2006). However, having both true absence and true presence sites is better for evaluating model performance (Fielding & Bell, 1997).
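The probabilistic definition of AUC given above translates directly into a pairwise comparison. The sketch below counts, over all presence/absence pairs, how often the presence site receives the higher predicted value, with ties counted as one half. The predicted values are hypothetical.

```python
def auc(presence_scores, absence_scores):
    """Probability that a random presence site is ranked above a random absence site."""
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5  # ties contribute one half
    return wins / (len(presence_scores) * len(absence_scores))


# Hypothetical model outputs at assessment sites
diseased_sites = [0.91, 0.78, 0.64, 0.40]
healthy_sites = [0.55, 0.32, 0.20, 0.12]

print("AUC =", auc(diseased_sites, healthy_sites))
```

This pairwise form is equivalent to the area under the ROC curve and is convenient for small assessment sets; for large data sets a rank-based calculation is faster.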

Fig. 6. ROC curve. Diagonal segments are produced by ties.

The Kappa statistic is another widely used index (Loiselle et al., 2003; Hernández et al., 2006; Pearson et al., 2006) that can be calculated with common statistical software. The Cohen's Kappa and Classification Table Metrics 2.1a, an ArcView 3x extension, may also be useful and can be freely downloaded from: http://www.jennessent.com/arcview/kappa_stats.htm. Cohen's kappa is calculated at threshold increments, e.g. increments of 0.05 from 0 to 1, and the maximum Kappa for each model is considered. Kappa values approaching 0.6 represent a good model (Fielding & Bell, 1997). The models should be run on the full set of training data to provide the best estimates of the disease's potential distribution (Phillips et al., 2006). However, in order to assess and compare model performance, models should be run with just a portion of the training sites, and the rest of the data should be used for the assessment. For each model, several (e.g. ten) random partitions of the data are made, building the model with 75% of the training sites and reserving the remaining 25% for performance assessment. Then, AUC and Kappa values are calculated for each random set of assessment data and for each model, and they are compared between models by nonparametric analysis (Phillips et al., 2006). The performance of the three models described in this chapter (i.e. Mahalanobis distance, Maxent and Logistic regression) was compared for modeling a forest disease in Patagonia (La Manna et al., 2012). Results showed that all the models were consistent in their predictions; however, Maxent and Logistic regression performed better, with greater values of AUC and Kappa statistics, and Logistic regression allowed the best discrimination of high risk sites. Studies that compared presence-absence versus presence-only modeling methods suggest that, if absence data are available, methods using this information should be preferred in most situations (Brotons et al., 2004).
However, Maxent is considered one of the best performing models (Elith et al., 2006; Hernández et al., 2006; Pearson et al., 2006; Phillips et al., 2006), and Mahalanobis distance has also provided good results in conservation studies (DeVries, 2005; Johnson & Gillingham, 2005; Hellgren et al., 2007). The performance of the risk models may vary greatly with each case and forest disease.
Building and comparing models based on different algorithms allows the best one to be identified for each case.
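The calculation of the maximum Kappa over threshold increments, mentioned earlier in this section, can be sketched as follows. Predicted probabilities are converted to presence/absence at each threshold, Cohen's kappa is computed from the resulting classification table, and the largest value is kept. The observations and probabilities are hypothetical, and classifying a site as diseased when its probability is equal to or greater than the threshold is an assumption.

```python
def cohen_kappa(observed, predicted):
    """Cohen's kappa for binary observed and predicted classes."""
    n = len(observed)
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    agreement = (tp + tn) / n
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if chance == 1.0:
        return 0.0  # a single class only: kappa is undefined, reported here as 0
    return (agreement - chance) / (1.0 - chance)


def max_kappa(observed, probabilities, step=0.05):
    """Return (maximum kappa, threshold) over thresholds from 0 to 1."""
    n_steps = round(1 / step)
    best_kappa, best_threshold = -1.0, None
    for i in range(n_steps + 1):
        threshold = i / n_steps
        predicted = [1 if p >= threshold else 0 for p in probabilities]
        kappa = cohen_kappa(observed, predicted)
        if kappa > best_kappa:
            best_kappa, best_threshold = kappa, threshold
    return best_kappa, best_threshold


# Hypothetical assessment sites (1 = diseased) and model outputs
observed = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
kappa, threshold = max_kappa(observed, probabilities)
print("maximum kappa:", round(kappa, 3), "at threshold", threshold)
```

When several random partitions are used, the same function can be applied to each set of assessment data, giving one maximum Kappa per partition and per model.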

Mapping the risk. Selecting thresholds
The three risk models presented in this chapter produce grids of probability values of disease occurrence, varying between 0 and 1. However, in order to propose management criteria, it is important to define what probability represents a high risk of disease. Is it 0.4, 0.5 or 0.7? In order to convert quantitative measures of disease risk (i.e., probability) to qualitative values (i.e., low, moderate or high risk), threshold values must be selected. A possible criterion is to define thresholds by maximizing agreement between observed and modeled distributions for the sampled dataset. Sensitivity (the proportion of true positive predictions vs. the number of actual positive sites) and specificity (the proportion of true negative predictions vs. the number of actual negative sites) are calculated at different thresholds from the coordinates of the ROC curve. The threshold at which these two values are closest can be adopted. This approach balances the cost arising from an incorrect prediction against the benefit gained from a correct prediction (Manel et al., 2001), and is one of the recommended criteria for selecting thresholds (Liu et al., 2005). The lowest predicted value associated with any one of the observed presence records can also be considered as a threshold (i.e., lowest presence threshold) (Pearson et al., 2007). This approach can be interpreted ecologically as identifying pixels predicted as being at least as suitable as those where the disease presence has been recorded. The threshold identifies the maximum predicted area possible whilst maintaining zero omission error in the training data set. Using the two thresholds, three risk categories can be defined: low (with p values lower than the lowest presence threshold); moderate (p values between the lowest presence threshold and the sensitivity-specificity threshold); and high risk (p values greater than the sensitivity-specificity threshold) (La Manna et al., 2012).
Risk maps of disease occurrence can be generated for each model by reclassifying the model outputs, using the Grid Analyst extension of ArcView.
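The two thresholds and the resulting reclassification can be sketched as follows. The code searches for the threshold at which sensitivity and specificity are closest, takes the lowest predicted value among presence sites as the lowest presence threshold, and assigns each probability to a risk category. The data, the step of 0.01 and the treatment of values exactly equal to a threshold are assumptions made for illustration.

```python
def sensitivity_specificity(observed, probabilities, threshold):
    """Sensitivity and specificity when sites with p >= threshold are predicted diseased."""
    pairs = list(zip(observed, probabilities))
    tp = sum(1 for o, p in pairs if o == 1 and p >= threshold)
    fn = sum(1 for o, p in pairs if o == 1 and p < threshold)
    tn = sum(1 for o, p in pairs if o == 0 and p < threshold)
    fp = sum(1 for o, p in pairs if o == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def equal_sens_spec_threshold(observed, probabilities, step=0.01):
    """First threshold at which sensitivity and specificity are closest."""
    n_steps = round(1 / step)
    best_threshold, best_gap = None, float("inf")
    for i in range(n_steps + 1):
        threshold = i / n_steps
        sens, spec = sensitivity_specificity(observed, probabilities, threshold)
        if abs(sens - spec) < best_gap:
            best_threshold, best_gap = threshold, abs(sens - spec)
    return best_threshold


def lowest_presence_threshold(observed, probabilities):
    """Lowest predicted value among the observed presence records."""
    return min(p for o, p in zip(observed, probabilities) if o == 1)


def risk_class(p, low_threshold, high_threshold):
    """Reclassify a probability as low, moderate or high risk."""
    if p < low_threshold:
        return "low"
    if p <= high_threshold:
        return "moderate"
    return "high"


# Hypothetical training sites (1 = diseased) and model outputs
observed = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

low_t = lowest_presence_threshold(observed, probabilities)
high_t = equal_sens_spec_threshold(observed, probabilities)
print("thresholds:", low_t, high_t)
print([risk_class(p, low_t, high_t) for p in (0.15, 0.35, 0.85)])
```

Applying risk_class to every cell of a probability grid gives the reclassified risk map.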

Conclusions
Forest diseases are key determinants of forest health, and information about disease presence and potential distribution is important to any management decision. Risk maps are more likely to be used if they address the same scale at which management decisions are made. Stand-scale management is increasingly being supplemented or replaced by landscape-scale management (Lundquist, 2005). Forest disease risk assessment provides important information to the forest services that make critical decisions on the best allocation of often-scarce resources. Risk models for pine wilt disease (Bursaphelenchus xylophilus) in Spain made it possible to plan control actions and to avoid planting susceptible species in high risk areas (Fernández & Solla, 2006). Risk models for sudden oak death in California provide an effective management tool for identifying emergent infections before they become established (Meentemeyer et al., 2004). Risk models for economically important South African plantation pathogens made it possible to assess the impact of climate change on the local forestry industry (Van Staden et al., 2004). Risk maps for A. chilensis disease in a valley of Patagonia allowed the detection of healthy forests at risk only inside protected areas. These results made it possible to suggest management actions for cattle and logging in disease-prone sites. This risk map also provided useful information for preventing restocking in areas where the risk is greatest (La Manna et al., 2012). The risk models discussed in this chapter allow the evaluation of abiotic factors related to the disease. This kind of model provides important information, which can be improved if knowledge about the biology and spread of a causal biotic agent is available. It is important to know whether the forest pathogen under study is endemic or exotic.
If it is exotic, the susceptibility must be assessed, based on the biological availability of a host and the potential for introduction and establishment of the disease within a predefined time frame. For this evaluation, the connectivity between patches may be key (Ellis et al., 2010).
On the other hand, if it is endemic, the disease is already established throughout a region, and a susceptibility assessment is not required because the potential or source for actualized harm is assumed to be equal everywhere (Krist et al., 2006). For both endemic and exotic diseases, mortality occurrence may vary greatly depending on site and stand conditions, and models like those shown in this chapter are a good tool for assessing risk. Variables included in the models should be carefully pre-selected according to the previous knowledge about the disease. These models (i.e., Mahalanobis distance, Maxent and Logistic regression) also admit variables such as distance to roads or distance to foci of infection, which could be important for the spread of infectious diseases.

Acknowledgment
The publication of this chapter was funded by Universidad Nacional de la Patagonia San Juan Bosco (PI 773).