Formulas for the three spatial covariance functions used in this analysis.
Climate change is increasing variation in freshwater input and the intensity of this variation in estuarine systems throughout the world. Estuarine salinity responds to dynamic meteorological and hydrological processes with important consequences to physical features, such as vertical stratification, as well as living resources, such as the distribution, abundance and diversity of species. We developed and evaluated two space-time statistical models to predict bottom salinity in Pamlico Sound, NC: (i) process and (ii) time models. Both models used 20-years of observed salinity and contained a deterministic component designed to represent four key processes that affect salinity: (1) recent and long-term fresh water influx (FWI) from four rivers, (2) mixing with the ocean through inlets, (3) hurricane incidence, and (4) interactions among these variables. Freshwater discharge and distance from an inlet to the Atlantic Ocean explained the most variance in dynamic salinity. The final process model explained 89% of spatiotemporal variability in salinity in a withheld dataset, whereas the final time model explained 87% of the variability within the same withheld data set. This study provides a methodological template for modeling salinity and other normally-distributed abiotic variables in this lagoonal estuary.
- space-time model
- spatial covariance
- freshwater inflow
- process-based model
Estuarine salinity responds to dynamic meteorological and hydrological processes  with important consequences to physical features, such as vertical stratification, as well as living resources, such as the distribution, abundance and diversity of species [2, 3, 4, 5]. For example, relatively low mixing and subsequent salinity stratification can lead to hypoxia in areas where organically-rich sediments are not adequately re-oxygenated, causing emigration of mobile fauna and degradation of ecosystem functions [5, 6, 7, 8, 9]. Rapid salinity changes, such as those associated with large rainfall events or tropical cyclones, can cause death of postlarval stages that are sensitive to unusually low salinities , and mass seaward migration and subsequent hyper-aggregation of mobile, commercially important species that can result in (1) shifts of juveniles from primary nursery areas protected from trawling to secondary non-nursery areas vulnerable to fishing pressure , (2) overharvest of adults due to increases in fishery catchability , or (3) bias fishery-independent surveys that leads to over-inflated population abundance estimates . Thus, the need to accurately predict the spatiotemporal dynamics of salinity is unprecedented. The specific goals of this study were to: (1) evaluate several statistical models to hindcast and forecast salinity in the second largest estuary and largest lagoonal estuary in the United States—Pamlico Sound, North Carolina, USA, and (2) assess salinity observations, predictions, and standard errors under five hydrologic scenarios characteristic of historic and future climate changes.
Pamlico Sound (PS) is a relatively shallow estuary with a mean depth of 4 m and a maximum depth of 7 m. PS circulation is dominated by wind-driven currents and freshwater input [13, 14]. Seasonal cyclonic storms are also an important climatological component of the PS system. Since 1996, over three tropical storms or hurricanes have passed within 300 km of the North Carolina coast per year . Given the important role that salinity plays in the abiotic and biotic system components of estuaries, and the likelihood that global climate change will increase the frequency of extreme weather events (e.g., floods, droughts, hurricanes—[9, 15, 16]), there is a critical need for models that can accurately forecast spatiotemporal variation in salinity (e.g., ). A recent review by Iglesias et al.  highlights the strengths of applying numerical modeling tools to characterize morpho-hydrodynamic processes in estuarine and coastal systems. Numerical methods can include a large variety of models and techniques, such as finite element, finite difference, finite volume, or Eularian-Lagrangian models (e.g., [17, 18, 19]). Complex, three-dimensional numerical models used for simulation and forecasting of dynamic estuarine salinity can require significant effort and computation time that is beyond the capabilities of many local management agencies. Local management agencies sometimes require a quick turnaround time for long-term simulations or short-term forecasts of estuarine salinity conditions, which could be produced using location-specific statistical models. Therefore, the goals of this study were to (1) develop and evaluate two types of statistical models of bottom salinity in PS, and (2) apply the best models to produce sound-wide retrospective maps of bottom salinity based on observational data. Bottom (as opposed to surface) salinity was chosen as the variable of interest because it characterizes habitats of mobile demersal species that are important members of benthic food webs, and that are the targets of valuable commercial and recreational fisheries. Hereafter, the term ‘salinity’ will always refer to bottom salinity unless otherwise noted.
1.1 Statistical models to predict dynamic salinity
Producing retrospective salinity maps based on observational data does not require a statistical model based on hydrological mechanisms that affect salinity; it is possible to perform individual spatial interpolations for each time period of interest using an ordinary kriging model or a universal kriging model with a simple spatial trend. Predicting salinity under a hypothetical set of conditions, however, does require a model that can ‘learn’ about hydrological mechanisms based on retrospective data (e.g., [20, 21]). Thus, the more comprehensive goal of this study was to produce retrospective maps of salinity by developing a space-time statistical model in which the mean function represents the hydrological mechanisms that affect salinity, and a spatial covariance function makes up the difference between the observed salinity data and the mean function’s salinity prediction.
To create such a model, we constructed explanatory variables that accounted for the effect of riverine freshwater inflow (FWI), distance to inlet sources of oceanic saltwater, and hurricane incidence on salinities at different locations in PS. We used a forward-selection process to choose which of these variables to keep in the model. Standard errors based on the covariance function allowed for assessment of strengths and weaknesses of the representation of the hydrology in the mean function. Since an additional goal of this study was to provide a template for researchers to build process-based models of normally-distributed estuarine variables, we considered only models that could be fit using procedures in the SAS® software package, yet can be adopted to R-statistical software.
Other process-based models of PS salinity in the literature—all of which are differential-equation-based deterministic models—provided important insights into how different variables influenced spatiotemporal salinity variation in PS ([22, 23], and others). However, these models ultimately lacked the spatial resolution and/or coverage of the entire area of interest of this study, and none quantified uncertainty at every space-time prediction location. For example, Xu et al.  predicted surface and bottom salinity, and temperature at 30-second intervals over a spatial grid with varying cell size (200–800 m2) in the Pamlico River Estuary (PRE), a PS tributary, using a customized extension of the Environmental Fluid Dynamics Code  to incorporate FWI from major tributary rivers, as well as tide and wind effects on circulation. Although this model incorporated environmental variation and produced salinity predictions suitable to assess long-term space-time trends, the PRE makes up only 18% of the area of PS. Predicting salinity across the entire PS using this model would require spatial domain expansion and re-parameterization, and such extensions are not planned (J. Lin, NC State University, pers. comm. on behalf of Xu et al. ).
Though we are unaware of researchers that have constructed space-time statistical models of salinity in PS, there are examples of applying statistical models for spatial prediction of salinity in other estuaries. For example, Rathbun  used independent multiple linear regression models with spatially-correlated errors to predict salinity and dissolved oxygen (DO) in Charleston Harbor, SC over a two-week time period in 1988 as a function of spatial coordinates and distance to the estuary mouth. Chehata et al.  performed three-dimensional spatial interpolation of salinity and DO measurements in Chesapeake Bay. Qiu and Wan  developed a salinity model based on time series analyses of salinity data for the Caloosahatchee River Estuary, Florida, USA. The structure of their model consisted of an autoregressive term representing the system persistence and an exogenous term accounting for physical drivers including freshwater inflow, rainfall, and tidal water surface elevation that cause salinity to vary. The model was calibrated and validated using up to 20 years of measured data collected they found that the time series model offers comparable or superior performance compared with its 3-D, numerical counterpart. This model has been used as a tool for water resources management projects relating to ecosystem restoration and water control in south Florida . Similarly, Ross et al.  examined the response of salinity in the Delaware Estuary, USA to climatic variations using statistical models and long-term (1950-present) records of salinity from the U.S. Geological Survey and the Haskin Shellfish Research Laboratory. The statistical models included non-parametric terms and were robust against auto-correlated and heteroscedastic errors. After using the models to adjust for the influence of streamflow and seasonal effects on salinity, several locations in the estuary showed significant upward trends in salinity. Insignificant trends are found at locations that are normally upstream of the salt front. The models indicate a positive correlation between rising sea levels and increasing residual salinity, with salinity rising from 2.5 to 4.4 psu per meter of sea-level rise. The results suggest that continued sea-level rise in the future will cause salinity to increase regardless of any variation in fresh water influx . Urquhart et al.  present the results of multiple statistical models that predicted daily, gridded surface salinity at 1 km resolution across Chesapeake Bay, USA as a function of surface reflectance estimates of salinity from the NASA Moderate Resolution Imaging Spectroradiometer (MODIS), onboard the Aqua platform satellite. Eight statistical methods were tested, and sea surface salinity was accurately predicted via remote sensed products with an accuracy that was more than sufficient for many physical and ecological applications .
None of these previous studies, however, attempted to explicitly represent the hydrological processes by which fresh and saltwater mixing affects estuarine salinity. In this paper, we describe the development of candidate explanatory variables to represent mechanisms affecting PS salinity and how that development led to consideration of two fundamentally different mean functions. We then describe the forward selection process by which candidate variables were chosen to be retained in the models, and how candidate covariance functions were selected to pair with each mean function. Next, we examined maps of salinity observations, predictions, and standard errors under five hydrologic scenarios, analyzed these results, and provided overall implications of the findings.
2. Methods and results
2.1 Data and notation
We used bottom salinity values measured by the North Carolina Division of Marine Fisheries (NC DMF) Pamlico Sound Trawl Survey Program 195 (the survey) every June and September from 1987 to 2006. The survey is conducted only in June and September each year. Designed to assess species abundance at depths over 2 m, the survey uses a weighted stratified random sampling design. For each time period, coordinates of stations are randomly generated within each of seven water body strata, with more stations allocated to larger strata, for a total of 54 stations per time period. Hereafter, we denote with the spatial domain that includes all points sampled within the seven strata mentioned over the entire 1987–2006 temporal domain. Figure 1 shows the geographic location of each sampling station in S. Salinity was measured using a YSI-85 multi-function meter at the beginning of each trawl and recorded along with depth and spatial reference coordinates. All spatial coordinates used in this analysis were converted from decimal degrees to northings and eastings in nautical miles (nmi) from a reference point (the origin in Figure 1) located southwest of at 34.6°N, −77.1°W. Salinity is always reported using the Practical Salinity Scale.
The temporal domain contains T = 40 time periods, or month/year combinations, indexed by the subscript , so that . A time period is approximately 2.5 weeks long, the time it takes to sample all stations. Since locations of the 54 stations sampled in each time period differed slightly, and since some data were missing in each time period, let represent the number of sites in time period t. Site refers to a specific spatial location nested within a particular time period and is indexed using the subscript where . The dataset included total observations of salinity, where . Denoted with observed salinity at site in time period .
The fresh water influx (FWI) data represented watersheds of the Neuse, Pamlico, Roanoke, and Chowan rivers, which comprise 80% of the land draining into PS . FWI observations were average daily river discharge rates collected by one US Geological Survey (USGS) gauge station per tributary (Figure 1): Neuse River (NR) station 02089500 in Kinston; Tar-Pamlico River (TPR) station 02083500 in Tarboro; Roanoke River (RR) station 02080500 in Roanoke Rapids; and Ahoskie Creek (AC) station 02053500 in Ahoskie, which gauges Chowan River inflow. Discharge rates in ft3/s for every day during the time domain (7305 days) were downloaded from the USGS Water Resources website for the state of North Carolina (USGS 2009) and were converted to m3/s. For each river, the gauge chosen was the furthest downstream gauge that recorded data over the entire temporal domain.
2.2 Candidate explanatory variables
The creation of explanatory variables reflects the modeling context—the objectives, the geographical features of the spatial domain, and the space-time coverage and resolution of the data—but the general thought process can be modified by other researchers in a different context. We index the term it as any variable that varies in both space and time, and with t any variable that varies over time but is constant over within a time period.
2.3 Freshwater influx indices
Sixty-one days is the average freshwater residence time of the four major rivers flowing into PS [30, 31, 32], accounting for the temporal lag between the upriver gauging of freshwater and the delivery of that water to S. Therefore, we defined the long-term metric for river and time period where , and as the average daily discharge rate in the 61 days prior to mt, the first day of the survey in time period t. Because Ramus et al.  calculated a seven-day residence time for the Neuse and Pamlico Rivers after Hurricanes Dennis and Floyd deposited 1 m of rainfall in eastern NC less than 2 weeks before the September 1999 survey, we defined the short-term metric , by averaging daily discharge rates in the 7 days prior to .
Since freshwater from river r in time period t should have more of an effect on the closer site i is to the river, a unique measure of the influence of and for each site was formed by dividing each by , , the distance separating the gauge on river from site within time period :
The coordinates of each gauge station were used to calculate distance because the gauge was the location of the and observations. Like all distances in this study, represents distance “as the crow flies” as opposed to water-path distance. Though the superiority of using water-path distance when modeling water-quality variables in stream and estuarine systems seems intuitive, results from studies that compare these two distance metrics are inconclusive. For example, Gardner et al.  found more accurate predictions of stream temperatures when models incorporated water-path distance, but only when this distance was modified and weighted by stream order. Peterson and Urquhart  predicted various nutrient concentrations in 17 Maryland rivers and concluded that using water-path distance works well when modeling certain nutrients, but not others, and that the crow-flies distance appeared to be the most suitable distance measure overall. Comparing the accuracy of predictions of water quality parameters generated from two different multiple linear regression models containing the explanatory variable “distance to inlet mouth”, Little et al.  found that predictions from models using water-path distance were no more accurate than those from models using crow-flies distance. None of these studies demonstrated marked predictive improvement using water-path distance, therefore we used crow-flies distance from each of the four river gauges to each of 2100 sample stations and over 6000 prediction locations.
The plot in Figure 2 of against Roanoke River typifies the relationships between salinity and each of the eight and variables. Larger values of the metric are associated with smaller values of salinity, but groups of observations have different slopes. Closer examination revealed that the different groups corresponded to different time periods. We attempted to account for the different slopes in two ways, first by considering the 28 pair-wise interactions among the andand second by considering 39 time-period indicator variables defined as
(A fortieth indicator variable was not used because it would create a non-full-rank design matrix, and the effect for the fortieth time period can be derived using the intercept.) This latter consideration led to the creation of two distinct mean function models: the process and time models. The first has process variables only, and the second has process variables in addition to the time-period indicator variables to address the possibility that salinity is affected by some aspect of physical phenomena that is not accounted for by any other variable in the model.
2.4 Saltwater mixing and tidal signal
Although salinity on the inner-continental shelf of the U.S. Southeast Atlantic coast exhibits some spatial variability near PS , we follow Xie et al.  and assume constant open ocean salinity. This assumption allows for modeling the effect of ocean water mixing as a function of only the distance to inlet, as opposed to distance interacting with the salinity of the ocean water, from each spatial location in the sound to each of the major PS inlets: Oregon, Hatteras, and Ocracoke. Exploratory analyses reveal that models using a single variable (distance to the nearest inlet) rather than three variables (distances to each of the three inlets), explains the same amount of variability in salinity when other explanatory variables are also included. Therefore, we consider for inclusion in subsequent models the variable , defined to be the distance separating site i, sampled in time period t, from the center of the most proximate inlet.
2.5 Wind speed and direction
A prevailing wind field that is north/northeast from March to August and south/southwest from September to February is the primary driver of currents in PS . Thus, wind speed and direction were incorporated into the modeling process using the categorical variable montht, where
is used to examine the effects of seasonal wind patterns on the spatial distribution of salinity.
2.6 Evaporation and direct precipitation
Holding other factors constant, sound-wide salinity in time periods that experience more evaporation of water from the surface of PS would likely be higher than those in time periods that experienced less evaporation, but no evaporation data were available for the space-time domain of interest. Salinity in time periods for which there was more direct precipitation into should be lower than those in lower-precipitation time periods, however precipitation data were only available at two weather stations on the edges of PS from which information about individual spatial locations within PS would be difficult to infer. Giese et al.  found that direct precipitation constitutes only 8% of mean PS freshwater input, thus the signal from riverine FWI should dominate in explaining salinity variability. Therefore, we did not include evaporation or direct precipitation variables in our models.
2.7 Spatial coordinates
Estuarine salinity varies over space such that functions of spatial coordinates might explain variability in salinity not accounted for by the other variables. Scatterplots of salinity versus easting and northing suggested that salinity is quadratic in the former and cubic in the latter. The quadratic function of easting can be explained by examining a west-to-east path through PS along the 35° 16′ N parallel (A in Figure 1): salinity should initially increase, reach a maximum at the saltwater plume near Ocracoke and Hatteras Inlets, and decrease again on the other side of the plume in the waters on the western shore of Hatteras Island near Buxton, NC. The cubic function of northing is best described by examining a north-to-south path along longitude of 75° 42′ W (B in Figure 1), where salinity should increase traveling south from Albemarle Sound, reach a local maximum near Oregon Inlet, decrease continuing past the saltwater inlet plume, and increase again as the Hatteras Inlet saltwater plume is reached. Thus, , , , , and the interactions , , , and are considered as explanatory variables. All coordinates are centered before they are squared or cubed by subtracting the mean over all observations.
Hurricanes can rapidly introduce large volumes of freshwater to estuaries via riverine influx, push large volumes of saltwater in through inlets via storm surge, and alter circulation patterns through abrupt changes in wind speed and direction [7, 10]. Hurricanes can also open new inlets to PS, which can alter current flow and increase saltwater intrusion . The variable should capture variability in salinity due to hurricane-produced FWI. Three additional variables may account for non-FWI-related variability in salinity due to hurricane passage. These variables are unique to a given time period t but are constant over all sites within t. The continuous variable is the reciprocal of the number of days between the most recent hurricane and , except when there is no hurricane within the 61 days, and then it takes the value zero. The categorical variable equals the category of the most recent hurricane rated on the Stafford-Simpson scale (1, …, 5), but if no hurricane made landfall in the 61 days prior to mt, it takes the value zero. Finally, the discrete variable num_stormst equals the number of hurricanes making landfall in NC in the 61 days prior to .
2.9 Variable selection
Section 3 identifies 46 candidate explanatory variables for the process model mean function: and (8), plus selected pair-wise interactions (explained below) (24); spatial coordinates, their powers, and specified interactions (9); ; ; and hurricane variables , , and num_stormst. For the time model, there were an additional 39 time period indicator variables. Some variables—in either model—may be redundant. There is overlap among the hurricane variables, and spatial coordinates may not be necessary if other variables explain more variability in salinity. The set of variables included in the final model(s) should balance goodness-of-fit with parsimony. We first describe the variable-selection process for the process model, then for the time model.
2.10 Process model
The results of eight separate ordinary least squares linear regression models of salinity make up the rows Table 1. The first five consist of an intercept and a single explanatory variable: , , , num_stormst, and . The sixth and seventh contain an intercept plus, respectively, the sets of four short and long-term freshwater influx indices , and . We treated the short and long-term sets of indices as groups assuming that if an index evaluated for one river is meaningful, then it is also meaningful for other rivers. We discuss the eighth row in Section 4.2.
Adjusted R2 is a modification of R2 that penalizes the number of explanatory variables. While R2 increases as more variables are added to a model, adjusted R2 increases only if the added variable decreases the error sum of squares enough to offset the loss in error degrees of freedom.
The model with the long-term freshwater influx indices had the largest adjusted R2 at 0.38, followed by the model with the distance from the nearest inlet (0.34), and the model with the short-term FWI indices (0.27). None of the other four models explained more than 5% of the variability in salinity. We chose the model with the long-term freshwater influx indices as the base upon which to build the mean function.
To this base model we added the variable since the model containing this variable had the second-best performance, thus beginning a forward-selection process. Each time we added a variable or set of variables to the model, we kept it in the model if the new adjusted R2 exceeded the old. Variables from the seven initial models were then added in order of decreasing adjusted R2. Following this procedure, the mean trend model grew to contain 10 variables—, , , and —with adjusted R2 0.57.
Because the effect of FWI from one river on a given location in PS could change based on the FWI from another river during the same time period, we evaluated the addition of the 6 pair-wise interactions among the four , the 6 pair-wise interactions among the four , and the twelve interactions between the and the , excluding interactions of one river’s with its own . Despite a decrease in error degrees of freedom by 24, adjusted R2 was 0.66, so the set was retained.
Spatial coordinate variables were evaluated last in groups according to their polynomial order, with squared and cubic terms added before interactions. We considered these variables last because we wanted to include them only if they explained additional variability in the response after more interpretable variables were included. We determined that including all variables except increased the adjusted R2. The final process model mean function thus had an adjusted R2 of 0.73 and included the following: ; ; ; ; ; ; ; , , , , and interactions , , and .
2.11 Time model
To build the time model, we followed the same procedure described above, selecting for the base of the mean function a set of time period indicator variables because a linear regression of on these variables had an adjusted R2 of 0.41 (Table 1). (Note that such a model is equivalent to fitting an ANOVA model using the time periods as groups.) Again, we added other sets of explanatory variables in order of decreasing adjusted R2. Before evaluating interactions, the mean trend time model had an adjusted R2 of 0.78 and contained 48 variables: , , , and . When interactions among the and thewere added, the model was not full rank (not all columns in the design matrix were linearly independent). Because we created this second model to evaluate these interactions, we removed the , the most recent variable addition, to include them. This new model, including the interactions, became the base since its adjusted R2 (0.89) was larger than that of the previous mean trend time model (0.78). After investigating spatial coordinate variables, the final mean trend time model (below) had an adjusted R2 of 0.91 and included 204 variables: , , , , , ,, and . To avoid confusion later, note that the adjusted R2 of 0.73 for the process model and 0.91 for the time model were based on fitting each model to the full dataset. In the next section, we report R2 (not adjusted R2) based on a cross-validation dataset.
2.12 Modeling spatially correlated error
The variable selection analyses above used ordinary least squares (OLS) regression to model salinity as a function of explanatory variables. That model can be written as
where represents the value of the explanatory variable at space–time location , for , where is the total number of explanatory variables. represent the intercept and regression coefficients, and deviations from the mean trend are assumed to be independent and identically distributed with mean 0 and variance . The model can be equivalently written as , where is the vector containing the values of the explanatory variables at space-time location , and represents the vector of regression coefficients. The same model written in matrix form is
where bold print indicates vectors so that , ε, and 0 are vectors containing, respectively, all observations of salinity in the space–time domain, all deviations from the mean function, and all zeros. is the design matrix whose rows represent space-time locations and whose columns contain the values of the explanatory variables (with a column of ones for the intercept), and is the identity matrix. Since a histogram of salinity observations is somewhat symmetric and bell-shaped, use of the normal distribution is justified.
Rarely, however, does the assumption of independent and identically distributed errors hold for observations of natural phenomena associated with locations in space and time. While it is intuitive that values of salinity located close together in space should be similar, it is also generally the case that the deviations from the mean function of observations located close together are similar. That similarity is referred to as spatial covariance, and the spatial covariance between deviations from the mean trend at two locations within the same time period can be modeled as a function of the distance separating them. Including in the overall model both a deterministic mean function and a spatial covariance function allowed predictions of salinity at locations where there were no observations.
Valid covariance functions ensure that the covariance matrix will be positive definite, which, in turn, ensures that variances will be non-negative. Each covariance function has a shape defined by a range parameter, a partial sill, and sometimes a nugget effect. Appendix Table A1 gives formulas for determining spatial covariance according to the exponential, Gaussian, and spherical covariance functions, each with and without a nugget effect. Figure 3 shows an example of the spherical covariance function—the solid red line—fit to a sample covariogram—the blue dots—of deviations from the process model for June 1994. The range parameter—for the exponential and Gaussian covariance functions, for spherical—is related to the distance that must separate two sites before their deviations are independent, where independence corresponds to a covariance of zero or virtually zero. In Figure 3, the range is approximately 10 km. In the absence of a nugget effect, the partial sill is the value of the covariance at distance zero—that is, it is the variance of deviations from the mean—and in Figure 3 this value is approximately 2.5 squared units of salinity. In the presence of the nugget , there is a discontinuity in the covariance function at distance zero, so that the intercept is slightly greater than the limit of the smooth part of the function as distance approaches zero. In this case, the variance of the deviations is equal to the sum of the partial sill and nugget: . It may be the case that variance is higher when values of deviations are higher. Since covariance parameters represent physical quantities that may change over time, we used the capabilities of SAS® Proc Mixed to allow a different partial sill and range parameter for each time period.
Model (3), modified to include spatial correlation, becomes
where represents the block-diagonal covariance matrix
where zero matrices for off-diagonal elements indicate that deviations in one time period are not correlated with those in another. We make this assumption partially due to the long time span separating June and September, but also because no SAS® procedure has the capacity to model such space-time correlation while at the same time allowing every time period to have different spatial covariance parameters and allowing a mean function to be fit. Diagonal elements are individual spatial covariance matrices for each time period with dimensions , and elements representing the spatial covariance between sites i and j in time period t.
Understanding how predictions of salinity and prediction standard errors are generated from this model will make the results and analysis in Sections 6 and 7 easier to understand. To predict salinity at space–time locations where it is not observed, the following results are needed. Superscripts differentiate between locations where salinity is observed and unobserved. Model (4), represents observations of salinity (by virtue of the dimensions of the vectors and matrices), but we model salinity observations and unobserved values of salinity at other space-time locations using a similar model, the joint distribution of unobserved and observed salinity, given by
Here, represents the vector of salinity observations, and, letting represent the number of space-time locations at which we want to predict salinity, represents the vector of unknown values of salinity at these locations. All the symbols in (5) have the same meaning as in (4), except for the distinction between observed and unobserved locations. The matrix contains the cross-covariance between observed and unobserved locations. Thus,
and . The elements also come from the spatial covariance function.
Let represent the vector that contains all spatial covariance parameters for every time period—either 80 or 81 parameters depending on whether a nugget effect is used. Standard normal distribution theory gives the distribution of unobserved salinity conditioned on knowing the values of observations and all of the parameter values:
The pipe symbol (|) means “given” or “conditioned on knowing the values of” the terms following the pipe symbol. The terms before the comma represent the mean of the multivariate normal distribution, which is used for the salinity prediction, and the terms after the comma represent the variance-covariance matrix, which is used for prediction standard errors. Salinity predictions are the sum of the mean trend, , and the spatial interpolation of observation deviations from the mean trend, . If the deviations are large for a given time period, then the partial sill will be large for that time period, so that diagonal elements of and will be large. For a given location, the prediction standard error is the diagonal element of the matrix . If the diagonal elements of and are large, then the diagonal elements of are small, and the prediction standard error is a large number minus a small number. That is, the prediction standard error will be high for time periods in which observation deviations from the mean function are large. When observation deviations from the mean trend are small, the reverse is true, and prediction standard errors tend to be low for that time period.
The salinity predictor
is an exact predictor: the prediction of salinity at a site where there is an observation will exactly equal the observation. For this reason, to determine which spatial covariance function to use, we randomly selected 10% of the observations to withhold as a cross-validation dataset, the test dataset; the remaining 90% we term the base dataset. For every combination of the two mean functions—process and time—and the six spatial covariance functions in Appendix Table A1, we fit model (4) to the base dataset, and predicted salinity values at the space–time locations of the test dataset using the results given in (5) and (6). When the model predicted salinity to be less than zero, we set the prediction equal to zero before calculating the following statistics. Predictions of negative values could be avoided using a truncated normal distribution, but SAS® Proc Mixed does not permit specification of this distribution. The root mean squared error (RMSE) of predictions—with the same units as salinity—are given in Table 2, along with the slope, intercept, and coefficient of determination (R2) from a regression of actual salinity values in the test dataset on predictions of them. If predictions were perfect, this regression would have slope equal to one, intercept equal to zero, and R2 equal to 1.
Salinity predictions are better when a spatial covariance function is combined with either mean function. For example, of the time models, the exponential covariance function with a nugget produced predictions with the lowest RMSE (2.1), slope closest to one (0.92), and intercept closest to zero (1.55). Comparing process models, the exponential and spherical, each with and without a nugget, performed equally well, and better than the time models. To select the best model from this group of four, we examined statistics based on how well the model fit the base dataset. The model with an exponential covariance function with a nugget had the lowest AIC (7580.0) and BIC (7711.7) and was thus chosen as the final model. It explained 89% of variability in the test dataset and generated predictions with RMSE 2.0.
Next, we fit this model using the full dataset, and produced retrospective maps of salinity predictions and standard errors at evenly spaced 1 nmi (1.85 km) increments for each time period. Forty-two salinity predictions—less than 0.1% of the total number of predictions—were negative and set to zero.
2.13 Examining freshwater influx scenarios
To examine variations in the spatial distribution of salinity under drought, average, and flood conditions, we classified freshwater influx from each river within each time period (and ) as LOW if it fell below the 25th percentile of observed FWI across all time periods, MODERATE if it fell between the 25th and 75th percentiles, HIGH if it fell between the 75th and 95th percentiles, and FLOOD if it fell above the 95th percentile. Next, we classified one-week and two-month FWI for the entire time period as LOW (or HIGH) if at least two rivers exhibited low (or high) inflow, MODERATE if at least three rivers exhibited moderate inflow, and FLOOD if at least one river exhibited extremely high (>95th percentile) inflow. These classifications are mutually exclusive, though some of the 40 time periods did not fall into any of them. The first two columns of Table 3 list the 16 combinations of classifications, and the third column shows the classification and salinity rank for each time period. Time periods were ranked 1–40 by mean predicted salinity (1 = highest mean salinity; 40 = lowest).
Moderate-to-moderate FWI. June 2005 (Figure 4) experienced moderate FWI in both the 2 months and 1 week prior to the survey in PS with predicted salinity ranked 37th—the lowest of the moderate-to-moderate time periods. Legend colors for model predictions in the left pane and observations in the upper right pane of Figure 4 (as well as Figure 5A and B,6A and B) are based on percentiles of the distribution of observed salinity across all time periods: minimum to 5%; 5–10%; 10–25%; 25–50%; 50–75%; 75–90%; 90–95%; and 95% to maximum. From the left pane of Figure 4, predicted salinity in June 2005 increased moving east across PS, reaching a maximum just south of Oregon Inlet. We note the same east-west salinity gradient when comparing this pane to the June 2005 map of observed salinities (top right pane), indicating that prediction maps typically mirror trends seen in observation maps. The area of highest predicted salinity corresponds to a lone purple observation of 26.5 just south of Oregon Inlet (Figure 4). Plumes of relatively higher salinity are evident in the vicinity of all three ocean inlets (Figure 4).
The lower right pane of Figure 4 (as well as Figure 5A and B,6A and B) displays prediction standard errors (SE) with the same units as salinity. The same eight percentile groups classify colors on the SE legend, here based on the distribution of prediction standard errors across all time periods. The transition from low SE at sample sites to higher SE moving away from sample sites reflects the fact that the exact predictor (6) reproduces observations, so confidence intervals closer to sample sites are narrower than those further away.
This spatial trend in SEs is further illustrated by comparing locations of high SE in the same time period, which are also consistent over time. High SEs occur between the mouths of the Neuse and Pamlico Rivers and along a margin of varying width following the outline of the Outer Banks, areas within which sampling does not occur (Figure 1). We note here that because SEs increase as distance from sample site increases, we chose to generate only interpolated (and not extrapolated) salinity predictions. In June 2005, as in all other time periods, predictions were generated only for locations within S, which does not extend either to Albemarle Sound or to the heads of the Neuse and Pamlico Rivers (Figure 4).
Low to low FWI in early and late-stage drought. June 1999 (Figure 5A) and June 2002 (Figure 5B)—which mark early and late stages of North Carolina’s 1998–2002 drought —experienced low long- and short-term FWI with predicted salinity ranking 12th and 4th, respectively. At every point in PS, predicted salinity in these two time periods was higher than in June 2005, and predicted salinity was much higher in June 2002 than June 1999, though both have similar values for and variables from three of the four tributary rivers. The difference may be due to (1) the fact that in the fourth river, the Roanoke, and in June 1999 were twice their values in June 2002, or (2) that by June 2002, NC had been experiencing drought conditions for 4 years (186 weeks) as opposed to less than one (30 weeks) and that this cumulative FWI deficit became more pronounced over time.
Though June 2002 salinity observations have a larger mean and greater variability, the majority of prediction standard errors are less than 1.01. In June 1999, however, SEs fell between 1.01 and 1.81 at all prediction locations except those that were very close to observations. This result shows that the conditions affecting salinity in PS were better represented by the mean function in June 2002 than they were in June 1999.
Flood to flood FWI—with and without hurricanes. FWI was extremely high in September 1999 (Figure 5A) as a result of the 500-year floods produced by Hurricanes Dennis and Floyd that occurred 24 and 12 days before the survey, respectively. In June 2003 (Figure 5B), extremely high FWI was due to an eight-month period of above-average precipitation totals prior to the survey. Though these are the only two time periods categorized as flood-to-flood, predicted salinity in September 1999 ranks a surprisingly high 30th, while in June 2003 it ranks 40th. Observed and predicted salinity for these two time periods are lower than those in the low-FWI time periods of June 1999 and 2002, but in September 1999, salinity was higher at most prediction locations, and more variable, than in moderate-FWI June 2005. Water at locations near the two southerly inlets to PS was more saline in September 1999 than in these same locations during moderate-FWI of June 2005 likely due to storm surge-generated inlet plumes. Salinity at locations near the Neuse and Tar-Pamlico Rivers was similar to that in June 2005. Standard errors were lower sound-wide in June 2003 than in September 1999. SEs in September 1999 were highest sound-wide relative to the other four time periods examined (Figures 5 and 6).
Because water exchange between lagoonal estuaries and the open ocean can be relatively restricted, there is a relatively high potential in systems like PS for changes in precipitation patterns and storm frequencies associated with global climate change to result in changes in salinity patterns and subsequent ecosystem alterations. Changes in precipitation will affect the amount and timing of river flow, which will impact nutrient cycling, estuarine flushing rates, and salinity. Increased storm activity may open new inlets, which would alter current flow, increase tidal action, and allow a greater influx of seawater that carries with it both different chemical signals and mobile species. Salinity is therefore a practical estuarine characteristic to use to study the impacts of these changes, as both effects mentioned above include enhanced water exchange that impacts overall estuarine salinity content [43, 44].
We developed and evaluated two statistical models, using the best model to hindcast salinity in PS. The process mean function combined with the exponential covariance with a nugget explained 89% of the variability in a test dataset with a RMSE of 2.0 and produced relatively accurate retrospective salinity maps under a wide range of freshwater influx and system-state scenarios. Much of this accuracy was due to allowing the range and partial sill parameters of the spatial covariance to be time-period specific. We then examined variations in the spatial distribution of salinity under varying freshwater influx (FWI) conditions such as drought, average FWI, and flood conditions, and identified the following patterns. In years with moderate FWI, the salinity gradient increased from west to east in PS as expected, and was highest adjacent to the major inlets, with highest salinities near Oregon Inlet. In years with low FWI indicative of drought conditions, the overall mean and variance in salinity increased in PS. In years with floods, salinities displayed a high degree of spatial variation, with salinities being lower near the tributaries as expected, yet also displaying occasional sharp increases in salinity near inlets due to influx of ocean water into PS via the major inlets.
3.1 Improvements to model predictions
For retrospective prediction purposes, model improvements could focus on improvements to the mean trend, the covariance, or both, and such improvements could be evaluated using the test dataset. A reasonable goal might be to increase R2 to 0.93 or to reduce RMSE to 1.5. Improvements for the purpose of prospective prediction of salinity under hypothetical, unobserved conditions, a situation in which spatial covariance among observation deviations cannot be used, would entail improving the mean function exclusively. Locations and time periods with high SEs highlight conditions not well-represented by the current mean function. A reasonable goal here would be to produce a model for which all values of SE fall beneath the current median (1.32).
Mean function. The mean function alone explained over two-thirds of the variability in salinity in both process and time models. While this is a noteworthy accomplishment, there remains room for multiple improvements. High SE values in Figure 5A show that the mean function is unable to capture the interaction between high FWI in September 1999 and hurricane storm surges. One hurricane explanatory variable, inverse_days_surveyt, remained in the final process model. Its parameter estimate was positive, reflecting that strong hurricane winds push more saltwater into PS through inlets than would enter under typical seasonal wind conditions, but alone it explained only 4% of salinity variability in the full dataset. The inverse_days_surveyt, variable did not differentiate between a year in which a single hurricane passed within 12 days of the survey and a year in which such a hurricane followed another that passed 12 days earlier. A future effort might attempt to account for cumulative build-up of storm surge on observed PS salinities.
Though alone explained a third of the variability in salinity over all time periods, variability in inlet-plume size across Figures 3–5 suggests that this distance metric should be modified based on wind speed and direction, using more finely resolved wind information than the variable. Devising a way to use the u and v components of wind to interact with could allow both the size and the direction of the inlet plume to vary such that east-to-west winds create different plume sizes and shapes than winds from the southeast-to-northwest. Considerable exploratory analysis would be needed to determine what pre-survey time lag should be considered to affect observed survey salinities.
Differences in both salinity values and SE estimates between early-stage drought during June 1999 and late-stage drought during June 2002 suggest accounting for effects of FWI over a longer duration than 61 days. Doing so might explain differences in salinity patterns seen in time periods with similar one-week and two-month FWI conditions. Molina  calculated an 11 month mean residence time for freshwater in PS. We could incorporate this effect by adding a third freshwater influx index to the mean function or by adding an autoregressive component to the model so that salinity in a given time period was a function of mean salinity in the previous time period. The first option would be tedious from a data-manipulation standpoint, but much easier from a mathematical model-fitting standpoint, because SAS® Proc Mixed could still be used. The second option necessitates a change in the covariance function, as we can no longer assume that salinity deviations from the mean function at a given space-time point were independent in time. This second option would also require specialized hand-written code, as no current SAS® Proc allows such a dynamic space-time model to be fit.
Differences in salinity patterns between June 1999 and June 2002, our two low-to-low FWI time periods, could be attributed to differences in FWI from the Roanoke River, one of the two northern rivers whose connection to PS is indirect. This observation warrants further investigation into the calculation of the FWII indices; namely, an investigation of water-path distance as a possible substitute for crow-flies distance between river gauges and sites in PS. Although we did not find a study that demonstrated marked predictive improvement using water-path distance under all circumstances ([36, 46], and others), it would be interesting in future work to compare differences in PS salinity predictions using both distance methods. Recall that Gardner et al.  noted more accurate predictions of stream temperatures when models incorporated water-path distance, but only when this distance was further modified and weighted by stream order. It might be the case that water-path distance out-performs crow-flies distance in predicting estuarine salinity when care is taken to make all explanatory variables as meaningful as possible. Development of an automated procedure for calculating water-path distances similar to the one used in  would make such an investigation more practically feasible.
Covariance function. Two mutually-exclusive improvements to the covariance function, as implemented in SAS® Proc Mixed, could be investigated: using either the Matern covariance function or an anisotropic covariance function to achieve greater flexibility in each time period. The Matern covariance function has a smoothing parameter in addition to partial sill and range parameters. When the smoothing parameter takes the value of 0.5, the Matern covariance function is the same as the exponential covariance function—as the smoothness parameter approaches infinity, the covariance function approaches the Gaussian covariance function. Using the Matern covariance function is thus equivalent to allowing a third parameter to determine which two-parameter covariance function is appropriate, as opposed to using the same two-parameter covariance function for every time period. The computational cost of this flexibility is high—in a similar model with only four separate groups of covariance parameters, compared to the 40 groups in this paper—co-author Amy Nail experienced computation time of 2 h (versus a 2 min run time using the two-parameter exponential covariance function here). The added computational burden is due to the complex nature of the Matern covariance function and to the necessity of estimating one additional covariance parameter per time period (for a total of 40 additional parameters).
Another way to achieve flexibility while still specifying a single covariance function for every time period, would be to allow an anisotropic covariance function. Geometric anisotropy allows for different range parameters in different directions. For example, if the water current in PS were flowing directly north-to-south, two points separated by a north-to-south vector might have more similar values of salinity than would two points separated by a west-to-east vector of the same length. Fortunately, the parameterization of a geometric anisotropic covariance function is such that if anisotropy were unnecessary, the parameters would take values that effectively result in an isotropic covariance function. The cost of this added flexibility is the need to estimate two additional covariance parameters per time period, for a total of 80 additional parameters. Computation time might be less here than for Matern, since anisotropic covariance functional forms are less complex.
We created a statistical model combining a process mean function with an exponential spatial covariance function with a nugget to predict salinity in a lagoonal estuary. This model can generate predictions of bottom salinity for Pamlico Sound, NC that are more spatially-resolute than any previous bottom salinity predictions encountered in the literature for this system. The salinity maps produced using the model are useful for researchers to build an intuitive understanding of salinity dynamics under PS conditions covered by these 40 time periods. Salinity predictions can also be used to inform future analyses including, but not limited to, the examination of historical distribution patterns of estuarine species relative to salinity variability and the prediction of salinity changes under various global climate change scenarios.
We thank the North Carolina Division of Marine Fisheries and the United States Geological Survey for providing datasets used in this study. We also thank editor A. Manning for helpful comments that improved the manuscript. Funding for this project was provided by the Environmental Defense Fund (Program Manager Pam Baker), North Carolina Coastal Recreational Fishing License Program (Grant No. 2010-H-004), North Carolina Sea Grant (R12-HCE-2) and the National Science Foundation (OCE-1155609) to D. Eggleston. A. Nail was supported as a VIGRE Postdoctoral Fellow by NSF grant DMS 0354189.
See Table A1.
|Name of covariance function|
|With nugget effect||Without nugget effect|
|Note: For all models,, , andis the distance separating sitesiand.|
Note: I(statement) = 1 if statement is true and 0 otherwise.
|Explanatory variable or set of explanatory variables||Adj R2|
|Model type||−2 log likelihood||AIC||BIC||RMSE (psu)||Slope/β1||Intercept/β0||R2|
|Exponential + σ2n*||7424.0||7580.0||7711.7||2.0||0.96||0.96||0.89|
|Gaussian + σ2n*||7532.0||7686.0||7816.0||2.1||0.94||1.15||0.87|
|Spherical + σ2n*||7571.6||7727.6||7859.3||2.0||0.96||0.93||0.89|
|Exponential + σ2n||6217.1||6367.1||6493.7||2.1||0.92*||1.55*||0.87|
|Gaussian + σ2n*||6214.0||6366.0||6494.4||2.2||0.91*||1.90*||0.86|
|Spherical + σ2n||6201.3||6357.3||6489.1||2.2||0.91*||1.86*||0.86|
|2mo_FWIrt||1wk_FWIrt||Time periods and mean predicted salinity rank (mmyy, r)|
|Flood||Flood||(0603, 40), (0999*, 30)|
|Moderate||(0687, 28), (0689, 27)|
|High||Flood||(0903*, 39), (0690, 29)|
|Moderate||(0698, 38), (0693, 36), (0697, 35)|
|High||(0696, 26), (0900, 24)|
|Moderate||(0605, 37), (0989, 31), (0601, 26), (0600, 22), (0604, 21), (0688, 16), (0990, 13), (0692, 10)|
|Low||(0997, 15), (0699, 12), (0901, 8), (0902, 7), (0993, 5), (0602, 4), (0988, 3), (0994, 1)|