## 1. Introduction

Previous studies have documented various QC tools for use with weather data (26; 4; 6; 25; 9; 3; 10; 16; 18). As a result, there has been good progress in the automated QC of weather indices, especially the daily maximum/minimum air temperature. The QC of precipitation is more difficult than that of temperature because confidence in identifying outliers depends on the spatial and temporal variability of the variable (2). Another approach to maintaining data quality is to intercompare redundant measurements taken at a site. For example, the designers of the United States Climate Reference Network (USCRN) specified a rain gauge with multiple vibrating wires in order to avoid a single point of failure in the measurement process. The three vibrating wires can be compared to determine whether the outputs agree, and any outlying value can trigger a site visit. USCRN also includes three temperature sensors at each site for the same purpose.

Identifying outliers generally involves tests designed to work on data from a single site (9) or tests that compare a station's data against the data from neighboring stations (16). Statistical decisions play a large role in quality control efforts, but increasingly rules are introduced which depend upon the physical system involved. Examples are the testing of hourly solar radiation against the clear-sky envelope (Allen, 1996; Geiger et al., 2002) and the use of soil heat diffusion theory to determine soil temperature validity (Hu et al., 2002). It is now recognized that quality assurance (QA) works best as a seamless process between staff operating the quality control software at a centralized location where data are ingested and technicians responsible for maintenance of sensors in the field (16; 10).

Quality assurance software consists of procedures or rules against which data are tested. Each procedure will either accept the data as being true or reject the data and label it as an outlier. This hypothesis (Ho) testing of the data and the statistical decision to accept the data or to note it as an outlier can have the outcomes shown in Table 1:

The columns give the true situation:

| Statistical Decision | Ho True | Ho False |
| --- | --- | --- |
| Accept Ho | No error | Type II error |
| Reject Ho | Type I error | No error |

Take the simple case of testing a variable against limits. If we take as our hypothesis that the data for a measured variable are valid only if they lie within ±3σ of the mean (X), then assuming a normal distribution we expect to accept Ho 99.73% of the time in the absence of errors. Values beyond X±3σ will be rejected, and we make a Type I error whenever we encounter valid values beyond these limits. In these cases we reject Ho when the value is actually valid, so we expect to make a Type I error 0.27% of the time, assuming for this discussion that the data contain no errant values. If we encounter a bad value inside the limits X±3σ, we accept it when it is actually false (the value is not valid), which is a Type II error. In this simple example, tightening the limits produces more Type I errors and fewer Type II errors, while widening the limits leads to fewer Type I errors and more Type II errors. For quality assurance software, study is necessary to achieve a balance wherein one reduces the Type II errors (marks more "errant" data as having failed the test) while not increasing Type I errors to the point where valid extremes are brought into question. Because Type I errors cannot be avoided, it is prudent for data managers to always keep the original measured values regardless of the quality testing results, and to offer users a means of specifying the limits ±*f*σ beyond which the data will be marked as potential outliers.
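The tradeoff can be illustrated numerically. The sketch below uses synthetic, normally distributed data (not any network's actual records) and counts the fraction of clean observations flagged at several choices of *f*; with no seeded errors, every flag is a Type I error:

```python
import random
import statistics

random.seed(42)

# Synthetic "clean" record: 100,000 normally distributed observations.
data = [random.gauss(20.0, 5.0) for _ in range(100_000)]

mean = statistics.fmean(data)
sigma = statistics.stdev(data)

rates = {}
for f in (2.0, 2.5, 3.0, 3.5):
    lo, hi = mean - f * sigma, mean + f * sigma
    flagged = sum(1 for x in data if x < lo or x > hi)
    rates[f] = flagged / len(data)
    # With no seeded errors, every flag on clean data is a Type I error.
    print(f"f = {f}: {100 * rates[f]:.2f}% of clean data flagged (all Type I)")
```

For a normal variable, tightening *f* from 3.5 to 2.0 raises the flagged fraction from well under 0.1% to several percent, which is the Type I / Type II tradeoff described above.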

In this chapter we point to three major contributions. The first is the explicit treatment of Type I and Type II errors in the evaluation of the performance of quality control procedures to provide a basis for comparison of procedures. The second is to illustrate how the selection of parameters in the quality control process can be tailored to individual needs in regions or sub-regions of a wide-spread network. Finally, we introduce a new spatial regression test (SRT) which uses a subset of the neighboring stations to provide the “best fit” to the target station. This spatial regression weighted procedure produces non-biased estimates with characteristics which make it possible to specify statistical confidence intervals for testing data at the target station.

## 2. A Dataset with seeded errors

A dataset consisting of original data and seeded errors (18) is used to evaluate the performance of the different QC approaches for temperature and precipitation. The QC procedures can be tracked to determine the number of seeded errors that are identified. The ratio of errors identified by a QC procedure to the total number of errors seeded is a metric that can be compared across the range of error magnitudes introduced. The data used to create the seeded-error dataset came from the U.S. Cooperative Observer Network as archived at the National Climatic Data Center (NCDC). We used the Applied Climate Information System (ACIS) to access stations with daily data available for all months from 1971 to 2000 (see 24). The data have been assessed using NCDC procedures and are referred to as "clean" data. Note, however, that "clean" does not necessarily imply that the data are true values but, rather, that the largest outliers have been removed.

About 2% of all observations were selected on a random basis to be seeded with an error. The magnitude of the error was also determined in a random manner. A random number, *r*, was selected using a random number generator operating on a uniform distribution with a mean of zero and a range of ±3.5. This number was then multiplied by the standard deviation (σ_{x}) of the variable in question to obtain the error magnitude *E* for the randomly selected observation *x*:

*E*_{x} = *r*·σ_{x}

The variable *r* is not used when the error would produce negative precipitation, i.e. when (*E*_{x} + *x*) < 0. Thus the seeded error distribution is skewed when *r* < 0 but roughly uniform when *r* > 0. The selection of 3.5 for the range is arbitrary but serves to produce a large range of errors (±3.5σ_{x}). This approach to producing a seeded dataset is used below in some of the comparisons.
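The seeding procedure can be sketched as follows. The function name, the 2% default fraction, and the redraw-on-negative handling are our illustrative reading of the text, not code from the cited study:

```python
import random

random.seed(7)

def seed_errors(obs, sigma_x, frac=0.02, r_range=3.5, non_negative=False):
    """Seed random errors E = r * sigma_x into a fraction of observations.

    r is drawn from a uniform distribution on [-r_range, +r_range].  For a
    non-negative variable such as precipitation, a draw that would push the
    seeded value below zero is redrawn -- one possible reading of the rule
    that r "is not used" when (E + x) < 0.
    Returns the seeded series and a parallel list marking seeded positions.
    """
    seeded = list(obs)
    flags = [False] * len(obs)
    for i, x in enumerate(obs):
        if random.random() < frac:
            e = random.uniform(-r_range, r_range) * sigma_x
            while non_negative and x + e < 0:   # would give negative precipitation
                e = random.uniform(-r_range, r_range) * sigma_x
            seeded[i] = x + e
            flags[i] = True
    return seeded, flags
```

The returned flag list records where errors were planted, so the fraction of seeded errors recovered by a QC test can be computed directly.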

## 3. The spatial regression test (SRT) and inverse distance weighted (IDW) estimates

When checking data from a site, missing values are sometimes present. For modeling and other purposes where continuous data are required, an estimate is needed for the missing value. We will refer to the station which is missing the data as the target station. The IDW method has been used to make estimates (x’) at the target stations from surrounding observations (x_{i}).

x' = Σ_{i} f(d_{i})·x_{i} / Σ_{i} f(d_{i}) (1)

where d_{i} is the distance from the target station to each of the nearby stations and f(d_{i}) is a function of d_{i} (in our case we took f(d_{i}) = 1/d_{i}). This approach assumes that the nearest stations will be most representative of the target site.
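A minimal sketch of the IDW estimate in plain Python (the function name is ours); with f(d_{i}) = 1/d_{i} the nearest stations dominate the weighted average:

```python
def idw_estimate(neighbors, distances):
    """Inverse-distance-weighted estimate x' at the target station.

    neighbors: observed values x_i at nearby stations
    distances: distances d_i from the target to each station
    Weight function f(d_i) = 1/d_i, as in the text.
    """
    weights = [1.0 / d for d in distances]
    return sum(w * x for w, x in zip(weights, neighbors)) / sum(weights)

# The 5 km neighbor dominates, pulling the estimate toward 21.0:
est = idw_estimate([21.0, 24.0, 18.0], [5.0, 20.0, 40.0])
print(est)
```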

*Spatial Regression (SRT)* is a new method that provides an estimate for the target station and can be used to check that the observation (when not missing) falls inside the confidence interval formed from N estimates based on N "best fits" between the target station and neighboring stations during a time period of length *n*. The surrounding stations are selected by specifying a radius around the station and finding those stations with the closest statistical agreement to the target station. Additional requirements for station selection are that the variable to be tested is one of the variables measured at the target site and that the data for that variable span the period to be tested. A station that otherwise qualifies is eliminated from consideration if more than half of its data are missing for the time span (e.g. more than 12 missing days where n = 24). First, non-biased preliminary estimates are formed at time *t* from the linear regression between the target station and each surrounding station.

The approach obtains an un-biased estimate (*x’*) by utilizing the standard error of estimate (*s*) for each of the linear regressions in the weighting process.

The surrounding stations are ranked according to the magnitude of the standard error of estimate and the N stations with the lowest s values are used in the weighting process:

This approach provides more weight to the stations that are the best estimators of the target station. Because the stations used in (4) are a subset of the neighboring stations, the estimate is not an areal average but a spatial regression weighted estimate.

The approach differs from inverse distance weighting in that the standard error of estimate has a statistical distribution; therefore confidence intervals can be calculated on the basis of *s'*, and the station value (*x*) can be tested to determine whether or not it falls within the confidence intervals.

If the above relationship holds, then the datum passes the spatial test. This relationship indicates that with successively larger values of *f*, the number of potential Type I errors decreases. Unlike distance weighting techniques, this approach does not assume that the best station to compare against is the closest station, but instead uses the relationships between the actual station data to decide which stations should be used to make the estimates and what weight these stations should receive. An example of the estimates obtained from the SRT is given in Table 2.
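The SRT steps above can be sketched as follows. The regression of the target on each neighbor, the ranking by standard error of estimate, and the 1/s_{i}² weighting follow the text; the specific form used here for the weighted standard error s' (s'² = N / Σ 1/s_{i}²) is an assumption for illustration, as are all function names:

```python
import statistics

def linreg(xs, ys):
    """Least-squares fit ys ≈ a*xs + b; returns (a, b, s) with s the
    standard error of estimate (root mean square residual, n-2 dof)."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return a, b, (sse / (n - 2)) ** 0.5

def srt_estimate(target_hist, neighbor_hists, neighbor_today, N=4):
    """Regress the target on each neighbor over the n-day period, keep the
    N neighbors with the lowest standard error of estimate s_i, and weight
    their preliminary estimates x'_i by 1/s_i**2 (assumes every s_i > 0)."""
    fits = []
    for hist, today in zip(neighbor_hists, neighbor_today):
        a, b, s = linreg(hist, target_hist)
        fits.append((s, a * today + b))          # (s_i, x'_i)
    fits.sort(key=lambda t: t[0])                # best-fitting neighbors first
    fits = fits[:N]
    wsum = sum(1.0 / s ** 2 for s, _ in fits)
    x_est = sum(xp / s ** 2 for s, xp in fits) / wsum
    s_prime = (len(fits) / wsum) ** 0.5          # assumed form of weighted s'
    return x_est, s_prime

def srt_pass(x, x_est, s_prime, f=3.0):
    """Datum passes the spatial test when it lies within x_est ± f*s'."""
    return abs(x - x_est) <= f * s_prime
```

A real implementation would re-fit the regressions on a rolling window (e.g. every 24 days, as in the Silver Lake Brighton example below) rather than once for the whole record.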

Using the above methodology, the rate of error detection can be pre-selected. The reader should note that the results are presented in terms of the fraction of data flagged across the range of *f* values (defined above) rather than for one *f* value selected on an arbitrary basis. This type of analysis makes it possible to select specific *f* values for stations in differing climate regimes that would keep the Type I error rate uniform across the country. Suppose, for illustration, that the goal is to select *f* values which keep the potential Type I errors to about two percent. A representative set of stations and years can be pre-analyzed prior to QC to determine the *f* values appropriate to achieve this goal. The SRT method implicitly resolves the bias between variables at different stations induced by elevation differences or other attributes.

Tables 2 and 3 show the use of the SRT (equations 3, 4 and 5 above). The data in the example were retrieved from the AWDN stations for the month of June 2011; only one month was used in this example. The stations are located in Lincoln, NE, USA. The station being tested is Lincoln 20E 35S, labeled x, while the neighboring stations are labeled y1, y2, y3, and y4. The slope (a_{i}), intercept (b_{i}), and standard error of the linear regression between x and each y_{i} are computed. The non-biased estimates of x from the data at the neighboring stations (y_{i}) are shown as x'_{1}, x'_{2}, x'_{3}, and x'_{4}. The estimates weighted by the inverse squared standard errors (x'_{i}/s_{i}^{2}) are used in equation 4 to create the estimate x(est). The last column shows the bias between the observed x value and the estimated value x(est) from the four stations. The sum of the biases over the 30 days is 0.00, which is expected because the estimates using the SRT method are un-biased. The standard error of this regression estimate is 0.83 F. With f chosen as 3, any difference x − x(est) smaller than -2.5 F or larger than 2.5 F (3 × 0.83 ≈ 2.5) would be treated as an outlier. In this example no value of x − x(est) was marked as an outlier.

If one or more values at station x are missing, x(est) provides an estimate for each missing entry (see Table 3). The example in Table 3 shows that the value of x is missing on June 10 and June 17, 2011. Through the SRT method we obtain estimates of 67.4 F and 91.9 F for the two days, independent of the true values of 66.2 F and 91.3 F, with biases of 1.2 F and 0.6 F, respectively. Note that the estimated values for these two days differ slightly from those in Table 2 because there are two fewer values to include in the regression.

## 4. Providing estimates: robustness of SRT method and weakness of IDW method

The SRT method was tested against the inverse distance weighted (IDW) method to determine the representativeness of the estimates obtained (29). The SRT method outperformed the IDW method in complex terrain and complex microclimates. To illustrate this we have taken the data from a national cooperative observer site at Silver Lake Brighton, UT, at an elevation of 8740 ft. The nearest neighboring station is located at Soldier Summit at an elevation of 7486 ft. The data are for the year 2002. Daily estimates of maximum and minimum temperature were obtained for each day by temporarily removing the observation for that day and applying both the IDW (eq. 1) and the SRT (eq. 2) methods against 15 neighboring stations. The SRT estimates were derived by re-fitting the un-biased estimates every 24 days.

Fig. 1 shows the result for maximum temperature at Silver Lake Brighton, Utah. The IDW approach results in a large bias: the best-fit line for IDW indicates the estimates are systematically high by over 8 F (8.27), and the slope is greater than one (1.0684). When the best-fit line for the IDW estimates was forced through zero, the slope was 1.2152. By contrast, the estimates from the SRT show almost no bias, as evidenced by the best-fit slope (0.9922).

For the minimum temperature estimates a similar result was found (Fig. 2). The slope of the best-fit line for the SRT indicates an essentially unbiased estimate (0.9931), while the slope for the IDW estimates indicates a large bias on the order of 20% (slope = 1.1933). The reader should note that the SRT unbiased estimators are derived every 24 days and that applying the SRT only once for the entire period will degrade the results shown (7).

## 5. Techniques used to improve the quality control procedures during extreme events

Data quality during extreme events such as strong cold fronts and hurricanes may decrease, resulting in a higher number of "true" outliers than during normal climate conditions. (28) carefully analyzed examples of these extreme weather conditions to quantitatively demonstrate the causes of the outliers and then developed tools to reset the erroneous flags. The following discussion elaborates on this technique.

### 5.1. Relationship between interval of measurement and QA failures

Analyses were conducted to prepare artificial max and min temperature records (not direct measurements, but the values identified as the max and min from the hourly time series) for different times of observation from available hourly measurements. The observation time for coop weather stations varies from site to site. Here we define AM, PM, and nighttime stations according to the time of observation (morning, afternoon or evening, and midnight, respectively). The cooperative network has a higher number of PM stations, but AM measurements are also common; the Automated Weather Data Network uses a midnight-to-midnight observation period.

A daily precipitation value accumulates the precipitation for the 24 hours ending at the time of observation. The precipitation during this interval may not match the precipitation at nearby neighboring stations due to event slicing: precipitation may occur both before and after a station's time of observation, so a single storm can be sliced into two observation periods.

The measurements of maximum and minimum temperature result from imposing discrete intervals on a continuous variable. The maximum or minimum temperature takes the maximum or minimum value of temperature during the specific time interval; thus it is not necessarily the maximum or minimum value of a diurnal cycle. Examples of the differences were obtained from three time intervals (see Fig. 3, after (28)). The hourly measurements of air temperature were retrieved from 1:00 March 11 to 17:00 March 13, 2002 at Mitchell, NE, and the times of observation are marked. Point *A* shows the minimum air temperature obtained for March 11 for AM stations, and *B* is the maximum temperature obtained for March 13 at PM stations. The minimum temperature may carry over to the following interval for AM stations, and the maximum temperature may carry over to the following interval for PM stations. We have therefore marked these as problematic in Table 4 to note that the thermodynamic state of the atmosphere will be represented differently for AM and PM stations. Through analysis of the AM, PM, and midnight time series calculated from high-quality hourly data, we find that measurements obtained at a PM station have a higher risk of QA failure when compared to neighboring AM stations. The difference at different observation times may reach 20 ^{o}F for temperature and several inches for precipitation. Therefore the QA failures may be due not to sensor problems but to comparing data from stations where the sensors are deployed differently. To avoid this problem, AM stations can be compared to AM stations, PM stations to PM stations, etc. Note this problem will be resolved if modernization of the network provides hourly or sub-hourly data at most station sites.
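The carry-over effect can be demonstrated with a toy hourly series (entirely synthetic, not the Mitchell, NE data): the same 48 hours of temperatures yield different daily maxima for an AM station and a PM station because the previous afternoon's peak falls inside one observation window but not the other:

```python
def window_max(hourly, end_hour):
    """Daily 'maximum' as recorded by a station whose 24-hour observation
    interval ends at end_hour of the second day in the series."""
    return max(hourly[end_hour:end_hour + 24])

# Synthetic 48 hourly temperatures (two calendar days, deg F): a warm peak
# at hour 15 of day 1, then steady cooling after hour 20 (a "front").
hourly = [40 + 25 * max(0.0, 1 - abs(h - 15) / 6) - 0.8 * max(0, h - 20)
          for h in range(48)]

am_max = window_max(hourly, 7)    # 24 h ending 7:00 on day 2 (AM station)
pm_max = window_max(hourly, 17)   # 24 h ending 17:00 on day 2 (PM station)
print(am_max, pm_max)
```

The AM station's "day 2" maximum is the day-1 afternoon peak carried over, while the PM station's window misses it entirely, so a spatial comparison of the two would disagree even though both instruments are working correctly.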

### 5.2. 1993 floods

Quality control procedures were applied to the data for the 1993 Midwest floods over the Missouri River Basin and part of the upper Mississippi River Basin, where heavy rainfall and floods occurred (28). The spatial regression test performs well and flags 5~7% of the data for most of the area at *f* = 3. The spatial patterns of the fraction of flagged records do not coincide with the spatial pattern of return period. For example, the southeast part of Nebraska does not show a high fraction of flagged records although most stations there have return periods of more than 1000 years, while upper Wisconsin has a higher fraction of flagged records although the precipitation for this case has a lower return period in that area.

The analysis shows a significantly higher fraction of flagged records using AWDN stations in North Dakota than in other states. This demonstrates that the differences in daily precipitation obtained from stations with different times of observation contributed to the high fraction of QA failures. A high risk of failure would occur in such cases when the measurements of the current station and the reference station are obtained from PM stations and AM stations respectively. The situation worsens if the measurements at weather stations were obtained from different time intervals and the distribution of stations with different time-of-observation is unfavorable. This would be the case for an isolated AM or PM station.

Among the 13 flags at Grand Forks, 9 may be due to the different times of observation or perhaps the size and spacing of clouds (28). Four other flags occurred during localized precipitation events, in which only a single station received significant precipitation. Higher precipitation entries occurring in isolation are more likely to be identified as potential outliers. These problems can be reduced by examining the precipitation over larger intervals, e.g. summing consecutive days into event totals.

### 5.3. 2002 drought events

No significant relationship is found between the topography and the fraction of flagged records. Some clusters of stations with high flag frequency are located along the mountains; however, other mountainous stations do not show this pattern. Moreover, some locations with similar topography have different patterns. For the State of Colorado, a high fraction of flags occurs along the foothills of the Rocky Mountains where the mountains meet the high plains. A high fraction was also found along interstate highways 25 and 70 in east Colorado. These situations may come about because the weather stations were managed by different organizations or different sensors were employed at these stations. These differences lead to possible higher fraction of flagged records in some areas.

Instrumental failures and abnormal events also lead to QA failures. Fig. 4 shows the time series of the Stratton station in Colorado, operated as part of the automated weather network. This station has nighttime (midnight) readings while all of the neighboring sites are AM or PM stations. Stratton thus has the most flagged records in the state (6); the highlighted records in Fig. 4 were flagged. We checked the hourly time series to investigate the QA failures in the daily maximum temperature series for the period from April 20 to May 20, 2002. No value was found in the hourly series to support a Tmax of 88 ^{o}F for May 6; thus 88 ^{o}F appears to be an outlier. On May 7 a high of 85 ^{o}F is recorded for the PM-station observation interval, in which the value from the afternoon of May 6 is recorded as the high of May 7. The 102 ^{o}F observation at 6:00 AM on May 8 appears to be an observation error caused by a spike in the instrument reading. The observation of 93 ^{o}F at 8:00 AM May 17 is supported by the hourly observation time series (see Fig. 4(b)) and is apparently associated with a downburst from a decaying thunderstorm.

### 5.4. Hurricane Andrew (1992)

In Fig. 5 the evolution of the spatial pattern of flagged records from August 25 to August 28, 1992 during Hurricane Andrew, together with the corresponding daily weather maps, shows a heavy pattern of flagging. The flags in the spatial-pattern figures are cumulative for the days indicated. The spatial regression test explicitly marks the track of the tropical storm. Starting from the second landfall of Hurricane Andrew in mid-south Louisiana, the weather stations along the route have flagged records. The wind field formed by Hurricane Andrew helps to define the hurricane's zone of influence on flags. Many stations without flags have daily precipitation of more than 2 inches as the hurricane passes, which confirms that the spatial regression test performs reasonably well in the presence of high precipitation events.

### 5.5. Cold front in 1990

Flags for the cold front event during October, 1990 were examined. The maximum air temperature dropped by as much as 40 ^{o}F during the passage of the cold front. Spatial patterns of flags on October 6 coincide with the area traversed by the cold front and many stations were flagged in such states as North Dakota, South Dakota, Iowa, and Nebraska. On October 7, the cold front moved to southeast regions beyond Nebraska and Iowa. Of course nearby stations on opposite sides of the cold front may experience different temperatures thus leading to flags. This may be further complicated when different times of observation are involved. The cold front continues moving and the area of high frequency of flags also moves with the front correspondingly.

A similar phenomenon can be found in the test of the precipitation and the minimum temperature. A spatial regression test of any of these three variables can roughly mark the movements of the cold front events. The identified movements of the cold fronts and associated flagging of “good records” may lead to more manual work to examine the records. Simple pattern recognition tools have been developed to identify the spatial patterns of these flags and reset these flags automatically (see Fig. 6).

The spatial patterns of flagged records are significant for both the spatial regression test of the cold front events and the tropical storm events. However, most of these flagged records are type I errors, thus we tested a simple pattern recognition tool to assist in reducing these flags. Differences still exist between the distribution patterns of the flagged records for the cold front event and the tropical storm events due to the characteristics of cold front events and tropical storm events. These differences are:

Cold fronts have wide influence zones: the frontal passage is broad, and the large areas immediately behind the front may have a significant fraction of flagged weather stations. The influence zones of tropical storms are smaller: only the stations along the storm route and their neighbors have flags.

Cold fronts exert influences on both air temperature and precipitation. The temperature difference between the regions immediately ahead of a cold front and the regions behind it can reach 10~20 ^{o}C. The precipitation events caused by cold fronts may be significant, depending on the moisture in the atmosphere during the passage. Tropical storms generally produce a significant amount of precipitation; a few inches of rainfall in 24 hours is very common along the track because tropical storms generally carry a large amount of moisture.

### 5.6. Resetting the flags for cold front events and hurricanes

Some measurements during cold fronts and hurricanes were valid but were flagged as outliers by the QC tests because of the large temperature changes caused by frontal passages and the heavy precipitation occurring in hurricanes. A simple spatial scheme was developed to recognize regions where flags have been set due to Type I errors. The stations along a cold front may sample a mixed population in which some stations have been affected by the front and others have not. A complex pattern recognition method could be applied to identify the influence zone of a cold front through the temperature changes (e.g. using methods described in Jain et al., 2000). In our work, we use a simple rule to reset the flag, given that significant temperature changes occur when a cold front passes. The mean and the standard deviation of the temperature change can be calculated as:

where ΔT_{i} is the temperature change at the *i*^{th} neighboring station for the current day and *n* is the number of neighboring stations. A flag at the target station is reset when its temperature change is consistent with the neighboring changes, i.e. within *f'* standard deviations of their mean, where *f'* takes a value of 3.0. The test results with this refinement for T_{max} are shown in Fig. 7 for Oct. 7, 1990. The results obtained using the refinements described in this section are labeled "modified SRT" and the results using the original SRT are labeled "original SRT" in Figs. 7 and 8. Of the 291 flags originally noted, only 41 remain after the reset phase. The daily temperature drops by more than 20 ^{o}F at most stations where the flags were reset, and the largest drop is 55 ^{o}F.

For the heavy precipitation events, we compare the amount of precipitation at neighboring stations to see whether heavy precipitation occurred. We use a similar approach as for temperature to check the number of neighboring stations that have significant precipitation,

where *p*_{i} is the daily precipitation amount at a neighboring station and *p*_{threshold} is a threshold beyond which we recognize that a significant precipitation event has occurred at the neighboring station, e.g. 1 in. When enough neighboring stations exceed this threshold, the flag at the current station is reset; here *p* is the precipitation amount at the current station, and *p*_{high} is the upper threshold beyond which the measurement remains flagged. Fig. 8 shows maps of flags after the reset process. Of the 78 flags originally noted, only 41 remain after the reset phase. Most of the remaining flags are due to the precipitation being higher than the upper threshold.

Flags for Hurricane Andrew, 1992: the flags are cumulative from Aug. 20 to Aug. 29, 1992, and the flags from the modified SRT method overlay the flags from the original SRT method.

## 6. Multiple interval methods based on measurements from reference stations for precipitation

One QC approach involved developing threshold quantification methods to identify a subset of data consisting of potential outliers in the precipitation observations with the aim of reducing the manual checking workload. This QC method for precipitation was developed based on the empirical statistical distributions underlying the observations.

The search for precipitation quality control (QC) methods has proven difficult. The high spatial and temporal variability associated with precipitation data causes high uncertainty and edge creep when regression-based approaches are applied. Precipitation frequency distributions are generally skewed rather than normally distributed. The commonly assumed normal distribution in QC methods is not a good representation of the actual distribution of precipitation and is inefficient in identifying the outliers.

The SRT method is able to identify many of the errant data values, but the ratio of errant values found to Type I errors made is conservatively 1:6. This is not acceptable because it would take excessive manpower to check all the flagged values generated in a nationwide network. For example, the number of precipitation observations from the cooperative network on a typical day is 4000. Using an error rate of 2% and considering the Type I error rate indicates that several hundred values may be flagged, requiring substantial personnel resources for assessment.

(29) found that the use of a single gamma distribution fit to all precipitation data was ineffective. A second test, the multiple interval gamma distribution (MIGD) method, was introduced. It assumes that meteorological conditions that produce a certain range in average precipitation at surrounding stations will produce a predictable range at the target station. The MIGD method sorts data into bins according to the average precipitation at neighboring stations; then, for the events in a specific bin, an associated gamma distribution is derived by fitting the same events at the target station. The new gamma distributions can then be used to establish the threshold for QC according to a user-selected probability of exceedance. We also employed the *Q* test for precipitation (20), using a metric based on comparisons with neighboring stations. The performance of the three approaches was evaluated by assessing the fraction of "known" errors that can be identified in a seeded-error dataset (18). The single gamma distribution and the *Q*-test approach were found to be relatively efficient at identifying extreme precipitation values as potential outliers. However, the MIGD method outperforms the other two QC methods: it identifies more seeded errors and results in fewer Type I errors.

### 6.1. Estimation of parameters for the distribution of precipitation and thresholds from the Gamma distribution

The Gamma distribution was employed to represent the distribution of precipitation. While other functions may provide a better overall fit to precipitation data, our goal is to establish a reasonable threshold beyond which further checking is required to determine whether a value is an outlier or simply an extreme precipitation event. The precipitation events are fit to a Gamma distribution whose shape and scale parameters, *γ* and *β*, can be estimated from the precipitation events following (21) and (13), where the estimators are functions of the sample mean and the sample standard deviation *s*.

The data for each station in the Gamma distribution test include all precipitation events on a daily basis for a year. The parameters for left-censored (0 values excluded) Gamma distributions, on a monthly basis, are also calculated, based on the precipitation events for individual months in the historical record. To ascertain the representativeness of the Gamma distribution, the precipitation value for the corresponding percentiles (*P*): 99, 99.9, 99.99, and 99.999% were computed from the Gamma distribution and compared with the precipitation values for given percentiles based on ranking (original data).

The criterion for a threshold test approach can be written as,

$$x(j,t) \le I(p),$$

where *x*(*j,t*) is the observed daily precipitation on day *t* at station *j* and *I*(*p*) is the threshold daily precipitation for a given probability, *p* (=*P*/100), calculated using the Gamma distribution. A value not meeting this criterion is noted as a potential outlier (the shaded area to the right of the *p*=0.995 value for the distribution of all precipitation events in Fig. 9). The test function uses a one-sided test, since precipitation is a non-negative variable.
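As an illustration, the parameter estimation and one-sided threshold test described above can be sketched in pure Python. This is a minimal sketch: the function names and the series-based inversion of the CDF are ours, not part of the operational implementation.

```python
import math

def fit_gamma_moments(events):
    """Method-of-moments estimates: shape = mean^2/var, scale = var/mean."""
    n = len(events)
    mean = sum(events) / n
    var = sum((x - mean) ** 2 for x in events) / (n - 1)
    return mean * mean / var, var / mean          # (shape, scale)

def gamma_cdf(x, shape, scale):
    """Regularized lower incomplete gamma function, by series expansion."""
    if x <= 0.0:
        return 0.0
    t = x / scale
    term = 1.0 / shape
    total = term
    k = 0
    while term > 1e-14 * total and k < 10_000:
        k += 1
        term *= t / (shape + k)
        total += term
    return total * math.exp(shape * math.log(t) - t - math.lgamma(shape))

def gamma_threshold(p, shape, scale):
    """I(p): invert the CDF by bisection to get the threshold precipitation."""
    lo, hi = 0.0, scale * (shape + 100.0)   # ample upper bracket for p <= 0.99999
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape, scale) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def is_potential_outlier(x, p, shape, scale):
    """One-sided test: flag x if it exceeds the threshold I(p)."""
    return x > gamma_threshold(p, shape, scale)
```

With the fitted parameters, the thresholds for *P* = 99, 99.9, 99.99, and 99.999% follow directly from `gamma_threshold`.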

### 6.2. Multiple interval gamma distribution (MIGD) test for precipitation

Analysis has shown that precipitation data at a station can be fit to a Gamma distribution, which can then be applied in a threshold test approach. With this method only the most extreme precipitation events will be flagged as potential outliers, so errant data at other points in the distribution are not identified.

The MIGD was developed to address these non-extreme points along the distribution. It assumes that meteorological conditions that produce a certain range in average precipitation at surrounding stations will produce a predictable range at the target station. Our concept is to develop a family of Gamma distributions for the station of interest and to selectively apply the distributions based on specific criteria. The average precipitation for each day is calculated for neighboring stations during a time period (e.g. 30 years). These values are ranked and placed into *n* bins with an equal number of values in each. The range for *n* intervals can be obtained from the cumulative probabilities of the neighboring average time series, {0, *1/n*, *2/n*, …, *(n-1)/n*, 1}. For the *i*^{th} interval all corresponding precipitation values at the station of interest (target station) are gathered and parameters for the gamma distribution estimated. This process is repeated for each of the *n* intervals, resulting in a family of Gamma curves (*G*_{i}). The operational QC involves the application of the threshold test, where the gamma distribution for a given day is selected from the family of curves based on the average precipitation for the neighboring stations. Each interval can be defined as

$$\left[\,Q(i/n),\; Q((i+1)/n)\,\right), \qquad i = 0, \ldots, n-1,$$

where *Q* denotes the quantile function of the neighboring average time series.

Now, for each precipitation event, *x*, at the station of interest, the neighboring stations’ average is calculated. If the average precipitation falls in the *i*^{th} interval, the associated gamma distribution *G*_{i} is used to form a test:

$$G_{i}(1-p) \le x \le G_{i}(p),$$

where *p* is a probability in the range (0.5, 1), and *G*_{i}(*p*) is the precipitation value for the given probability *p* in the gamma distribution associated with the *i*^{th} interval. This equation forms a two-sided test. Any value that does not satisfy this test will be treated as an outlier for further manual checking. The intervals and the estimation for this method were implemented using *R* statistical software (19).
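The binning and interval-selection steps described above can be sketched as follows. This is a minimal Python sketch under our own naming (the published implementation used *R*): interior bin edges are taken at the cumulative probabilities 1/*n*, …, (*n*−1)/*n* of the historical neighboring-station averages, and each new day's average is mapped to its interval before the two-sided test is applied.

```python
import bisect

def quantile_edges(neighbor_avgs, n):
    """Interior bin edges at cumulative probabilities 1/n, ..., (n-1)/n."""
    s = sorted(neighbor_avgs)
    return [s[(len(s) * i) // n] for i in range(1, n)]

def interval_index(edges, avg):
    """Index i (0..n-1) of the interval containing the day's neighbor average."""
    return bisect.bisect_right(edges, avg)

def migd_flag(x, avg, edges, gamma_ppf_by_interval, p=0.999):
    """Two-sided test: flag x if it falls outside (G_i(1-p), G_i(p)).

    `gamma_ppf_by_interval` is assumed to hold one fitted quantile function
    (ppf) per interval, estimated elsewhere from the target-station events.
    """
    i = interval_index(edges, avg)
    ppf = gamma_ppf_by_interval[i]
    return not (ppf(1.0 - p) <= x <= ppf(p))
```

A separate list of fitted quantile functions would be kept for each month or season, mirroring the per-month families of Gamma curves described in the text.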

The results indicate that the Gamma distribution is well suited for deriving appropriate thresholds for a particular precipitation event. The calculated extreme values provide a good basis for identifying extreme outliers in the precipitation observations. The inclusion of all precipitation events reduces the data requirements for the quantification of extreme events, which generally requires a long time series of observations (e.g. using the Gumbel distribution). Using the approach based on the Gamma distribution, a suitable representation of the distribution of precipitation can be obtained with only a few years of observation, as is the case with newly established automatic weather stations, e.g. the Climate Reference Network. Further study is required for probability selection in the Gamma distribution approach.

A simple gamma distribution can be fit to the daily precipitation values at a station. Upper thresholds can be set based on the cumulative probability of the precipitation distribution. This single gamma distribution (SGD) test will address the most extreme values of precipitation and flag them for further testing. However, to address non-extreme values of precipitation that are not out on the tail of the SGD, another approach is needed. We have formulated the multiple interval gamma distribution (MIGD) test for this purpose. The main assumption is that the meteorological conditions that produce a certain range in average precipitation at surrounding stations will produce a predictable range of precipitation at the target station. It does not estimate the precipitation at the target station but estimates the range into which the precipitation should fall.

The average precipitation for each day is calculated for neighboring stations during a historical period, say 30 years. These values are then ranked and placed into *n* bins with an equal number of values in each. For all the values in a given bin, the daily precipitation values at the target station are gathered and a gamma distribution formed. The process is repeated *n* times, once for each bin, resulting in a family of gamma distribution curves. A separate family of curves can be derived for each month or each season. In operation, the daily average of the precipitation at surrounding stations is calculated and used to select the corresponding gamma distribution, which in turn provides thresholds against which to test for that day. For instance, the upper threshold can be selected to correspond with a cumulative probability for the selected gamma distribution. The user is able to specify the threshold according to the cumulative probability. For example, we can be 99.5% confident that values will not exceed the corresponding value on the cumulative probability curve. Values that exceed this are not necessarily wrong but are flagged for further review. The MIGD will find more precipitation values that need to be reviewed than the single gamma distribution test.

Table 5 provides an example of the MIGD for *n*=5 at Tucson, AZ, USA. We update this type of information on an annual basis. If the precipitation value falls outside the q value of a selected confidence level, we mark the value as an outlier. For example, suppose we select q999 for our confidence level. The precipitation on August 2, 1987 was 1.3 inches, while the average of neighboring stations was 0.06 inches. The average falls between the lower and upper bounds in the 2^{nd} row (*n*=2), i.e. 0.05 and 0.11. The rainfall value (1.3 inches) is larger than the q999 threshold (1.15 inches), thus we can say we are 99.9% confident that the rainfall is an outlier and it should be flagged for further manual examination. Note that 1.3 inches is in no way an extreme precipitation value, but its validity can be challenged on the basis of the MIGD test.

Another QC method for precipitation is the Q-test (20). The Q-test approach serves as a tool to discriminate between extreme precipitation and outliers, and it has proven to minimize the manual examination of precipitation by choice of parameters that identify the most likely outliers (20). The performance of both the Gamma distribution test and the Q-test is relatively weak with respect to identifying the seeded errors. The Q-test differs from the Gamma distribution method in that it uses both the historical data and measurements from neighboring stations, while the simple implementation of the Gamma distribution method uses only the data from the station of interest.

The MIGD method is a more complex implementation of the Gamma distribution that uses historical data and measurements from neighboring stations to partition a station’s precipitation values into separate populations. The MIGD method shows promise and outperforms the other QC methods for precipitation: it identifies more seeded errors and creates fewer Type I errors than the other methods. MIGD will be used as an operational tool for identifying precipitation outliers in ACIS. However, the fraction of errors identified by the MIGD method varies for different probabilities and among the different stations. Network operators, data managers, and scientists who plan to use MIGD to identify potential precipitation outliers can perform a similar analysis (sort the data into bins and derive the gamma distribution coefficients for each interval) over their geographic region to choose an optimum probability level.

## 7. Quality control of the NCDC dataset to create a serially complete dataset

Development of continuous and high-quality climate datasets is essential to populate Web-distributed databases (17) and to serve as input to Decision Support Systems (e.g., 27).

Serially complete data are necessary as input to many risk assessments related to human endeavor, including the frequency analysis associated with heavy rains, severe heat, severe cold, and drought. Continuous data are also needed to understand climate impacts on crop yield and ecosystem production. The National Drought Mitigation Center (NDMC) and the High Plains Regional Climate Center (HPRCC) at the University of Nebraska are developing a new drought atlas. The last drought atlas (1994) was produced with the data from 1119 stations ending in 1992. The forthcoming drought atlas will include additional stations and will update the analyses, maps, and figures from 1994 to the present. A list was compiled from the Applied Climate Information System (ACIS) for stations with at least 40 years of observations for all three variables: precipitation (PRCP), maximum (Tmax), and minimum (Tmin) temperature. Paper records were scrutinized to identify reported, but previously non-digitized, data to reduce, to the extent possible, the number of missing data. A list of 2144 stations was compiled for the sites that met the criterion of at least 40 years of data, with no continuous gap longer than two months, for at least one of the three variables. The remaining missing data in the dataset were supplemented by estimates obtained from the measurements made at nearby stations. The spatial regression test (SRT) and the inverse distance weighted (IDW) method were adopted in a dynamic data-filling procedure to provide these estimates. The replacement of missing values follows a reproducible process that uses robust estimation procedures and results in a serially complete data set (SCD) for 2144 stations that provides a firm basis for climate analysis. Scientists who have used more qualitative or less sophisticated quantitative QC techniques may wish to use this data set so that direct comparisons to other studies that used this SCD can be made without worry about how differences in missing-data procedures would influence the results. A drought atlas based on data from the SCD will provide decision makers more support in their risk management needs.

After identifying stations with a long-term (at least 40 years) continuous (no data gaps longer than two months) dataset of Tmax, Tmin, and/or PRCP, for a total of 2144 stations, the missing values in the original dataset retrieved from ACIS were filled to the extent possible with the keyed data from paper records and with estimates from the SRT and IDW methods. Two implementations of the SRT were applied in this study. The short-window (60 days) implementation provides the best estimates, based on the most recent information available for constructing the regression. The second implementation of the SRT fills the long gaps (e.g. gaps longer than one month) using the data available on a yearly basis. The IDW method was adopted to fill any remaining missing data after the two implementations of the SRT.
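The IDW step of the gap-filling procedure can be sketched as follows. This is a minimal sketch: the function name and the weighting exponent are our assumptions, not the operational configuration.

```python
def idw_estimate(target, stations, power=2.0):
    """Inverse distance weighted estimate at `target` = (x, y) from a list of
    (x, y, value) station tuples; weights are 1 / distance**power."""
    num = den = 0.0
    for sx, sy, value in stations:
        d2 = (sx - target[0]) ** 2 + (sy - target[1]) ** 2
        if d2 == 0.0:
            return value                  # co-located station: use it directly
        w = d2 ** (-power / 2.0)
        num += w * value
        den += w
    return num / den
```

Because the weights depend only on distance, nearer stations dominate the estimate; this is also the source of the bias concerns discussed for gridded datasets below.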

This is the first serially complete data set where a statement of confidence can be associated with many of the estimates, i.e. the SRT estimates. The RMSE is less than 1 °F in most cases, and thus we are 95% confident that the value, if available, would lie within ±2 °F of the estimate. This data set is available [1] to interested parties and can be used in crop models and in assessments of severe heat, cold, and dryness. Probabilities related to extreme rainfall for flooding and erosion potential can be derived, along with indices to reflect impacts on livestock production. The data set is offered as an option to distributing raw data to users who need this level of spatial and temporal coverage but are not well positioned to spend time and resources to fill gaps with acceptable estimates.

Analysis based on the long-term dataset will best reveal the regional and large scale climatic variability in the continental U.S., making this an ideal data set for the development of a new drought atlas and associated drought index calculations. Future data observations can be easily appended to this SCD with the dynamic data filling procedures described herein.

## 8. Issues relating QC to gridded datasets

Gridded datasets are sometimes used in QC, but we caution against this for the following reasons. New datasets created from inverse distance weighted methods or kriging suffer from uncertainties. The values at a grid point are usually not "true" measurements but are interpolated values from the measurements at nearby stations in the weather network. Thus, the values at the grid points are susceptible to bias. When further interpolation is made to a given location within the grid, bias will again exist at the specific location between the gridded values. Fig. 10 provides an example of potential bias. Outside of a gridded data set the target location would give a large weight to the value at station 5. However, if the radius used for the gridded data is as in Fig. 10, then the closest station to the target station (5) will not be included in the grid-based estimation.

## 9. Quality control of high temporal resolution datasets

The Oklahoma Mesonet (http://www.mesonet.org/) measures and archives weather conditions at 5-minute intervals (Shafer et al., 2000). The quality control system used in the network starts from the raw measurement data for the high temporal resolution data. A set of QC tools was developed to routinely maintain the data of the Mesonet. These tools depend on the status of hardware and measurement flag sets built into the climate data system. The Climate Reference Network (CRN, Baker et al. 2004) is another example of the QC of high frequency data; it installs multiple sensors for each variable to guarantee the continuous operation of the weather station, and thus the quality control can also rely on multiple measurements of a single variable. This method is efficient at detecting instrument failures or other disturbances; however, the cost of such a network may be prohibitive for non-research or operational networks. The authors of this chapter also carried out QC on a high temporal resolution dataset in the Beaufort and Chukchi Sea regions. Surface meteorological data from more than 200 stations in a variety of observing networks and various stand-alone projects were obtained for the MMS Beaufort and Chukchi Seas Modeling Study (Phase II). Many stations have a relatively short period of record (i.e. less than 10 years). The traditional basic QC procedures were developed and tested for daily data and found in need of improvement for the high temporal resolution data. In the modification, the time series of the maximum and the minimum were calculated from the high resolution data. The mean and standard deviation of the maximum and the minimum can then be calculated from the time series (e.g. max and min temperatures) as (u_{x}, s_{x}) and (u_{n}, s_{n}), respectively. Equation (6), using (u_{x} + f s_{x}) and (u_{n} - f s_{n}), forms limits defined by the upper limit of the maximum and the lower limit of the minimum. A value falling outside the limits will be flagged as an outlier for further manual checking. Similarly, the diurnal change of a variable (e.g. temperature) was calculated from the high resolution (hourly or sub-hourly) data. The mean and standard deviation calculated from the diurnal changes then form the limits.
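The modified limits test can be sketched as follows (a minimal sketch; the function and variable names are ours, and *f* plays the role of the multiplier in equation (6)):

```python
import statistics

def high_res_limits(daily_max, daily_min, f=3.0):
    """Limits (u_n - f*s_n, u_x + f*s_x) from the series of daily extremes."""
    u_x, s_x = statistics.mean(daily_max), statistics.stdev(daily_max)
    u_n, s_n = statistics.mean(daily_min), statistics.stdev(daily_min)
    return u_n - f * s_n, u_x + f * s_x

def flag_value(value, lower, upper):
    """Flag a high-resolution observation falling outside the limits."""
    return value < lower or value > upper
```

The same pattern applies to the diurnal-change check: compute the mean and standard deviation of the diurnal changes and flag changes outside the resulting limits.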

The traditional quality control methods were improved for examining the high temporal resolution data, to avoid intensive manual reviewing, which is neither timely nor cost efficient. The problems identified in the dataset demonstrate that the improved methods did find considerable errors in the raw data, including time errors (e.g. month being greater than 12). These new tools offer a dataset that, after manual checking of the flagged data, can be given a statement of confidence. The level of confidence can be selected by the user prior to QC.

The applied in-station limit tests can successfully identify outliers in the dataset. However, spatial tests based on information from the neighboring stations are more robust in many cases and identify errors or outliers in the dataset when strong correlation exists. The good relationship between the measurements at station pairs demonstrates that there is a potential opportunity to successfully apply the spatial regression test (SRT, 18) to the stations which measure the same variables (i.e. air temperature or wind speed). The short term measurements at some stations may not be efficiently QC’ed with only the three methods described in this work. One example is the dew point measurements at the first-order station Iultin-in-Chukot. More than 90 percent of the dew point measurements were flagged, because the parameters for QC’ing the variable were state-wide parameters which cannot reflect the microclimate of each station.

## 10. Summary and Conclusions

Quality control (QC) methods can never provide total proof that a data point is good or bad. Type I errors (false positives) and Type II errors (false negatives) can occur, resulting in good data being labeled as bad and bad data as good, respectively. Decreasing the number of Type I and Type II errors is difficult because often a push to decrease Type I errors will result in an unintended increase in Type II errors, and vice versa. We have derived a spatial technique to introduce thresholds associated with user-selected probabilities (i.e. select 99.7% as the level of confidence that a data value is an outlier before labeling it as bad and/or replacing it with an estimate). We base this technique on statistical regression in the neighborhood of the data in question and call it the Spatial Regression Test (SRT). Observations taken in a network are often affected by the same factors. In weather applications, individual stations in a network are generally exposed to air masses in much the same way as are neighboring stations. Thus, temperatures in the vicinity move up and down together, and the correlation between data in the same neighborhood is very high. Similarly, seasonal forcings on this neighborhood (e.g. the day to day and seasonal solar irradiance) are essentially the same. We have defined a neighborhood for a station as those nearby stations that are best correlated to it. We found that the SRT method is an improvement over conventional inverse distance weighting (IDW) estimates. A large benefit of the SRT method is its ability to remove systematic biases in the data estimation process. Additionally, the method allows a user-selected threshold on the probability, in contrast to the IDW. Although the SRT estimates are similar to IDW estimates over smooth terrain, SRT estimates are notably superior over complex terrain (mountains) and in the vicinity of other climate forcings (e.g. ocean/land boundaries).
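As a rough illustration of the regression idea behind the SRT (the exact published algorithm in (18) differs; the inverse-RMSE² weighting and all names below are our simplification), each neighbor contributes a linear-regression estimate of the target station, and the estimates are combined with weights favoring the better-fitting neighbors:

```python
import math

def ols_fit(x, y):
    """Ordinary least squares fit y ≈ a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx

def srt_estimate(target_hist, neighbor_hists, neighbor_today):
    """Combine per-neighbor regression estimates of the target station,
    weighted by 1/RMSE**2 (our simplification of the SRT weighting)."""
    num = den = 0.0
    for hist, today in zip(neighbor_hists, neighbor_today):
        a, b = ols_fit(hist, target_hist)
        rmse = math.sqrt(sum((a * xi + b - yi) ** 2
                             for xi, yi in zip(hist, target_hist)) / len(hist))
        w = 1.0 / max(rmse, 1e-9) ** 2
        num += w * (a * today + b)
        den += w
    return num / den

def srt_flag(observed, estimate, rmse, f=3.0):
    """Flag the observation if it lies outside estimate ± f * RMSE."""
    return abs(observed - estimate) > f * rmse
```

Because each neighbor's regression absorbs the offset between the two sites, this scheme removes systematic biases that a purely distance-based weighting such as IDW cannot, which is the property emphasized above.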
Gridded data sets that result from IDW, Kriging, or most other interpolation schemes do not provide unbiased estimates. Even when grid spacing is decreased to a point where the complexity of the land surface is well represented, there remain two problems: what is the microclimate of the nearest observation points, and what is the transfer function between points? This is a future challenge for increasing the quality of data sets and the estimation of data between observation sites.