The time-mean analysis RMSE and the time-mean objective function values in 4 EnKF schemes for Lorenz-96 model when observational errors are spatially correlated and their covariance matrix is correctly specified: (1) EnKF (non-inflation); (2) the SLS scheme in Section 2.1 (SLS); (3) the SLS scheme in Section 2.2 (SLS and new structure); (4) the SLS scheme in the discussion (SLS and true ensemble forecast error). The forcing term *F* = 12.

## Abstract

The estimation accuracy of ensemble forecast errors is crucial to the assimilation results for all ensemble-based schemes. The ensemble Kalman filter (EnKF) is a widely used scheme in land surface data assimilation, without using the adjoint of a dynamical model. In EnKF, the forecast error covariance matrix is estimated as the sampling covariance matrix of the ensemble forecast states. However, past researches on EnKF have found that it can generally lead to an underestimate of the forecast error covariance matrix, due to the limited ensemble size, as well as the poor initial perturbations and model error. This can eventually result in filter divergence. Therefore, using inflation to further adjust the forecast error covariance matrix becomes increasingly important. In this chapter, a new structure of the forecast error covariance matrix is proposed to mitigate the problems with limited ensemble size and model error. An adaptive procedure equipped with a second-order least squares method is applied to estimate the inflation factors of forecast and observational error covariance matrices. The proposed method is tested on the well-known atmosphere-like Lorenz-96 model with spatially correlated observational systems. The experiment results show that the new structure of the forecast error covariance matrix and the adaptive estimation procedure lead to improvement of the analysis states.

### Keywords

- data assimilation
- ensemble Kalman filter
- error covariance inflation
- observation-minus-forecast residual
- least squares

## 1. Introduction

For state variables in geophysical research fields, a common assumption is that systems have “true” underlying states. Data assimilation is a powerful mechanism for estimating the true trajectory based on the effective combination of a dynamic forecast system (such as a numerical model) and observations [1]. It can produce an optimal combination of model outputs and observations [2]. The combined result is called analysis state, which should be closer to the true state than either the model forecast or the observation is. In fact, the analysis state can generally be treated as the weighted average of the model forecasts and observations, while the weights are approximately proportional to the inverse of the corresponding covariance matrices [3]. Therefore, the results of any data assimilation depend crucially on the estimation accuracy of the forecast and observational error covariance matrices [4]. If these matrices are estimated correctly, then the analysis states can be generated by minimizing an objective function which is technically straightforward and can be accomplished using existing engineering solutions [5], although finding the appropriate analysis state is still a quite difficult problem when the models are nonlinear [6, 7].

The ensemble Kalman filter (EnKF) is a widely used sequential data assimilation approach, which has been studied and applied since it is proposed by Evensen [8]. It is a practical ensemble-based assimilation scheme that estimates the forecast error covariance matrix using a Monte Carlo method with the short-term ensemble forecast states [9]. In EnKF, the forecast error covariance matrix is estimated as the sampling covariance matrix of the ensemble forecast states, which is usually underestimated due to the limited ensemble size and model error [10]. This finding indicates that the filter is over reliant on the model forecasts and excludes the observations. It may eventually lead to the divergence of the EnKF assimilation scheme [11, 12].

Therefore, the forecast error covariance inflation technique to address this problem becomes increasingly important. One of the error covariance matrix inflation techniques is additive inflation, in which a noise is added to the ensemble forecast states that sample the probability distribution of model error [13, 14]. Another widely used error covariance matrix inflation technique is multiplicative inflation, that is, to multiply the matrix by an appropriate factor. It can be used to mitigate filter divergence by inflating the empirical covariance and increasing the robustness of the filter [15].

In early studies of multiplicative inflation, researchers determine the inflation factor by repeated experimentation and choose a value according to their prior knowledge [11]. Hence, such experimental tuning is rather empirical and subjective. It is not appropriate to use the same inflation factor during all the assimilation procedure. Too small or too large an inflation factor will cause the analysis state to over rely on the model forecasts or observations and can seriously undermine the accuracy and stability of the filter. In later studies, the inflation factor is estimated online based on the observation-minus-forecast residual (innovation statistic) [16, 17] with different conditions.

Past work shows that moment estimation can facilitate the calculation by solving an equation of the observation-minus-forecast residual and its realization [18–20]. Maximum likelihood approach can obtain a better estimate of the inflation factor than moment approach, although it must calculate a high-dimensional matrix determinant [21–24]. Bayesian approach assumes a prior distribution for the inflation factor but is limited by spatially independent observational errors [25, 26]. Second-order least square estimation focus on minimizing the second-order least squares (SLS) [27] statistic of the squared observation-minus-forecast residual, which is not very expensive [28–30]. Generalized Cross Validation (GCV) [31, 32] can select a regularization parameter by minimizing the predictive data errors with rotation invariant in a least squares solution [33].

In practice, the observational error covariance matrix may also need to be adjusted, and an approach can be used to simultaneously optimize inflation factors of both forecast and observational error covariance matrices [21]. This approach is based on the optimization of the likelihood function of observation-minus-forecast residual. However, the likelihood function of observation-minus-forecast residual is nonlinear and involves the computationally expensive determinant and inverse of the residual covariance matrix. As compensation, the second-order least squares statistic of the squared observation-minus-forecast residual can be used as the cost function instead. The main advantage of the SLS cost function is that it is a quadratic function of the inflation factors, and therefore, the analytic forms of the estimators of the inflation factors can be easily obtained. Compared with the method based on maximum likelihood estimation method, the computational cost is significantly reduced.

Furthermore, unlike the sampling covariance matrix of the ensemble forecast states used in the conventional EnKF, a new structure of the forecast error covariance matrix is proposed in this chapter. In ideal situation, an ensemble forecast state is assumed as a random vector with the true state as its ensemble mean. Hence, it is should be defined that the ensemble forecast error is the ensemble forecast states minus true state rather than minus their ensemble mean [34]. This is because in a forecast model with large error and limited ensemble size, the ensemble mean of the forecast states can be very far from the true state. Therefore, the sampling covariance matrix of the ensemble forecast states can be very different from the true forecast error covariance matrix. As a result, the estimated analysis state can be substantially inaccurate. However, in reality, the true state is unknown, but the analysis state is a better estimate of the true state than the forecast state. Therefore, the information feedback from the analysis state can be used to revise the forecast error covariance matrix. In fact, the proposed forecast error covariance matrix is a combination of multiplicative and additive inflation. Bai and Li [14] also used the feedback from the analysis state to improve assimilation but in a different way.

This chapter consists of four sections. The EnKF scheme with a new structure of the forecast error covariance matrix and the adaptive estimation procedure is proposed in Section 2. The assimilation results on Lorenz model with a correlated observational system are presented in Section 3. Conclusions and discussion are given in Section 4.

## 2. Methodology

### 2.1. EnKF with SLS inflation scheme

Using the uniform notations for consistency, a nonlinear discrete-time forecast and linear observational system is written as [35]

where *i* is the time index; *n*-dimensional true state vector at time step *i*; *n*-dimensional analysis state vector which is an estimate of _{,} respectively. The goal of the EnKF assimilation is to find a series of analysis states

It is well-known that any EnKF assimilation scheme should include a forecast error inflation scheme. Otherwise, the EnKF may diverge [11]. A procedure for estimating multiplicative inflation factor of

Step 1. Calculate the perturbed forecast states

where *m* is the ensemble size).

Step 2. Estimate the improved forecast and observational error covariance matrices.

The forecast state

and the initial observational error covariance matrix is

There are several approaches for estimating the inflation factors

This leads to

(See Appendix A for detailed derivation). Similar to Wang and Bishop [19] and Li et al. [18], this procedure does not use Bayesian approach [20, 25, 26].

Step 3. Compute the perturbed analysis states.

where

### 2.2. EnKF with SLS inflation and new structure of forecast error covariance matrix

By Eqs. (1) and (3), the ensemble forecast error is defined as

Here, the observations can be used for improving the estimation accuracy of ensemble forecast error. The basic sense is as follows: After the analysis state

Step 2a. Use Step 2 in Section 2.1 to inflate the initial forecast error covariance matrix to *k* = 1.

Step 2b. Update the forecast error covariance matrix as

Then, adjust the forecast and observational error covariance matrices to

and

are estimated by minimizing the objective function.

If *k*-th updated analysis state as

set *k* = *k* + 1 and return back to Eq. (9); otherwise, take *i*-th time step and go to Step 3 in Section 2.1.

A general flowchart of the proposed assimilation scheme is shown in Figure 1. Moreover, the proposed forecast error covariance matrix (Eq. (9)) can be expressed as.

which is a multiplicatively inflated sampling error covariance matrix plus an additive inflation matrix (see Appendix B for the proof).

### 2.3. Notes

#### 2.3.1. Correctly specified observational error covariance matrix

If the observational error covariance matrix

This leads to a simpler estimate

#### 2.3.2. Validation statistics

In any toy model, the “true” state *i-*th step is defined as

where *k*-th components of the analysis state and true state at the *i*-th time step. In principle, a smaller RMSE indicates a better performance of the assimilation scheme.

## 3. Experiment on Lorenz 96 model

In this section, the EnKF with SLS inflation assimilation scheme is applied to a nonlinear dynamical system, which has properties relevant to realistic forecast problems: the Lorenz-96 model [39] with model error and a linear observational system. The performances of the assimilation schemes in Section 2 are evaluated through the following experiments.

### 3.1. Description of forecast and observational systems

The Lorenz-96 model [39] is a strongly nonlinear dynamical system with quadratic nonlinearity, which is governed by the equation.

where *k*, it is defined that

The true state is derived by a fourth-order Runge–Kutta time integration scheme [40]. The time step for generating the numerical solution was set at 0.05 nondimensional units, which is roughly equivalent to 6 hours in real time, assuming that the characteristic time-scale of the dissipation in the atmosphere is 5 days [39]. The forcing term was set as *F* = 8, so that the leading Lyapunov exponent implies an error-doubling time of approximately 8 time steps, and the fractal dimension of the attractor was 27.1 [41]. The initial value was chosen to be

In this study, the synthetic observations were assumed to be generated by adding random noises that were multivariate-normally distributed with mean zero and covariance matrix *j*-th and *k*-th grid points was as follows:

Since it can deal with spatially correlated observational errors, the scheme may potentially be applied for assimilating remote sensing observations and radiances data.

The model errors by changing the forcing term are added in the forecast model because it is inevitable in real dynamic systems. Thus, different values of *F* are chose in the assimilation schemes while retaining *F* = 8 when generating the “true” state. The observations are simulated every four time steps analogizing 1 day in realistic problem for 2000 steps to ensure robust results. The ensemble size is used as 30. The pre-determined threshold ^{5}. In most cases of the following experiment, the objective functions converge after 3–4 iterations, and the estimated analysis states also converge.

### 3.2. Comparison of assimilation schemes

In Section 2.1, the EnKF assimilation scheme with SLS error covariance matrix inflation is outlined. In Section 2.2, the improved EnKF assimilation scheme with the SLS error covariance matrix inflation and the new structure of the forecast error covariance matrix are summarized. In the following, the influences of these estimation methods on EnKF data assimilation schemes are assessed using Lorenz-96 model.

Lorenz-96 model is a forced dissipative model with a parameter *F* that controls the strength of the forcing (Eq. (18)). The model behaviors are quite different with different values of *F*, and chaotic systems are produced with integer values of *F* larger than 3. Therefore, several values of *F* are used to simulate a wide range of model errors. In all cases, the true states were generated by a model with *F* = 8. These observations were then assimilated into models with *F* = 4, 5, …, 12.

#### 3.2.1. Correctly specified observational error covariance matrix

Suppose the observational error covariance matrix

Figure 2 shows the time-mean analysis RMSE of the two assimilation schemes averaged over 2000 time steps, as a function of *F*. Overall, the analysis RMSE of the two assimilation schemes gradually grows as increasing model error. When *F* is near the true value 8, the two assimilation schemes have almost indistinguishable values of the analysis RMSE. However, when *F* becomes increasingly distant from 8, the analysis RMSE of the assimilation scheme with the new structure of the forecast error covariance matrix becomes progressively smaller than that of the assimilation scheme with the forecast error covariance matrix inflation only.

For the Lorenz-96 model with large error (such as, the case with *F* = 12), the time-mean analysis RMSE of the two assimilation schemes is listed in Table 1, as well as the time-mean values of the objective functions. The conventional EnKF assimilation scheme is also included for comparison. These results show clearly that our two schemes have significantly smaller RMSE than the EnKF assimilation scheme. Moreover, the assimilation scheme with the new structure of the forecast error covariance matrix performs much better than assimilation scheme with forecast error covariance matrix inflation only.

EnKF schemes | Time-mean RMSE | Time-mean L |
---|---|---|

Non-inflation | 5.65 | 2,298,754 |

SLS | 1.89 | 148,468 |

SLS and new structure | 1.22 | 38,125 |

SLS and true ensemble forecast error | 0.48 | 19,652 |

#### 3.2.2. Incorrectly specified observational error covariance matrix

In this section, the observational error covariance matrix is supposed to be correct only up to a constant factor. The factor is estimated using different estimation methods, and the corresponding assimilation results are evaluated.

The observational error covariance matrix *F* takes values 4, 5, …, 12 when assimilating observations, but *F* = 8 is used when generating the true states in all cases.

Figure 3 shows the time-mean analysis RMSE of the two cases averaged over 2000 time steps, as a function of forcing term. Generally speaking, the analysis RMSE of the two cases gradually grows as the increasing the model error. However, the analysis RMSE generated by using new structure of the forecast error covariance matrix (cases 2) is smaller than those by using the error covariance matrices inflation technique only (cases 1).

For the Lorenz-96 model with forcing term *F* = 12, the time-mean analysis RMSE of the two cases is listed in Table 2, along with the time-mean values of the objective functions. These results clearly show that when the observational error covariance matrix is incorrectly specified, the assimilation result is much better if the new structure of the forecast error covariance matrix is used (cases 2).

EnKF schemes | Ensemble size 30 | Ensemble size 20 | ||
---|---|---|---|---|

Time-mean RMSE | Time-mean L | Time-mean RMSE | Time-mean L | |

SLS | 2.43 | 1,426,541 | 3.51 | 1,492,685 |

SLS and new structure | 1.35 | 41,326 | 1.45 | 95,685 |

SLS and true ensemble forecast error | 0.58 | 21,585 | 0.60 | 21,355 |

The estimated

To further investigate the effect of ensemble size on the assimilation result, Figure 3 is reproduced with the ensemble size 20. The results are shown in Figure 5, as well as in Table 2. Generally speaking, Figures 5 is quite similar to Figure 3 but with larger analysis error. This indicates that the smaller ensemble size can lead to the larger forecast error and analysis error. The analysis is also repeated with the ensemble size 10. However in this case, both the inflation and new structure are not effective. This could be due to that the ensemble size 10 is too small to generate robust covariance estimation.

## 4. Discussion and conclusions

It is well-known that accurately estimating the error covariance matrix is one of the most key steps in any ensemble-based data assimilation. In EnKF assimilation scheme, the forecast error covariance matrix is initially estimated as the sampling covariance matrix of the ensemble forecast states. But due to limited ensemble size and model error, the forecast error covariance matrix is usually an underestimation, which may lead to the divergence of the filter. Therefore, the initially estimated forecast error covariance matrix is multiplied by an inflation factor

In fact, the true forecast error should be represented as the ensemble forecast states minus the true state. However, since in real problems, the true state is not available, the ensemble mean of the forecast states is used instead. Consequently, the forecast error covariance matrix is initially represented as the sampling covariance matrix of the ensemble forecast states. However, for the model with large error, the ensemble mean of the forecast states may be far from the true state. In this case, the estimated forecast error covariance matrix will also remain far from the truth, no matter which inflation technique is used.

To verify this point, a number of EnKF assimilation schemes with necessary error covariance matrix inflation are applied to the Lorenz-96 model but with the forecast state

However, since the true state

In the realistic problems, the observational error covariance matrix is not always correctly known, and hence it also needs to be adjusted too. Another factor

The second-order least squares function of the squared observation-minus-forecast residual can be a good objective function to quantify the goodness of fit of the error covariance matrix. The SLS method proposed in this chapter can be used to estimate the factors for adjusting both the forecast and observational error covariance matrices, while the first order method can only estimate the inflation factor of the forecast error covariance matrix. The SLS can also provide a criterion for stopping the iteration in the adaptive estimation procedure when the new structure of the forecast error covariance matrix is used. This is important for preventing the proposed forecast error covariance matrix to depart from the truth in the iteration. In most cases in this study, the minimization algorithms converge after several iterations, and the objective function decreases sharply. On the other hand, the improved forecast error covariance matrix indeed leads to the improvement of analysis state. In fact, as shown in Tables 1**-**2, a small objective function value always corresponds to a small RMSE of the analysis state.

The difference of the EnKF assimilation scheme with SLS inflation is compared to that with maximum likelihood estimation (MLE) inflation [21]. Generally speaking, the RMSE of the analysis state derived using the MLE inflation scheme is a little smaller than that derived using the SLS inflation scheme only but is larger than that derived using the SLS inflation with the new structure of forecast error covariance matrix. For instance, for Lorenz-96 model with forcing term *F* = 12, the RMSE is 1.69 for MLE inflation, 1.89 for SLS inflation only, and 1.22 for SLS inflation and new structure (Table 1). Whether this is a general rule or not is still unclear and is subject to further investigation. However, in MLE inflation scheme, the objective function is nonlinear and especially involves the determinant of the observation-minus-forecast residual’s covariance matrix, which is quite computationally expensive. The SLS objective function proposed in this chapter is quadratic, so its minimizer is analytic and can be easily calculated.

On the other hand, similar to other inflation schemes with single factor, this study also assumes the inflation factor to be constant in space. Apparently, this is not the case in many practical applications, especially for the cases that the observations are unevenly distributed. Persistently applying the same inflation values that are reasonably large to address problems in densely observed areas to all state variables can systematically overinflate the ensemble variances in sparsely observed areas [13, 26, 42]. Even when the adaptive procedure for estimating the error covariance matrix is applied, the problem may still exist in some extent. In the two case studies conducted here, the observational systems are relatively evenly distributed.

In the future study, we will investigate how to modify the adaptive procedure to suit the system with unevenly distributed observations. We also plan to apply our methodology to error covariance localization [43, 44] and to validate the proposed methodologies using more sophisticated dynamic and observational systems.

## Acknowledgments

This work is supported by the National Natural Science Foundation of China (grant no. 91647202), the National Basic Research Program of China (grant no. 2015CB953703), the National Natural Science Foundation of China (grant no. 41405098) and the Fundamental Research Funds for the Central Universities. The authors gratefully acknowledge the anonymous reviewers for their constructive and relevant comments, which helped greatly in improving the quality of this manuscript. The authors are also grateful to the editors for their hard work and suggestions on this manuscript. Parts of this chapter are reproduced from the authors’ previous publications [29, 30].

The forecast error covariance matrix

The covariance matrix of the random vector

where *E* is the expectation operator and

Since the forecast and observational errors are statistically independent, we have

From Eq. (2), *i*-th time step, and hence

Further, since the forecast state

Substituting Eqs (A3)–(A7) into Eq. (A2), we have

It follows that the second-order moment statistic of error

Therefore,

Similarly, if the amplitude of the observational error covariance matrix is not correct, we can adjust

As a bivariate function of

and

Setting Eqs (A12)–(A13) to zero and solving them lead to

In fact,

Since

and similarly.

That is, the last two terms of Eq. (B1) vanish. Therefore, the proposed forecast error covariance matrix can be expressed as